extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.56k stars, 134 forks

Can I use it with UTF-8? #371

Closed triay0 closed 11 months ago

triay0 commented 11 months ago

Thanks for this amazing package. Would it be possible to get the content as UTF-8? Spanish accents are not recognized.

ndaidong commented 11 months ago

@triay0 this website doesn't use UTF-8; it uses another charset.

To get the correct characters from such pages, you can fetch the HTML yourself and decode it with the page's charset before passing it to article-extractor's extractFromHtml, as below:

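  import { extractFromHtml } from '@extractus/article-extractor'

  // `url` is the address of the page to extract
  const res = await fetch(url)
  // Read the raw bytes so the body is not decoded as UTF-8 by default
  const buffer = await res.arrayBuffer()
  // Decode with the charset the page actually uses (ISO-8859-1 here)
  const decoder = new TextDecoder('iso-8859-1')
  const html = decoder.decode(buffer)

  const art = await extractFromHtml(html)
  console.log(art)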
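
If the charset is not known in advance, one option is to read it from the response instead of hard-coding 'iso-8859-1'. The following is only a rough sketch, not part of article-extractor: `detectCharset` and `decodeHtml` are hypothetical helper names, and it assumes the page declares its charset either in the Content-Type header or in a `<meta>` tag within the first 1024 bytes, falling back to UTF-8 otherwise.

```javascript
// Hypothetical helper (not part of article-extractor): pick a charset label
// from the Content-Type header, then from a <meta> tag, else assume UTF-8.
function detectCharset(contentType, bytes) {
  const fromHeader = /charset\s*=\s*["']?([^"';\s]+)/i.exec(contentType || '')
  if (fromHeader) return fromHeader[1].toLowerCase()
  // ASCII-compatible encodings keep the <meta> tag readable when read as latin1.
  const head = new TextDecoder('latin1').decode(bytes.slice(0, 1024))
  const fromMeta = /<meta[^>]+charset\s*=\s*["']?([^"'\s/>]+)/i.exec(head)
  return fromMeta ? fromMeta[1].toLowerCase() : 'utf-8'
}

// Hypothetical helper: decode raw HTML bytes using the detected charset.
function decodeHtml(bytes, contentType) {
  const label = detectCharset(contentType, bytes)
  try {
    return new TextDecoder(label).decode(bytes)
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognize.
    return new TextDecoder('utf-8').decode(bytes)
  }
}
```

With the snippet above, this would replace the fixed decoder: `const html = decodeHtml(new Uint8Array(buffer), res.headers.get('content-type'))`.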
