extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.6k stars 140 forks

Can I use it with UTF-8? #371

Closed triay0 closed 1 year ago

triay0 commented 1 year ago

Thanks for this amazing package. Would it be possible to get the content in UTF-8? Spanish accents are not recognized.

ndaidong commented 1 year ago

@triay0 this website does not use UTF-8; it uses another charset.

To get the correct characters from such pages, you can fetch the HTML yourself and decode it with the page's charset before passing it to article-extractor's extractFromHtml, as below:

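  import { extractFromHtml } from '@extractus/article-extractor'

  // Fetch the raw bytes; res.text() would always decode them as UTF-8
  const res = await fetch(url)
  const buffer = await res.arrayBuffer()
  // Decode with the charset the page actually uses (ISO-8859-1 in this case)
  const decoder = new TextDecoder('iso-8859-1')
  const html = decoder.decode(buffer)

  const art = await extractFromHtml(html)
  console.log(art)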
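
The snippet above hardcodes the charset. If the server declares it in the Content-Type response header, the label could be read from there instead. The following is only a rough sketch under that assumption: `charsetFromContentType` and `decodeBody` are hypothetical helper names, not part of article-extractor, and pages that declare their charset only in a `<meta>` tag are not handled.

```javascript
// Hypothetical helper: pull the charset label out of a Content-Type header value.
// Falls back to 'utf-8' when the header is missing or has no charset parameter.
function charsetFromContentType(contentType, fallback = 'utf-8') {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '')
  return match ? match[1].toLowerCase() : fallback
}

// Hypothetical helper: decode raw bytes using the declared charset.
function decodeBody(buffer, contentType) {
  const label = charsetFromContentType(contentType)
  try {
    return new TextDecoder(label).decode(buffer)
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognize
    return new TextDecoder('utf-8').decode(buffer)
  }
}
```

With these in place, the fetch step would become `const html = decodeBody(await res.arrayBuffer(), res.headers.get('content-type'))`, and the result can be passed to extractFromHtml as before.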
