Closed by martinrotter 2 months ago
Btw, the issue can be fixed by using node-fetch together with fetch-charset-detection (https://www.npmjs.com/package/fetch-charset-detection).
@martinrotter thanks for the report and the suggestion. Let me see if we can use that lib to resolve this issue.
@martinrotter your example helped me a lot. I've also checked how fetch-charset-detection and its main dependency iconv-lite work, and found that we don't need them at all. What we should do is detect the charset (from the response header or a meta tag) and then use the native TextDecoder to decode the content. Something like below should work as expected:
import { extractFromHtml } from '@extractus/article-extractor'
const url = 'https://www.idnes.cz/ceske-budejovice/zpravy/chata-deti-hluboka-matka-straznici-myly-potok-ospod-les.A240422_135654_budejovice-zpravy_khr'
// fetch the raw bytes instead of letting res.text() assume UTF-8
const res = await fetch(url)
const buffer = await res.arrayBuffer()
// charset is hard-coded here; it should come from the response header or a meta tag
const decoder = new TextDecoder('windows-1250')
const html = decoder.decode(buffer)
// hand the correctly decoded HTML to the extractor
const article = await extractFromHtml(html, url)
console.log(article)
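The detection step mentioned above (response header first, then meta tag) could be sketched roughly as follows. This is only a minimal illustration, not the library's actual implementation: the helper names `charsetFromContentType`, `charsetFromMeta` and `decodeHtml` are hypothetical, and the 1024-byte scan window and the UTF-8 fallback are assumptions.

```javascript
// Hypothetical helper: read the charset parameter from a Content-Type value
const charsetFromContentType = (contentType) => {
  const match = /charset\s*=\s*["']?([^"';\s]+)/i.exec(contentType || '')
  return match ? match[1].toLowerCase() : null
}

// Hypothetical helper: look for a <meta> charset declaration near the top of the body
const charsetFromMeta = (buffer) => {
  // decode the head with a single-byte encoding so ASCII markup stays readable
  // whatever the real charset is (1024 bytes is an assumed scan window)
  const head = new TextDecoder('latin1').decode(buffer.slice(0, 1024))
  const match = /<meta[^>]+charset\s*=\s*["']?([^"'\s/>]+)/i.exec(head)
  return match ? match[1].toLowerCase() : null
}

// Decode the raw bytes with the detected charset, assuming UTF-8 when nothing is declared
const decodeHtml = (buffer, contentType) => {
  const label = charsetFromContentType(contentType) || charsetFromMeta(buffer) || 'utf-8'
  try {
    return new TextDecoder(label).decode(buffer)
  } catch {
    // TextDecoder throws a RangeError for labels it does not recognise
    return new TextDecoder('utf-8').decode(buffer)
  }
}

// Possible usage with the snippet above:
//   const res = await fetch(url)
//   const html = decodeHtml(await res.arrayBuffer(), res.headers.get('content-type'))
//   const article = await extractFromHtml(html, url)
```

Checking the header before the meta tag mirrors the order given above; a page whose header and meta tag disagree would need a more careful policy than this sketch has.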
I'm trying to improve the logic here.
Sample URL: https://www.idnes.cz/ceske-budejovice/zpravy/chata-deti-hluboka-matka-straznici-myly-potok-ospod-les.A240422_135654_budejovice-zpravy_khr#utm_source=rss&utm_medium=feed&utm_campaign=zpravodaj&utm_content=main
The demo at https://extractor-demos.pages.dev/article-extractor returns content with malformed characters; Czech characters are completely broken.
A self-deployed instance of the latest version (8.0.7) of your software has the same problem.
The website www.idnes.cz serves its HTML with this content-type: text/html; charset=windows-1250.
Here is the live data returned directly by the "extract" method:
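A small illustration of why that content-type matters. It assumes (not verified here) that the bytes are being decoded as UTF-8 somewhere along the way, which is one likely cause of this kind of breakage. The byte values are the windows-1250 encoding of the word "Český", written out by hand for the example.

```javascript
// Bytes of the word "Český" as windows-1250 encodes it
const bytes = new Uint8Array([0xc8, 0x65, 0x73, 0x6b, 0xfd])

// Decoding as UTF-8 turns the non-ASCII bytes into U+FFFD replacement characters
const asUtf8 = new TextDecoder('utf-8').decode(bytes)

// Decoding with the charset the server declares recovers the original text
const asCp1250 = new TextDecoder('windows-1250').decode(bytes)

console.log(asUtf8)
console.log(asCp1250)
```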