extractus / article-extractor

To extract main article from given URL with Node.js
https://extractor-demos.pages.dev/article-extractor
MIT License
1.46k stars, 132 forks

Some URLs do not work #365

Closed onigetoc closed 9 months ago

onigetoc commented 10 months ago

I don't know why, but this URL never works with a lot of parsers. Even some Chrome extensions fail to parse it, although others succeed. I am trying to find out what is particular about this URL that makes it so hard to parse and extract.

https://www.journaldequebec.com/2023/08/05/information-boycottee-par-facebook--voici-la-solution

I may create a list here and update it when I find new URLs that do not work.

ndaidong commented 10 months ago

@onigetoc I received a 403 status code from journaldequebec. This link is not accessible from my network. Maybe it is geo-blocked?

[Screenshot from 2023-09-03 10-20-48]
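To narrow down a 403 like the one reported above, it can help to check the raw HTTP status before involving any parser. The following is a minimal sketch, not part of article-extractor: `probe`, `statusFor`, and `BROWSER_HEADERS` are hypothetical names, the header values are illustrative assumptions, and the hints it returns are rough guesses rather than a definitive diagnosis. It assumes Node.js 18+ for the global `fetch`.

```javascript
// Hypothetical diagnostic helper; not part of article-extractor.
// Assumes Node.js 18+ (global fetch).

// Illustrative browser-like headers (assumed values, adjust as needed).
const BROWSER_HEADERS = {
  'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/117.0',
  'accept': 'text/html,application/xhtml+xml',
  'accept-language': 'fr-CA,fr;q=0.9,en;q=0.8',
};

// Fetch the URL once and return only the HTTP status code.
async function statusFor(url, headers, fetchImpl) {
  const res = await fetchImpl(url, { headers, redirect: 'follow' });
  return res.status;
}

// Compare a bare request with a browser-like one and suggest a likely cause.
// fetchImpl is injectable so the logic can be exercised without network access.
async function probe(url, fetchImpl = fetch) {
  const bare = await statusFor(url, {}, fetchImpl);
  const browserLike = await statusFor(url, BROWSER_HEADERS, fetchImpl);

  let hint;
  if (bare < 400) {
    hint = 'reachable: the problem is probably in parsing, not fetching';
  } else if (browserLike < 400) {
    hint = 'blocked without browser headers: likely bot filtering';
  } else {
    hint = 'blocked either way: possibly geo/IP blocking or stricter bot protection';
  }
  return { bare, browserLike, hint };
}
```

Calling `probe(url).then(console.log)` with the failing URL from two different networks would show whether the status depends on location, on request headers, or on neither.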

onigetoc commented 10 months ago

OK, but I tried it from VS Code, and both I and my internet provider are in this province: Québec, Canada. I also tried a PHP parser to grab the articles and their info, and it does not work either. I understand that it will be hard for you to really test it if it is also geo-blocked. It is one of the biggest news media outlets here, with millions of users/visitors.