Extractor problem - Githubissues

PeterDaveHello commented 3 years ago

Hi there,

With all due respect, fullyfeedly seems to be an awesome browser extension: it helps save time and keeps the focus on the content I'm interested in, with less switching between browser tabs!

I just noticed an issue: Mercury, the recommended default extractor, isn't powerful enough to work on many websites, and it looks like it needs many custom extractors/parsers to deal with different sites. The non-default Boilerpipe is very powerful, but besides the limited quota mentioned in the README.md, I also found that requests from fullyfeedly to the Boilerpipe web service fail with CORS errors, which means it's not working right now. Taken together, fullyfeedly only works fully on a limited set of websites.

I'm not sure whether it's just a coincidence that the websites I frequently visit can't be parsed properly by Mercury, but I did compare the extracted results with Boilerpipe's: Boilerpipe works much better, while Mercury sometimes extracts nothing but meaningless HTML tags.

For the first part, I guess I can only write custom extractors and send pull requests to Mercury, but that could be really time-consuming and doesn't scale well.

For the second part: I've opened an issue at https://github.com/kohlschutter/boilerpipe/issues/28, and if anyone is looking for a workaround, here is one: https://add0n.com/access-control.html (CORS Unblock).

I'm not sure if there is anything we can do to help with this issue. Would hosting a separate Boilerpipe web service be a viable option? Or would it be better to find some alternatives?

Thanks a lot!

Muffo commented 3 years ago

Thanks for raising this issue, I was not aware of the problem with boilerpipe. I suspect that by deploying this code on a different cloud service we will run into the same quota limits, unless we decide to pay for increased capacity. On the other hand, I am happy to consider other free APIs in case there are new solutions released after I created this extension.

PeterDaveHello commented 3 years ago

What about adding a built-in parser as another choice? For example: https://github.com/ndaidong/article-parser, https://github.com/Tjatse/node-readability & https://github.com/mozilla/readability. Also, an option to set a custom Mercury/Boilerpipe service URL for a self-hosted service could help work around the quota issue. (I don't know yet whether it's easy to set one up.)
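The custom service URL option could be quite small. Here is a rough sketch in TypeScript of how a request URL might be built from such a setting. Everything in it is an assumption for illustration: the `ExtractorSettings` shape, the `customServiceUrl` field, the placeholder default endpoints, and the `url` query parameter are hypothetical and do not come from fullyfeedly, Mercury, or Boilerpipe.

```typescript
// Hypothetical settings shape; fullyfeedly's real option names may differ.
interface ExtractorSettings {
  extractor: "mercury" | "boilerpipe";
  // Optional user-supplied base URL of a self-hosted service.
  customServiceUrl?: string;
}

// Placeholder defaults for illustration only, NOT the real service endpoints.
const DEFAULT_SERVICE_URLS: Record<ExtractorSettings["extractor"], string> = {
  mercury: "https://mercury.example.invalid/parser",
  boilerpipe: "https://boilerpipe.example.invalid/extract",
};

// Build the request URL, preferring the user's self-hosted endpoint when set.
// The "url" query parameter name is an assumption, not a documented API.
function buildRequestUrl(settings: ExtractorSettings, articleUrl: string): string {
  const custom = (settings.customServiceUrl || "").trim();
  const base = custom !== "" ? custom : DEFAULT_SERVICE_URLS[settings.extractor];
  // Append with "&" if the base URL already carries a query string.
  const separator = base.indexOf("?") >= 0 ? "&" : "?";
  return base + separator + "url=" + encodeURIComponent(articleUrl);
}
```

With this shape, leaving the custom URL empty keeps the current behaviour, so existing users would not notice the new option.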

Muffo commented 3 years ago

That's another possibility, but this change would significantly affect the permissions of this extension. See my comment on https://github.com/Muffo/fullyfeedly/pull/38.

If we really want to use a built-in parser, I would prefer to create a new extension and avoid disruption for existing users.

PeterDaveHello commented 3 years ago

It looks like the boilerpipe GitHub repo is no longer maintained: the issues and pull requests are stale, and the web version has been returning 500 errors for a while. I also saw a bunch of tweets asking about boilerpipe that got no response. Maybe we should just consider removing it for now?

Muffo commented 3 years ago

Thanks for the input! I haven't checked boilerpipe in a while, but I agree it could be removed if it's not working properly.

When we do that, we should make sure users are automatically/transparently moved to a different extractor.
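A transparent migration like this could run once when the extension loads its saved options. The TypeScript sketch below is only an illustration under assumptions: the `StoredOptions` shape, the extractor ids, and the choice of Mercury as the fallback are hypothetical, and the real storage keys in fullyfeedly may differ.

```typescript
// Hypothetical stored-options shape; the real storage keys may differ.
interface StoredOptions {
  extractor?: string;
}

// Assumed ids: extractors that were removed, and the one users are moved to.
const REMOVED_EXTRACTORS: readonly string[] = ["boilerpipe"];
const FALLBACK_EXTRACTOR = "mercury";

// Return options that point at a supported extractor, plus a flag telling
// the caller whether a previously saved choice had to be replaced.
function migrateOptions(
  stored: StoredOptions
): { options: StoredOptions; migrated: boolean } {
  const current = stored.extractor;
  if (current === undefined) {
    // Nothing saved yet: apply the default, but this is not a migration.
    return { options: { ...stored, extractor: FALLBACK_EXTRACTOR }, migrated: false };
  }
  if (REMOVED_EXTRACTORS.indexOf(current) !== -1) {
    // Saved choice no longer exists: move the user to the fallback.
    return { options: { ...stored, extractor: FALLBACK_EXTRACTOR }, migrated: true };
  }
  // Saved choice is still supported: leave it untouched.
  return { options: stored, migrated: false };
}
```

The `migrated` flag lets the caller decide whether to write the new value back to storage, or to show a one-time notice, without the user having to open the options page.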

Muffo / fullyfeedly

Extractor problem #54