Open cayolblake opened 3 years ago
Hi @cayolblake - welcome 👋. Are you using the Python library (which uses our HTML simplification code) or calling from the command line with default options (which uses Mozilla's Readability.js package to simplify the HTML)? Thanks.
Hi @martintoreilly
I'm using the Python library :)
@cayolblake I'm afraid that, if this isn't fixed by updating to the latest version of Mozilla's Readability.js, then we won't have the bandwidth to look into it anytime soon. Sensibly extracting images and videos using our Python-based HTML simplifier is something we've talked about supporting before, but until we're next working on a project that's parsing web articles, we'll struggle to carve out time to work on this further.
I think for your use case, adding support in our Python HTML simplifier won't be enough: we're not currently as good as Readability.js at stripping out non-content elements, so I think it would not be suitable for you even if it did keep image and video tags.
I'm tagging this with a future tag rather than closing it to keep it visible for when we're next working on this.
Linked to issue #31, which considers iframe handling more generally.
Hi @martintoreilly
That's perfectly understood. I'm planning to dive into your project and understand how it works. Any docs that help explain or simplify things further would be appreciated. Hopefully, after doing so, I'll be able to find the best place to apply a modification, if possible.
I think the main strength of Readability.js is that it gets battle-tested daily by Firefox users everywhere, which gives it the chance to improve its heuristic algorithms as it goes.
Have you thought about splitting your own simplifier and the Readability wrapper into two different projects? I guess that could bring a healthier focus to your own simplifier while still having something that works dependably on its own, and you could maybe use it as a reference or a benchmark. Just a humble thought 🤔
Hello,
Is there a way to allow extracting YouTube video and iframe tags similar to how image extraction is done?
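To make the request concrete, here is a rough sketch of the kind of extraction being asked about. It uses only Python's standard `html.parser` module and is not this project's API: `MediaCollector`, `extract_media`, and the `MEDIA_TAGS` set are hypothetical names chosen for illustration, and the choice of tags is an assumption.

```python
from html.parser import HTMLParser

# Tags treated as embedded media in this sketch (an assumption,
# not a list taken from the project or from Readability.js).
MEDIA_TAGS = {"img", "iframe", "video", "source"}


class MediaCollector(HTMLParser):
    """Collect (tag, src) pairs for media elements that carry a src attribute."""

    def __init__(self):
        super().__init__()
        self.media = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.media.append((tag, src))


def extract_media(html):
    """Return the media references found in an HTML string, in document order."""
    parser = MediaCollector()
    parser.feed(html)
    parser.close()
    return parser.media
```

Calling `extract_media(page_html)` on a page with a YouTube embed would return the iframe's `src` alongside any image sources. It does nothing to separate article content from navigation or ads, which is the harder part discussed above.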