alan-turing-institute / ReadabiliPy

A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.
MIT License
230 stars 36 forks

How to allow extracting YouTube videos or <iframe> tags? #93

Open cayolblake opened 3 years ago

cayolblake commented 3 years ago

Hello,

Is there a way to extract YouTube videos and iframe tags, similar to how image extraction is done?

martintoreilly commented 3 years ago

Hi @cayolblake - welcome 👋 . Are you using the Python library (uses our HTML simplification code) or calling from the command line with default options (uses Mozilla's Readability.js package to simplify the HTML)? Thanks.

cayolblake commented 3 years ago

Hi @martintoreilly

I'm using the Python library :)

martintoreilly commented 3 years ago

@cayolblake I'm afraid that, if this isn't fixed by updating to the latest version of Mozilla's Readability.js, then we won't have the bandwidth to look into it anytime soon. Sensibly extracting images and videos using our Python-based HTML simplifier is something we've talked about supporting before, but until we're next working on a project that's parsing web articles, we'll struggle to carve out time to work on this further.

I think for your use case, adding support in our Python HTML simplifier won't be enough: we're not currently as good as Readability.js at stripping out non-content elements, so I think it would not be suitable for you even if it did contain image and video tags.

I'm tagging this with a future tag rather than closing it to keep it visible for when we're next working on this.
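For anyone who needs iframe sources in the meantime, one workaround is to collect them from the raw HTML separately, before or alongside simplification. The following is a minimal sketch using only the Python standard library; it is not part of ReadabiliPy, and the names `IframeCollector`, `extract_iframes` and `YOUTUBE_HOSTS` are hypothetical, as is the (non-exhaustive) list of YouTube hosts.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hosts treated as YouTube embeds. This list is an assumption for the sketch,
# not an exhaustive or authoritative set.
YOUTUBE_HOSTS = {
    "www.youtube.com",
    "youtube.com",
    "www.youtube-nocookie.com",
    "youtu.be",
}


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe>, in document order."""

    def __init__(self):
        super().__init__()
        self.iframes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:  # skip iframes with a missing or empty src
            self.iframes.append(src)


def extract_iframes(html):
    """Return one dict per iframe: its src and whether the host looks like YouTube."""
    collector = IframeCollector()
    collector.feed(html)
    collector.close()
    return [
        {"src": src, "is_youtube": urlparse(src).hostname in YOUTUBE_HOSTS}
        for src in collector.iframes
    ]
```

Calling `extract_iframes(html)` on the original page gives a list that can be stored next to the simplified article output. Note that it does not tell you where in the article each iframe appeared, and it does nothing to filter out iframes in non-content regions such as ads.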

martintoreilly commented 3 years ago

Linked to issue #31, which considers iframe handling more generally.

cayolblake commented 3 years ago

Hi @martintoreilly

That's perfectly understood. I'm planning to take a dive into your project and understand how it works - any docs that can help explain or simplify things further would be appreciated - and hopefully after doing so I'll be able to find the best place to apply a modification, if possible.

I think Readability.js's main strength is that it gets battle-tested daily by Firefox users everywhere, which gives its heuristic algorithms the chance to improve as it goes.

Have you thought about splitting your own simplifier and the Readability.js wrapper into two different projects? I guess that could bring a healthier focus to your own simplifier, while still leaving something that works dependably on its own and that you could maybe use as a reference or a benchmark. Just a humble thought 🤔