adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Importing only the extract utilities #41

Closed: Yomguithereal closed this issue 3 years ago

Yomguithereal commented 3 years ago

Hello @adbar, thanks for your tremendous work on the library. Do you know if there is a way to install and then import the library so that it only loads the utilities related to raw content extraction from an HTML string? If not, is there any way we can discuss this particular topic and see whether I could help you implement it? My use case is basically the following: I have a CLI tool that currently relies on dragnet, and I would like to jump ship and adopt trafilatura. My issue is that I don't want to install the network-related dependencies you list in your setup.py (notably requests and tldextract), because they will clash with some of my dependencies and I have my own means of downloading things, dealing with URLs, etc.

Have a good day,

adbar commented 3 years ago

Hi, thanks for your interest. There is no easy way to specify a subset of dependencies in setup.py, and I know quite a few users who rely on the integrated download function, so I'm not keen on making it optional. If I understand properly, you'd only use the bare_extraction function? Here are the options I see:

What do you think?

Yomguithereal commented 3 years ago

Hello @adbar, thanks for the answer.

> If I understand properly, you'd only use the bare_extraction function?

Exactly, I only need those core utilities.
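
To make the use case concrete, here is a minimal sketch of that scenario, assuming the HTML has already been fetched by my own download code (the document and URL below are made up):

```python
from trafilatura import bare_extraction

# HTML fetched elsewhere, e.g. through minet's own download pipeline
html = (
    "<html><body><article>"
    "<p>Some article text worth keeping, long enough to be extracted.</p>"
    "</article></body></html>"
)

# bare_extraction works on the HTML string alone, so no network-related
# code path is needed here; in the versions discussed in this thread it
# returns a dict of extracted text and metadata (or None on failure).
result = bare_extraction(html, url="https://example.org/article")
if result is not None:
    print(result["text"])
```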

> I'm also thinking about replacing tldextract, maybe the solution you use is open-source and more efficient? In that case I'd like to know more about it ;)

I am currently relying on the tld Python library, which itself relies on this public-facing list of TLDs (but you can also override it if required). Then I have my own routines similar to courlan there.
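
For comparison, getting the registered domain of a URL looks roughly like this with each library (a minimal sketch, not code from either project):

```python
import tldextract
from tld import get_fld

url = "https://sub.example.co.uk/some/page"

# tldextract splits the hostname into subdomain / domain / suffix
parts = tldextract.extract(url)
print(parts.registered_domain)  # example.co.uk

# tld goes straight to the first-level (registered) domain
print(get_fld(url))             # example.co.uk
```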

The only way I can think of to keep the current version non-breaking for its current users, while not requiring complex installation schemes, is unfortunately to create a separate PyPI package, like trafilatura-extract for instance, that this package would depend on. I can see how doing that would be bothersome for you, though. A sketch of what such a split could look like follows below.
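
To make the idea concrete, trafilatura's side of the split could look roughly like this (everything here is hypothetical, including the trafilatura-extract package, which does not exist):

```python
# Hypothetical sketch of trafilatura's setup.py after such a split: the
# extraction core would live in a separate "trafilatura-extract" package
# that the full package depends on, next to the network-related dependencies.
from setuptools import setup

setup(
    name="trafilatura",
    install_requires=[
        "trafilatura-extract",  # hypothetical core package: extraction only, no network code
        "requests",             # download utilities
        "tldextract",           # URL/TLD handling
    ],
)
```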

For reference, the CLI tool I am talking about is minet, developed by my research lab: https://github.com/medialab/minet

adbar commented 3 years ago

Hi, thanks for the links, it seems like we're working on similar topics, nice!

Replacing tldextract shouldn't be an issue here, tld seems to be a good candidate. I had it on my list anyway.

I think I could replace requests as well or at least make it optional by providing another download function. What is the dependency clash you face? And could you please point me to the place in your code where you perform the downloads?

Yomguithereal commented 3 years ago

> Replacing tldextract shouldn't be an issue here, tld seems to be a good candidate. I had it on my list anyway.

What are the issues you are having with tldextract, just for reference? On a side note, do we agree that the core extraction routines don't require any knowledge of TLD data?

> What is the dependency clash you face?

Clash is maybe too strong a word. I don't think installing your lib would break the installation of other dependencies I might have. My issue is more down the line, when I finally compile the tool as a standalone executable using pyinstaller or pyoxidizer: I cannot easily tell them not to package dependencies that will not actually be used, and this translates into longer start-up times and larger binary sizes.

> And could you please point me to the place in your code where you perform the downloads?

I currently use urllib3 (which requests uses under the hood) but cannot rely on requests itself for performance and multithreading reasons. I will probably switch to pycurl in the future, though. I cannot point you towards the specific code because the codebase is quite large and the related code is not a single function, but most of it can be found here and here.
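
As an illustration of that kind of setup (a rough sketch, not minet's actual code), a single PoolManager shared across worker threads is what makes urllib3 attractive here:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib3

# One PoolManager shared by every thread; urllib3 pools connections per host
# and is safe to use concurrently.
http = urllib3.PoolManager(maxsize=10)

def fetch(url):
    response = http.request("GET", url, timeout=10.0)
    return url, response.status, response.data

urls = ["https://example.org/", "https://example.com/"]
with ThreadPoolExecutor(max_workers=4) as executor:
    for url, status, data in executor.map(fetch, urls):
        print(url, status, len(data))
```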

What I could do, as a first step, is switch from dragnet to trafilatura as is, see how much cruft that entails regarding packaging/compilation, and come back if there are "real" issues I cannot circumvent.

adbar commented 3 years ago

There is no real issue with tldextract, but your original question is totally relevant: it's generally a good idea to reduce the total number of package dependencies. Since tld is self-sufficient, it would be a good replacement.

I agree, the performance of requests is suboptimal. I'd recommend pycurl, but I didn't ship it with trafilatura since its dependence on libcurl makes it less portable and harder to install.
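
For reference, a bare-bones pycurl fetch looks roughly like this (a sketch only; the need for a working libcurl build is exactly the portability concern mentioned above):

```python
from io import BytesIO
import pycurl

buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "https://example.org/")
curl.setopt(pycurl.WRITEDATA, buffer)     # libcurl writes the body into the buffer
curl.setopt(pycurl.FOLLOWLOCATION, True)  # follow redirects
curl.perform()
status = curl.getinfo(pycurl.RESPONSE_CODE)
curl.close()

html = buffer.getvalue().decode("utf-8", errors="replace")
print(status, len(html))
```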

OK, please try to use the package as is; I'm ready to help if any packaging or compilation issues arise.

adbar commented 3 years ago

The download utility wasn't so complex anyway; I adapted it to use urllib3 directly in 273a319c1199d32f964d5402195dcb64d34a415f, so requests will be dropped from the setup once the next version is released. I had to keep chardet, though (with support for cchardet), just so you know.
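
In outline, a urllib3-based fetch with a chardet fallback can look like this (an illustrative sketch, not the code from that commit):

```python
import urllib3

try:
    import cchardet as chardet  # faster drop-in replacement, if installed
except ImportError:
    import chardet

HTTP_POOL = urllib3.PoolManager(retries=urllib3.util.Retry(total=2, redirect=2))

def fetch_url(url):
    """Download a page and decode it, guessing the encoding if it is not UTF-8."""
    response = HTTP_POOL.request("GET", url, timeout=30.0)
    if response.status != 200:
        return None
    try:
        return response.data.decode("utf-8")
    except UnicodeDecodeError:
        guess = chardet.detect(response.data)
        return response.data.decode(guess["encoding"] or "utf-8", errors="replace")
```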

Since tld only supports Python 3.6 upwards, I'm going to wait a little before replacing tldextract with it; just tell me if you run into problems.

Yomguithereal commented 3 years ago

Hello @adbar. I also use cchardet on my end anyway.

> Since tld only supports Python 3.6 upwards, I'm going to wait a little before replacing tldextract with it; just tell me if you run into problems.

No problem so far, apart from needing to load a suffix list twice (once for tld and once for tldextract) in my pyinstaller compilation, but nothing critical.

It's weird that it doesn't handle < 3.6 while supporting 2.7, though :)

adbar commented 3 years ago

The new package versions are out; requests has been dropped.