adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Trafilatura crashing due to `options` variable not backfilled yet #705

Closed: rgeronimi closed this issue 1 month ago

rgeronimi commented 1 month ago

I have built a crawler using Trafilatura through its bare_extraction function.

On rare URLs, the crawler triggers a Trafilatura extraction error.

However, this extraction error in turn crashes Trafilatura: the error handler tries to log the `options.source` field, but `options` is not set yet if the function was called with no `options` argument:

https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L247

This happens because of the ordering inside this try...except block. The code that backfills the `options` variable: https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L156

runs after the `load_html` call that raised the initial extraction error: https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L150

The solution is to swap the order of these two blocks, so that the `options` variable is backfilled before `load_html` is called and can raise its error.
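To illustrate the ordering problem, here is a minimal sketch. It is not the actual Trafilatura source: `Options`, `load_html_stub`, `extract_buggy` and `extract_fixed` are hypothetical stand-ins that only mimic the structure described above, assuming the handler fails with an `AttributeError` because `options` is still `None`.

```python
import logging
from dataclasses import dataclass
from typing import Optional

LOGGER = logging.getLogger(__name__)


@dataclass
class Options:
    """Hypothetical stand-in for the extraction options object."""
    source: Optional[str] = None


def load_html_stub(content):
    """Hypothetical parser: returns None when the input cannot be parsed."""
    return content if content else None


def extract_buggy(content, url=None, options=None):
    """Backfills options only after the call that may fail."""
    try:
        tree = load_html_stub(content)
        if tree is None:
            raise ValueError("empty HTML tree")
        # options is backfilled too late: the error above skips this block
        if options is None:
            options = Options(source=url)
        return {"text": tree, "source": options.source}
    except ValueError:
        # options may still be None here, so this raises AttributeError
        LOGGER.warning("discarding data: %s", options.source)
        return None


def extract_fixed(content, url=None, options=None):
    """Backfills options first, so the error handler can always log it."""
    try:
        if options is None:
            options = Options(source=url)
        tree = load_html_stub(content)
        if tree is None:
            raise ValueError("empty HTML tree")
        return {"text": tree, "source": options.source}
    except ValueError:
        LOGGER.warning("discarding data: %s", options.source)
        return None
```

With an empty input and no `options` argument, `extract_buggy` fails inside its own error handler, while `extract_fixed` logs the warning and returns `None` as intended.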

adbar commented 1 month ago

@dmoklaf Thanks for the detailed report! I think you are right about the solution. Could you please draft a pull request?