I have built a crawler using Trafilatura through its `bare_extraction` function.

On rare URLs, the crawler triggers a Trafilatura extraction error. However, this extraction error triggers a Trafilatura crash, as it tries to log the `options.source` field, which is not set yet if this function is called with no `options` argument: https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L247

This is because, in this `try...except` block, the `options` variable backfilling code (https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L156) is called after the `load_html` call that had the initial extraction error (https://github.com/adbar/trafilatura/blob/f57ef0b64b4cf96904e377eb012ebb38f097c518/trafilatura/core.py#L150).

The solution is to invert the order of these last two blocks, so that the `options` variable is backfilled before `load_html` is called and raises its error.
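
To illustrate the ordering problem, here is a minimal sketch. It is not the actual Trafilatura code: `Options`, `load_html`, `extract_buggy` and `extract_fixed` are simplified, hypothetical stand-ins, and the exception type and default `source` value are assumptions chosen only to reproduce the pattern described above.

```python
import logging
from dataclasses import dataclass
from typing import Optional

LOGGER = logging.getLogger(__name__)


@dataclass
class Options:
    """Hypothetical stand-in for the extraction options object."""
    source: Optional[str] = None


def load_html(content: str) -> str:
    """Hypothetical stand-in for the parsing step; fails on empty input."""
    if not content:
        raise ValueError("empty HTML tree")
    return content


def extract_buggy(content: str, options: Optional[Options] = None) -> Optional[str]:
    """Backfills options AFTER parsing, mirroring the problematic order."""
    try:
        tree = load_html(content)  # may raise while options is still None
        if options is None:
            options = Options(source="unknown")
        return tree
    except ValueError as err:
        # With no options argument, options is None here, so reading
        # options.source raises AttributeError and hides the real error.
        LOGGER.warning("discarding data: %s (%s)", options.source, err)
        return None


def extract_fixed(content: str, options: Optional[Options] = None) -> Optional[str]:
    """Backfills options BEFORE parsing, so the handler can always log."""
    try:
        if options is None:
            options = Options(source="unknown")
        tree = load_html(content)
        return tree
    except ValueError as err:
        LOGGER.warning("discarding data: %s (%s)", options.source, err)
        return None
```

In this sketch, calling `extract_buggy("")` without an `options` argument fails with an `AttributeError` inside the error handler, while `extract_fixed("")` logs the warning and returns `None`.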