adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Crash on a specific web page #140

Closed dmoklaf closed 2 years ago

dmoklaf commented 2 years ago

My code extracts (using my own spider framework) the HTML content of this page:

https://paperswithcode.com/paper/revisiting-deep-learning-models-for-tabular/review/

parses it with lxml, and calls Trafilatura with the resulting content tree (I do not let Trafilatura handle these steps because my framework takes care of parallelism, content-encoding edge cases and, most importantly, disk caching):

extraction = trafilatura.core.bare_extraction(content_tree, include_comments=False)

This crashes ONLY ON THIS WEB PAGE with this stack trace:

...
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.9/site-packages/trafilatura/core.py", line 771, in bare_extraction
    docmeta = extract_metadata(tree, url, date_extraction_params, no_fallback, author_blacklist)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.9/site-packages/trafilatura/metadata.py", line 390, in extract_metadata
    metadata = extract_meta_json(tree, metadata)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.9/site-packages/trafilatura/metadata.py", line 74, in extract_meta_json
    metadata = extract_json(schema, metadata)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.9/site-packages/trafilatura/json_metadata.py", line 45, in extract_json
    if isinstance(content["@type"], list):
TypeError: string indices must be integers

This indicates that, in this specific case, the "content" variable contains a string rather than a dictionary.
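One plausible way to end up with a string here, shown as a minimal standalone sketch rather than Trafilatura's actual code: a loop that assumes the parsed JSON-LD is a list of objects will, when given a single object, iterate over that object's keys. The payload and the `types_seen` helper below are hypothetical and only illustrate the mechanism.

```python
import json

# Hypothetical JSON-LD payload: a single object, not a list of objects.
raw = '{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}'
schema = json.loads(raw)


def types_seen(schema):
    """Hypothetical loop that assumes `schema` is a list of dicts."""
    found = []
    for content in schema:  # iterating over a dict yields its keys (strings)
        found.append(content["@type"])  # TypeError when `content` is a str
    return found


error = None
try:
    types_seen(schema)
except TypeError as exc:
    error = exc
    print(f"TypeError: {exc}")

# The same loop works when the object is wrapped in a list.
print(types_seen([schema]))
```

Indexing a string with a string key is what produces the "string indices must be integers" message seen in the stack trace.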

adbar commented 2 years ago

Thanks, I can reproduce the bug.

@felipehertzer It seems we didn't test your PR thoroughly this summer. Could you have a look at it?

adbar commented 2 years ago

@dmoklaf @felipehertzer I just made sure such errors are caught, though it would be more elegant to fix them.
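Catching the error could look roughly like the sketch below. This is an assumption about the general approach, not the actual change made in Trafilatura; `extract_type_safely` is a hypothetical helper name.

```python
def extract_type_safely(content, default=None):
    """Return content["@type"], or `default` if the lookup is not possible.

    Hypothetical helper: it swallows the TypeError raised when `content`
    is a string, list or None, and the KeyError raised when a dict has
    no "@type" entry.
    """
    try:
        return content["@type"]
    except (TypeError, KeyError):
        return default


print(extract_type_safely({"@type": "Article"}))  # a well-formed object
print(extract_type_safely("@type"))               # a stray string
print(extract_type_safely({"headline": "x"}))     # an object without "@type"
```

A guard like this keeps extraction from crashing, but it silently drops the metadata of the malformed entry, which is why fixing the underlying cause is preferable.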

felipehertzer commented 2 years ago

Hi @adbar @dmoklaf, thanks for alerting me. I've made the fix. The problem was that the page uses a different structure for an array when it contains only one item, whereas the code expected the structure used for several items.
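A common way to handle this single-item versus multi-item difference is to normalise every value to a list before iterating. The sketch below illustrates that idea under stated assumptions. It is not the code from the actual fix, and `as_list` and `collect_types` are hypothetical names.

```python
def as_list(node):
    """Wrap a single value in a list; pass lists through; map None to []."""
    if node is None:
        return []
    if isinstance(node, list):
        return node
    return [node]


def collect_types(schema):
    """Collect every "@type" value from parsed JSON-LD.

    Accepts either one object or a list of objects, and an "@type" that
    is either one string or a list of strings.
    """
    types = []
    for content in as_list(schema):
        if not isinstance(content, dict):
            continue  # skip stray strings, numbers, etc.
        types.extend(str(t) for t in as_list(content.get("@type")))
    return types


single = {"@type": "Article"}
many = [{"@type": ["Article", "NewsArticle"]}, {"@type": "Person"}]

print(collect_types(single))
print(collect_types(many))
```

With this normalisation, the calling code has one shape to deal with and needs no separate branch for the one-item case.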