flairNLP / fundus

A very simple news crawler with a funny name
MIT License

[Bug] A SZ Article crashes fundus during ld construction #193

Closed Weyaaron closed 1 year ago

Weyaaron commented 1 year ago

The following code snippet crashes Fundus while the linked data (LD) mapping is constructed:

from fundus.publishers.de import SZParser
from fundus.scraping.scraper import Scraper
from fundus.scraping.source import StaticSource

test_source = StaticSource([
    "https://www.sueddeutsche.de/projekte/artikel/politik/bremen-bremerhaven-wahl-protokolle-e142075/"])

scraper = Scraper(test_source, parser=SZParser())

for article in scraper.scrape(error_handling='raise'):
    print(article)

The traceback is below:

Traceback (most recent call last):
  File "/home/aaron/Code/Python/Fundus/create_test.py", line 14, in <module>
    for article in scraper.scrape(error_handling='raise'):
  File "/home/aaron/Code/Python/Fundus/src/fundus/scraping/scraper.py", line 89, in scrape
    raise err
  File "/home/aaron/Code/Python/Fundus/src/fundus/scraping/scraper.py", line 82, in scrape
    extraction = self.parser.parse(article_source.html, error_handling)
  File "/home/aaron/Code/Python/Fundus/src/fundus/parser/base_parser.py", line 191, in parse
    self._base_setup(html)
  File "/home/aaron/Code/Python/Fundus/src/fundus/parser/base_parser.py", line 187, in _base_setup
    self.precomputed = Precomputed(html, doc, get_meta_content(doc), LinkedDataMapping(collapsed_lds))
  File "/home/aaron/Code/Python/Fundus/src/fundus/parser/data.py", line 47, in __init__
    self.add_ld(ld)
  File "/home/aaron/Code/Python/Fundus/src/fundus/parser/data.py", line 58, in add_ld
    raise ValueError(f"Found no type for LD")
ValueError: Found no type for LD

Process finished with exit code 1
MaxDall commented 1 year ago

Hey, thanks for reporting this. As far as I can tell, everything works as intended on Fundus's side. The linked data extracted from the site you want to parse contains no `@type` property, so Fundus is unable to index it properly and raises an error. Calling `scrape()` with `error_handling` set to `'raise'` will re-raise that error. If you want to skip such articles instead, change the `error_handling` parameter.
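The failing check can be sketched roughly as follows. This is a simplified stand-in for the `LinkedDataMapping.add_ld` logic in the traceback, not Fundus's actual implementation; the function name `index_ld_by_type` is hypothetical:

```python
import json


def index_ld_by_type(ld: dict) -> tuple[str, dict]:
    """Return (@type, ld) for a JSON-LD object.

    Simplified stand-in for the check in fundus/parser/data.py:
    LDs are indexed by their @type value, so an LD without one
    cannot be stored and raises ValueError.
    """
    ld_type = ld.get("@type")
    if ld_type is None:
        # This is the condition the SZ article hits: one of its
        # JSON-LD blocks carries no @type property.
        raise ValueError("Found no type for LD")
    return ld_type, ld


# A typical news-article LD indexes fine:
article_ld = json.loads('{"@type": "NewsArticle", "headline": "Example"}')
print(index_ld_by_type(article_ld)[0])  # NewsArticle

# An LD without @type reproduces the reported error:
try:
    index_ld_by_type({"headline": "No type here"})
except ValueError as err:
    print(err)  # Found no type for LD
```

With `error_handling='raise'`, this exception propagates out of `scrape()`; other settings catch it and skip the article instead.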

Weyaaron commented 1 year ago

Alright, this seems fine to me, I appreciate the quick response.