adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Review input type for `is_probably_readerable()` function #749

Open adbar opened 3 days ago

adbar commented 3 days ago

The options passed to this function should be of the `Extractor` type, not `Any` or `dict`. The tests have to be rewritten accordingly:

def is_probably_readerable(html: HtmlElement, options: Optional[Extractor] = None) -> bool:
    ...
    if options:
        option_dict = {attr: getattr(options, attr, None) for attr in options.__slots__}
    else:
        option_dict = {}
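
To illustrate what the slots-based conversion above does, here is a minimal, self-contained sketch. It assumes a hypothetical stand-in class `DemoOptions` and a hypothetical helper `options_to_dict`; the attribute names are invented for the example and are not taken from the real `Extractor` class. It mainly shows that a slot which was never assigned ends up as `None` in the resulting dict, which the rewritten tests would probably need to account for:

```python
from typing import Any, Dict, Optional


class DemoOptions:
    """Hypothetical stand-in for an options class that declares __slots__."""

    # Illustrative attribute names only, not the real Extractor fields.
    __slots__ = ["min_content_length", "min_score", "visibility_checker"]

    def __init__(self, min_content_length: int = 140, min_score: int = 20) -> None:
        self.min_content_length = min_content_length
        self.min_score = min_score
        # visibility_checker is deliberately left unset.


def options_to_dict(options: Optional[DemoOptions] = None) -> Dict[str, Any]:
    """Mirror the conversion from the snippet: slots to dict, or an empty dict."""
    if options:
        # getattr with a default returns None for slots that were never assigned.
        return {attr: getattr(options, attr, None) for attr in options.__slots__}
    return {}


if __name__ == "__main__":
    print(options_to_dict())
    print(options_to_dict(DemoOptions(min_score=5)))
```

With this shape, a test can construct an options object instead of a plain dict and pass it in directly, while the `None` default keeps the existing no-options call path working.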