adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Unable to extract text from a given site, TypeError: unhashable type: 'set' #426

Closed noobistz closed 1 year ago

noobistz commented 1 year ago

Hi there,

This is an incredible package!

I was unable to extract text from this particular site. The error shown is TypeError: unhashable type: 'set'.

How can I go about resolving this?

adbar commented 1 year ago

Hi @noobistz, I cannot reproduce the bug. Are you using the latest version of the package? Are you passing any particular options to the extraction? It would also help if you copied the full error message, so we can see where the potential bug occurs.
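For anyone needing to collect the full error message, here is a minimal sketch using only the standard library. The helper name `call_with_traceback` is hypothetical and not part of trafilatura; it wraps any callable and returns the complete traceback as text if the call raises.

```python
import traceback


def call_with_traceback(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback text) on error."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        # format_exc() returns the same text Python would print for the active exception
        return None, traceback.format_exc()


# Hypothetical usage (assumes trafilatura is installed and `downloaded` holds fetched HTML):
# import trafilatura
# text, error_report = call_with_traceback(trafilatura.extract, downloaded)
# if error_report is not None:
#     print(error_report)  # paste this into the issue
```

Returning the traceback as a string instead of printing it makes it easy to log or paste into a bug report.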

noobistz commented 1 year ago

Hi @adbar, I am using package version 1.6.2.

Below is the code I used.

import trafilatura

downloaded = trafilatura.fetch_url("https://blog.sekoia.io/active-lycantrox-infrastructure-illumination/")
text = trafilatura.extract(downloaded)

The full error message is as follows:

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:782, in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, csv_output, json_output, xml_output, tei_output, tei_validation, target_language, include_tables, include_images, include_formatting, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, settingsfile, config)
    780     url_blacklist = set()
    781 # extraction
--> 782 docmeta = bare_extraction(
    783     filecontent, url=url, no_fallback=no_fallback,
    784     include_comments=include_comments, output_format=output_format,
    785     target_language=target_language, include_tables=include_tables, include_images=include_images,
    786     include_formatting=include_formatting, deduplicate=deduplicate,
    787     date_extraction_params=date_extraction_params, with_metadata=with_metadata,
    788     max_tree_size=max_tree_size, url_blacklist=url_blacklist, config=config,
    789     )
    790 if docmeta is None:
    791     return None

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:685, in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
    682 # compare if necessary
    683 if no_fallback is False:
    684     #if sure_thing is False:
--> 685     postbody, temp_text, len_text = compare_extraction(tree, backup_tree, url, postbody, temp_text, len_text, target_language, include_formatting, config)
    686 else:
    687     # rescue: try to use original/dirty tree
    688     if sure_thing is False and len_text < config.getint('DEFAULT', 'MIN_EXTRACTED_SIZE'):

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:473, in compare_extraction(tree, backup_tree, url, body, text, len_text, target_language, include_formatting, config)
    471 # override faulty extraction # len_text < MIN_EXTRACTED_SIZE*10
    472 if body.xpath(SANITIZED_XPATH):
--> 473     body2, text2, len_text2, jt_result = justext_rescue(tree, url, target_language, body, 0, '')
    474     if jt_result is True: # and not len_text > 2*len_text2:
    475         LOGGER.debug('using justext, length: %s', len_text2)  #MIN_EXTRACTED_SIZE:

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:95, in justext_rescue(tree, url, target_language, postbody, len_text, text)
     93 '''Try to use justext algorithm as a second fallback'''
     94 result_bool = False
---> 95 temppost_algo = try_justext(tree, url, target_language)
     96 if temppost_algo is not None:
     97     temp_text = trim(' '.join(temppost_algo.itertext()))

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:80, in try_justext(tree, url, target_language)
     78 # extract
     79 try:
---> 80     paragraphs = custom_justext(tree, justext_stoplist)
     81 except ValueError as err:  # not an XML element: HtmlComment
     82     LOGGER.error('justext %s %s', err, url)

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:65, in custom_justext(tree, stoplist)
     63 dom = preprocessor(tree) # tree_cleaning(tree, True)
     64 paragraphs = ParagraphMaker.make_paragraphs(dom)
---> 65 classify_paragraphs(paragraphs, stoplist, 50, 200, 0.1, 0.2, 0.2, True)
     66 revise_paragraph_classification(paragraphs, 200)
     67 return paragraphs

File ~/anaconda3/envs/dl/lib/python3.10/site-packages/justext/core.py:249, in classify_paragraphs(paragraphs, stoplist, length_low, length_high, stopwords_low, stopwords_high, max_link_density, no_headings)
    243 def classify_paragraphs(paragraphs, stoplist, length_low=LENGTH_LOW_DEFAULT,
    244         length_high=LENGTH_HIGH_DEFAULT, stopwords_low=STOPWORDS_LOW_DEFAULT,
    245         stopwords_high=STOPWORDS_HIGH_DEFAULT, max_link_density=MAX_LINK_DENSITY_DEFAULT,
    246         no_headings=NO_HEADINGS_DEFAULT):
    247     "Context-free paragraph classification."
--> 249     stoplist = define_stoplist(stoplist)
    250     for paragraph in paragraphs:
    251         length = len(paragraph)

TypeError: unhashable type: 'set'

noobistz commented 1 year ago

Clicked on the wrong button oops, re-opening.

adbar commented 1 year ago

Thanks, I still can't reproduce the issue. It seems to be related to the justext dependency and/or the way you installed it with Anaconda. Is it the latest version (v3)?
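One way to check which versions are actually installed in the active environment is sketched below with the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution names `trafilatura` and `justext` are assumed to match what is installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Report the packages involved in the traceback
for name in ("trafilatura", "justext"):
    print(name, installed_version(name) or "not installed")
```

Run this with the same interpreter that raised the error (here, the conda environment's Python), since another environment may report different versions.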

noobistz commented 1 year ago

It seems like after using a brand new environment, the error went away. Thanks for your help!