Hi @noobistz, I cannot reproduce the bug. Are you using the latest version of the package? Are you using any particular options for the extraction? It would also help if you copied the full error message so that we can see where the potential bug occurs.
Hi @adbar, I am using package version 1.6.2.
Below is the code I used.
downloaded = trafilatura.fetch_url("https://blog.sekoia.io/active-lycantrox-infrastructure-illumination/")
text = trafilatura.extract(downloaded)
The full error message is as follows:
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:782, in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, csv_output, json_output, xml_output, tei_output, tei_validation, target_language, include_tables, include_images, include_formatting, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, settingsfile, config)
780 url_blacklist = set()
781 # extraction
--> 782 docmeta = bare_extraction(
783 filecontent, url=url, no_fallback=no_fallback,
784 include_comments=include_comments, output_format=output_format,
785 target_language=target_language, include_tables=include_tables, include_images=include_images,
786 include_formatting=include_formatting, deduplicate=deduplicate,
787 date_extraction_params=date_extraction_params, with_metadata=with_metadata,
788 max_tree_size=max_tree_size, url_blacklist=url_blacklist, config=config,
789 )
790 if docmeta is None:
791 return None
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:685, in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
682 # compare if necessary
683 if no_fallback is False:
684 #if sure_thing is False:
--> 685 postbody, temp_text, len_text = compare_extraction(tree, backup_tree, url, postbody, temp_text, len_text, target_language, include_formatting, config)
686 else:
687 # rescue: try to use original/dirty tree
688 if sure_thing is False and len_text < config.getint('DEFAULT', 'MIN_EXTRACTED_SIZE'):
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/core.py:473, in compare_extraction(tree, backup_tree, url, body, text, len_text, target_language, include_formatting, config)
471 # override faulty extraction # len_text < MIN_EXTRACTED_SIZE*10
472 if body.xpath(SANITIZED_XPATH):
--> 473 body2, text2, len_text2, jt_result = justext_rescue(tree, url, target_language, body, 0, '')
474 if jt_result is True: # and not len_text > 2*len_text2:
475 LOGGER.debug('using justext, length: %s', len_text2) #MIN_EXTRACTED_SIZE:
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:95, in justext_rescue(tree, url, target_language, postbody, len_text, text)
93 '''Try to use justext algorithm as a second fallback'''
94 result_bool = False
---> 95 temppost_algo = try_justext(tree, url, target_language)
96 if temppost_algo is not None:
97 temp_text = trim(' '.join(temppost_algo.itertext()))
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:80, in try_justext(tree, url, target_language)
78 # extract
79 try:
---> 80 paragraphs = custom_justext(tree, justext_stoplist)
81 except ValueError as err: # not an XML element: HtmlComment
82 LOGGER.error('justext %s %s', err, url)
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/trafilatura/external.py:65, in custom_justext(tree, stoplist)
63 dom = preprocessor(tree) # tree_cleaning(tree, True)
64 paragraphs = ParagraphMaker.make_paragraphs(dom)
---> 65 classify_paragraphs(paragraphs, stoplist, 50, 200, 0.1, 0.2, 0.2, True)
66 revise_paragraph_classification(paragraphs, 200)
67 return paragraphs
File ~/anaconda3/envs/dl/lib/python3.10/site-packages/justext/core.py:249, in classify_paragraphs(paragraphs, stoplist, length_low, length_high, stopwords_low, stopwords_high, max_link_density, no_headings)
243 def classify_paragraphs(paragraphs, stoplist, length_low=LENGTH_LOW_DEFAULT,
244 length_high=LENGTH_HIGH_DEFAULT, stopwords_low=STOPWORDS_LOW_DEFAULT,
245 stopwords_high=STOPWORDS_HIGH_DEFAULT, max_link_density=MAX_LINK_DENSITY_DEFAULT,
246 no_headings=NO_HEADINGS_DEFAULT):
247 "Context-free paragraph classification."
--> 249 stoplist = define_stoplist(stoplist)
250 for paragraph in paragraphs:
251 length = len(paragraph)
TypeError: unhashable type: 'set'
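For context, Python raises this message whenever a plain set is used where a hashable value is required, for example as a dictionary key, as a member of another set, or as an argument to a cached function. The traceback does not show the body of define_stoplist, so the sketch below is only an illustration of the error class, not the library's actual code; count_words and try_call are hypothetical names.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_words(words):
    """Hypothetical cached helper: lru_cache must hash its arguments."""
    return len(words)


def try_call(arg):
    """Return the result, or the error text if the argument cannot be hashed."""
    try:
        return count_words(arg)
    except TypeError as err:
        return str(err)


# A mutable set cannot be hashed, so the cached call fails.
set_result = try_call({"the", "a"})
# A frozenset is immutable and hashable, so the same call succeeds.
frozen_result = try_call(frozenset({"the", "a"}))

print(set_result)
print(frozen_result)
```

If two libraries disagree about whether a stoplist is passed as a set or a frozenset, a mismatch of this kind is one plausible way to end up with this error.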
Oops, I clicked on the wrong button. Re-opening.
Thanks, I still can't reproduce the issue. It seems to be related to the justext
dependency and/or the way you installed it with Anaconda. Is it the latest version (v3)?
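A quick way to answer the version question is to ask the active environment which versions it has installed. This is a minimal sketch using only the standard library; installed_version is a hypothetical helper, and it assumes the distributions are registered under the names "trafilatura" and "justext".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names are an assumption based on the packages in the traceback.
for name in ("trafilatura", "justext"):
    print(name, installed_version(name))
```

Running this inside the same environment that produced the traceback shows whether an outdated or mismatched copy of the dependency is being picked up.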
It seems like after using a brand new environment, the error went away. Thanks for your help!
Hi there,
This is an incredible package!
I was unable to extract text from this particular site. The error shown is
TypeError: unhashable type: 'set'
How can I go about resolving this?