Closed ShaRefOh closed 1 month ago
Full log:
Traceback (most recent call last):
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 208, in <module>
    pred_labels(df=df,config=config)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 54, in pred_labels
    results = model.batch_process_ref_posts(inputs=inputs,active_list=["keywords", "topics"],batch_size=10)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/parsers/multi_chain_parser.py", line 213, in batch_process_ref_posts
    md_dict = extract_posts_ref_metadata_dict(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 160, in extract_posts_ref_metadata_dict
    md_dict = extract_all_metadata_to_dict(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 129, in extract_all_metadata_to_dict
    md_list = extract_all_metadata_by_type(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 119, in extract_all_metadata_by_type
    return extract_urls_citoid_metadata(target_urls, max_summary_length)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 101, in extract_urls_citoid_metadata
    return normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 30, in normalize_citoid_metadata
    metadata["original_url"] = url
    ~~~~~~~~^^^^^^^^^^^^^^^^
TypeError: 'ContentTypeError' object does not support item assignment
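The TypeError is consistent with one entry in the raw results being an exception object rather than a metadata dict, which can happen when failed requests are collected alongside successful ones. Below is a minimal sketch of a guard, assuming the results arrive as a list parallel to the URLs. `normalize_results` and the `error` key are hypothetical names for illustration, not the repository's actual code.

```python
from typing import Any, Dict, List


def normalize_results(urls: List[str], raw_results: List[Any]) -> List[Dict[str, Any]]:
    """Attach the source URL to each result, tolerating failed fetches.

    Hypothetical helper: any entry that is not a dict (for example an
    exception object returned in place of metadata) is replaced by a stub
    record instead of being indexed into.
    """
    normalized = []
    for url, result in zip(urls, raw_results):
        if not isinstance(result, dict):
            # Failed fetch: keep the URL and record what went wrong.
            normalized.append({"original_url": url, "error": repr(result)})
            continue
        item = dict(result)  # copy so the caller's dict is not mutated
        item["original_url"] = url
        normalized.append(item)
    return normalized
```

With this shape, a single failed URL yields a stub record and the rest of the batch is still processed.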
@ShaRefOh I tried this in the notebook and topics extraction did work - https://github.com/Common-SenseMakers/sensemakers/blob/44177802dbc934a8ddc2fe3ff969a514a4450d6d/nlp/notebooks/multi_chain_parser_example.ipynb
Does that notebook section work for you? (cell 6)
It looks like there is a problem running citoid at scale: a small fraction of requests fail, and those failures aren't currently handled properly. I'm working on retry functionality now.
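For reference, a retry wrapper of the kind described could look roughly like the sketch below. It is synchronous for simplicity and is not the fix that was pushed. `fetch_with_retry`, its parameters, and the backoff schedule are all assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], Any],
    url: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
    default: Optional[Any] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch(url), retrying on any exception with exponential backoff.

    Returns `default` if every attempt fails, so one bad URL does not
    abort the whole batch.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    return default
```

Callers then need to handle the default value explicitly, since it stands in for missing metadata.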
@ShaRefOh I pushed a fix to the nlp-dev branch, can you move there and try it? You should be able to do the batch code as usual.
@ronentk I tried it and got a different error:
ValidationError Traceback (most recent call last)
Cell In[7], line 2
1 # batch process
----> 2 results = multi_chain_parser.batch_process_ref_posts(inputs,active_list=['topics','keywords'],batch_size=10)
File ~/sensemakers/nlp/notebooks/../desci_sense/shared_functions/parsers/multi_chain_parser.py:259, in MultiChainParser.batch_process_ref_posts(self, inputs, batch_size, active_list)
257 post_processed_results = []
258 for post, result, prompts_dict in zip(inputs, results, inst_prompts):
--> 259 post_processed_res = self.post_process_raw_results(
260 post,
261 prompts_dict,
262 result,
263 md_dict,
264 self.config.post_process_type,
265 )
266 post_processed_results.append(post_processed_res)
268 logger.debug("Done!")
File ~/sensemakers/nlp/notebooks/../desci_sense/shared_functions/parsers/multi_chain_parser.py:155, in MultiChainParser.post_process_raw_results(self, post, inst_prompt_dict, raw_results, md_dict, post_process_type)
152 return raw_results
154 # convert raw outputs to combined format
--> 155 combined_res = post_process_chain_output(
156 post,
157 raw_results,
...
ValidationError: 1 validation error for CombinedParserOutput
item_types.1
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
For further information visit https://errors.pydantic.dev/2.6/v/string_type
There are also many of these:
2024-05-02 10:41:58.981 | ERROR | desci_sense.shared_functions.web_extractors.citoid:return_default_value:58 - Max retries exceeded. Returning default value.
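The two symptoms may be related: if a default value returned after the retries are exhausted carries no item type, a `None` could end up in `item_types` and fail string validation. That is a guess from the logs, not a confirmed diagnosis. A minimal sketch of substituting a placeholder before validation, where `collect_item_types`, the `item_type` key, and the `"unknown"` placeholder are hypothetical:

```python
from typing import Any, Dict, List, Optional

UNKNOWN_TYPE = "unknown"  # hypothetical placeholder for missing item types


def collect_item_types(metadata_list: List[Optional[Dict[str, Any]]]) -> List[str]:
    """Return one string item type per metadata entry.

    Entries that are missing, or whose item type is not a string, are
    mapped to UNKNOWN_TYPE so the resulting list contains only strings.
    """
    item_types = []
    for md in metadata_list:
        value = md.get("item_type") if isinstance(md, dict) else None
        item_types.append(value if isinstance(value, str) else UNKNOWN_TYPE)
    return item_types
```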
Hmm @ShaRefOh can you show me where I can run the script? It would help me debug it faster. Thanks
@ronentk I just pushed a few extra cells to nlp/notebooks/multi_chain_parser_example.ipynb
so you can download the dataset and run the parser over it. Look at the last few cells
@ShaRefOh I pushed two fixes, maybe you can check again?
Great! I rechecked it a few times. It seems to do the job, and no 'default' values are returned! @ronentk Can you push this into the dev branch?
Great, I'll do it next week with the multi ref feature #61
Here is an example tweet for which topic extraction raises an error:
('https://twitter.com/mbauwens/status/1779543397528740338', TypeError("'ContentTypeError' object does not support item assignment"))