Common-SenseMakers / sensemakers

Sensemakers infrastructure for developing AI-based tools for semantic annotations of social posts. Cross-poster app to publish your semantic posts on different networks.
GNU General Public License v3.0

[NLP] Problem with citoid metadata extraction #66

Closed: ShaRefOh closed this issue 1 month ago

ShaRefOh commented 2 months ago

Here is an example tweet for which topic extraction raises an error: ('https://twitter.com/mbauwens/status/1779543397528740338', TypeError("'ContentTypeError' object does not support item assignment"))

ShaRefOh commented 2 months ago

Full log:

Traceback (most recent call last):
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 208, in <module>
    pred_labels(df=df,config=config)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 54, in pred_labels
    results = model.batch_process_ref_posts(inputs=inputs,active_list=["keywords", "topics"],batch_size=10)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/parsers/multi_chain_parser.py", line 213, in batch_process_ref_posts
    md_dict = extract_posts_ref_metadata_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 160, in extract_posts_ref_metadata_dict
    md_dict = extract_all_metadata_to_dict(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 129, in extract_all_metadata_to_dict
    md_list = extract_all_metadata_by_type(
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 119, in extract_all_metadata_by_type
    return extract_urls_citoid_metadata(target_urls, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 101, in extract_urls_citoid_metadata
    return normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 30, in normalize_citoid_metadata
    metadata["original_url"] = url
TypeError: 'ContentTypeError' object does not support item assignment

ronentk commented 2 months ago

@ShaRefOh I tried this in the notebook and topics extraction did work - https://github.com/Common-SenseMakers/sensemakers/blob/44177802dbc934a8ddc2fe3ff969a514a4450d6d/nlp/notebooks/multi_chain_parser_example.ipynb

Does that notebook section work for you? (cell 6)

ronentk commented 2 months ago

It looks like there is a problem running citoid at scale: a small fraction of requests fail and those failures aren't currently handled properly. Working on retry functionality now.
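A retry wrapper for the flaky citoid calls could look like the following sketch, assuming a synchronous `fetch` callable and hypothetical names (the repo's actual implementation lives in `citoid.py` and logs "Max retries exceeded. Returning default value."):

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0, default=None):
    """Call fetch(url), retrying with exponential backoff on any exception.

    Falls back to `default` once max_retries attempts have failed, so a
    single unreachable URL does not abort an entire batch.
    """
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt < max_retries - 1:
                # Back off: base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * 2 ** attempt)
    return default
```

Returning a sentinel default rather than re-raising matches the behavior later seen in the logs, though the default value then has to be handled downstream (see the pydantic error below in the thread).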

ronentk commented 2 months ago

@ShaRefOh I pushed a fix to the nlp-dev branch, can you move there and try it? You should be able to do the batch code as usual.

ShaRefOh commented 2 months ago

@ronentk I tried it and got a different error:

ValidationError                           Traceback (most recent call last)
Cell In[7], line 2
      1 # batch process
----> 2 results = multi_chain_parser.batch_process_ref_posts(inputs,active_list=['topics','keywords'],batch_size=10)

File ~/sensemakers/nlp/notebooks/../desci_sense/shared_functions/parsers/multi_chain_parser.py:259, in MultiChainParser.batch_process_ref_posts(self, inputs, batch_size, active_list)
    257 post_processed_results = []
    258 for post, result, prompts_dict in zip(inputs, results, inst_prompts):
--> 259     post_processed_res = self.post_process_raw_results(
    260         post,
    261         prompts_dict,
    262         result,
    263         md_dict,
    264         self.config.post_process_type,
    265     )
    266     post_processed_results.append(post_processed_res)
    268 logger.debug("Done!")

File ~/sensemakers/nlp/notebooks/../desci_sense/shared_functions/parsers/multi_chain_parser.py:155, in MultiChainParser.post_process_raw_results(self, post, inst_prompt_dict, raw_results, md_dict, post_process_type)
    152     return raw_results
    154 # convert raw outputs to combined format
--> 155 combined_res = post_process_chain_output(
    156     post,
    157     raw_results,
...

ValidationError: 1 validation error for CombinedParserOutput
item_types.1
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
    For further information visit https://errors.pydantic.dev/2.6/v/string_type

There are also many of these: 2024-05-02 10:41:58.981 | ERROR | desci_sense.shared_functions.web_extractors.citoid:return_default_value:58 - Max retries exceeded. Returning default value.
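The `ValidationError` says `item_types.1` received `None` where `CombinedParserOutput` expects a string, which is consistent with the retry fallback returning a default value that lacks an item type. One hedged fix (hypothetical helper name, not the repo's actual code) is to substitute a default label before model validation:

```python
def fill_default_item_types(item_types, default="unknown"):
    """Replace None entries (e.g. from failed citoid lookups) with a default
    string so downstream pydantic validation of item_types succeeds."""
    return [t if t is not None else default for t in item_types]
```

An alternative would be declaring the field as `List[Optional[str]]` in the pydantic model, but that pushes None-handling onto every consumer of the output.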

ronentk commented 2 months ago

Hmm @ShaRefOh can you show me where I can run the script? It would help me debug it faster. Thanks

ShaRefOh commented 2 months ago

@ronentk I just pushed a few extra cells to nlp/notebooks/multi_chain_parser_example.ipynb so you can download the dataset and run the parser over it. Look at the last few cells.

ronentk commented 2 months ago

@ShaRefOh I pushed two fixes, maybe you can check again?

ShaRefOh commented 2 months ago

Great! I reran it a few times. It seems to do the job, and no 'default' values are returned! @ronentk Can you merge this into the dev branch?

ronentk commented 2 months ago

Great, I'll do it next week with the multi ref feature #61