Common-SenseMakers / sensemakers

Sensemakers infrastructure for developing AI-based tools for semantic annotation of social posts, plus a cross-poster app to publish semantic posts to different networks.
GNU General Public License v3.0

[NLP] Fix URL normalization #65

Open ronentk opened 5 months ago

ronentk commented 5 months ago

Example: https://twitter.com/marielgoddu/status/1784709899357716521
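
The issue doesn't spell out which normalization fails for this post, but for illustration, here is a minimal sketch of one plausible approach for Twitter/X status URLs: canonicalize mirror hosts and drop query strings and fragments. The function name `normalize_tweet_url` is hypothetical, not part of the repo:

```python
from urllib.parse import urlparse, urlunparse

def normalize_tweet_url(url: str) -> str:
    """Map twitter.com / x.com / mobile.twitter.com status URLs to one
    canonical https://twitter.com form, dropping query string and fragment."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host in {"x.com", "www.x.com", "mobile.twitter.com", "www.twitter.com"}:
        host = "twitter.com"
    # Rebuild with empty params, query, and fragment (strips e.g. ?s=20 share suffixes).
    return urlunparse(("https", host, parsed.path, "", "", ""))

assert (
    normalize_tweet_url("https://x.com/marielgoddu/status/1784709899357716521?s=20")
    == "https://twitter.com/marielgoddu/status/1784709899357716521"
)
```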

ShaRefOh commented 5 months ago

@ronentk Is this error connected to the issue?

```
Traceback (most recent call last):
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 208, in <module>
    pred_labels(df=df, config=config)
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/evaluation/mulitchain_filter_evaluation.py", line 54, in pred_labels
    results = model.batch_process_ref_posts(inputs=inputs, active_list=["keywords", "topics"], batch_size=10)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/parsers/multi_chain_parser.py", line 213, in batch_process_ref_posts
    md_dict = extract_posts_ref_metadata_dict(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 160, in extract_posts_ref_metadata_dict
    md_dict = extract_all_metadata_to_dict(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 129, in extract_all_metadata_to_dict
    md_list = extract_all_metadata_by_type(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 119, in extract_all_metadata_by_type
    return extract_urls_citoid_metadata(target_urls, max_summary_length)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 101, in extract_urls_citoid_metadata
    return normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shaharorielkagan/sensemakers/nlp/desci_sense/shared_functions/web_extractors/metadata_extractors.py", line 30, in normalize_citoid_metadata
    metadata["original_url"] = url

TypeError: 'ContentTypeError' object does not support item assignment
```

I am getting it now when running the batches on the dataset.
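
The `TypeError` suggests that for some URLs the Citoid extractor returns an exception object (aiohttp's `ContentTypeError`) in place of a metadata dict, and `normalize_citoid_metadata` then tries to assign into it. A minimal sketch of a guard; the function name and arguments come from the traceback, but the body is hypothetical, not the repo's actual implementation:

```python
# Hypothetical guard for the loop that fails at metadata_extractors.py, line 30.
# Summary truncation via max_summary_length is elided in this sketch.
def normalize_citoid_metadata(target_urls, metadatas_raw, max_summary_length):
    normalized = []
    for url, metadata in zip(target_urls, metadatas_raw):
        # A failed Citoid fetch (e.g. a non-JSON error response) can surface
        # here as an exception object such as aiohttp.ContentTypeError rather
        # than a dict; assigning into it raises the TypeError above.
        if not isinstance(metadata, dict):
            print(f"Skipping {url}: metadata fetch failed with {metadata!r}")
            continue
        metadata["original_url"] = url
        normalized.append(metadata)
    return normalized
```

Skipping (or substituting placeholder metadata for) failed fetches would keep a single bad URL from aborting a whole batch.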
ronentk commented 5 months ago

@ShaRefOh Not sure; please open a separate issue with steps to reproduce the error (a list of URLs or something similar).

ShaRefOh commented 5 months ago

OK, but for that I will need to go through the posts one by one instead of using the batch parser function.
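
One way to avoid a fully manual pass is to drive the same batch entry point with single-item inputs and record which posts fail. A sketch, assuming `model` and `inputs` are constructed as in `mulitchain_filter_evaluation.py`:

```python
# Hypothetical isolation loop: rerun the traceback's entry point one post at
# a time so the failing URL(s) can be reported in the new issue.
failing = []
for post in inputs:
    try:
        model.batch_process_ref_posts(
            inputs=[post],
            active_list=["keywords", "topics"],
            batch_size=1,
        )
    except Exception as e:  # the batch run died with a TypeError; catch broadly here
        failing.append((post, e))

for post, e in failing:
    print(f"{post}: {e!r}")
```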