complementizer / wcep-mds-dataset


Error creating the dataset #1

Closed: mikelewis0 closed this issue 4 years ago

mikelewis0 commented 4 years ago

Hi, I'm trying to follow the steps in the README to create the dataset. The first two steps seemed to work ok, but then I hit this error in combine_and_split. Can you tell me how to fix this?

350000 cc articles done, 2677/10200 clusters done
360000 cc articles done, 2706/10200 clusters done
Traceback (most recent call last):
  File "combine_and_split.py", line 127, in <module>
    main(parser.parse_args())
  File "combine_and_split.py", line 109, in main
    clusters, args.cc_articles, id_to_cluster_idx, tmp_clusters_path
  File "combine_and_split.py", line 38, in add_cc_articles_to_clusters
    c.setdefault('cc_articles_filled', [])
AttributeError: 'NoneType' object has no attribute 'setdefault'

complementizer commented 4 years ago

Will look into this ASAP!

complementizer commented 4 years ago

I'm still not 100% sure what caused your error; it was probably due to an article from Common Crawl being stored multiple times. I could only reproduce it by inserting duplicate articles into data/cc_storage/cc_articles.jsonl. I made some changes in combine_and_split.py that should prevent the bug in that case. I also found another bug in combine_and_split.py (it only occurs if --max-cluster-size was used before). Let me know if that fixes it for you.
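For illustration, here is a minimal sketch of the kind of defensive change described above, assuming articles are dicts with an 'id' field and that dropped clusters (e.g. pruned by --max-cluster-size) are stored as None. The function and variable names are taken from the traceback; the actual code in combine_and_split.py may differ.

def add_cc_articles_to_clusters(clusters, cc_articles, id_to_cluster_idx):
    seen_ids = set()
    for article in cc_articles:
        if article['id'] in seen_ids:
            # duplicate article stored twice in cc_articles.jsonl
            continue
        seen_ids.add(article['id'])
        cluster_idx = id_to_cluster_idx.get(article['id'])
        if cluster_idx is None:
            continue
        c = clusters[cluster_idx]
        if c is None:
            # cluster was removed earlier; previously caused the AttributeError
            continue
        c.setdefault('cc_articles_filled', []).append(article)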

shahbazsyed commented 4 years ago

I get this error on the first step:

Traceback (most recent call last):
  File "extract_wcep_articles.py", line 142, in <module>
    main(parser.parse_args())
  File "extract_wcep_articles.py", line 118, in main
    write_jsonl(articles, outpath, mode='a')
NameError: name 'write_jsonl' is not defined
complementizer commented 4 years ago

I get this error on the first step

Traceback (most recent call last):
  File "extract_wcep_articles.py", line 142, in <module>
    main(parser.parse_args())
  File "extract_wcep_articles.py", line 118, in main
    write_jsonl(articles, outpath, mode='a')
NameError: name 'write_jsonl' is not defined

This is fixed now.
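For reference, a minimal sketch of what a write_jsonl helper of this shape typically looks like; the actual fix presumably defines or imports the real helper in the repo, and this version is only an assumption based on the call site write_jsonl(articles, outpath, mode='a').

import json

def write_jsonl(items, path, mode='w'):
    # mode='a' appends to an existing file, matching the call
    # in extract_wcep_articles.py; mode='w' overwrites.
    with open(path, mode) as f:
        for item in items:
            f.write(json.dumps(item) + '\n')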