datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

collect_mail.py -u doesn't quit gracefully, doesn't produce csv anymore, and throws error messages #425

Closed. nllz closed 3 years ago

nllz commented 3 years ago

After collecting all mail from one list, the script keeps trying to read the archive again in an endless loop instead of exiting, and finally crashes with the following error: https://gist.github.com/nllz/30c987f17b89ac2afd4380100c9a97f9

It crashes without producing a csv file for the list.

During collection this warning also appears: /home/gagarin/Data/bigbang/bigbang/mailman.py:262: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

This also causes collect_mail.py -f to crash and not progress to the next list in the file.

npdoty commented 3 years ago

Appears to be an infinite loop problem (a side effect of our not being careful enough about what is a URL to collect archives from and what is the name of a mailing list, I suspect). I don't think that CSV generation is necessary (mail archives can more easily be loaded straight from mbox or something similar), but the infinite loop where it tries to open a file and collect from the Web will be blocking.
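To make the suspicion concrete, here is a minimal sketch of the two safeguards implied above: deciding explicitly whether an argument is a URL or a list name, and remembering which targets were already collected so the same archive is never fetched twice. The names `looks_like_url`, `collect_all`, and the `collect_one` callback are hypothetical and are not BigBang's actual functions.

```python
from urllib.parse import urlparse


def looks_like_url(target: str) -> bool:
    """Return True when target has an http(s) scheme and a host."""
    parsed = urlparse(target)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def collect_all(targets, collect_one):
    """Call collect_one once per distinct target, skipping repeats.

    collect_one is a hypothetical callback that fetches a single archive
    and returns any further targets it discovered (or None).
    """
    seen = set()
    pending = list(targets)
    collected = []
    while pending:
        target = pending.pop(0)
        key = target.rstrip("/")  # treat ".../ietf" and ".../ietf/" as the same
        if key in seen:
            continue  # already handled; this is what breaks the cycle
        seen.add(key)
        collected.append(target)
        pending.extend(collect_one(target) or [])
    return collected
```

With a guard like this, a collector that keeps rediscovering its own archive URL terminates after one pass instead of recursing.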

I think the YAMLLoadWarning is unrelated, although it is probably an indicator of an issue that should be fixed (low priority).
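For reference, the warning usually goes away by switching to `yaml.safe_load`, or by passing an explicit `Loader`. The snippet below is only a sketch: the YAML text is a made-up stand-in, since the thread does not show what mailman.py line 262 actually loads.

```python
import yaml  # PyYAML, a third-party package

# Hypothetical stand-in for the YAML that mailman.py reads.
text = "list: ietf\nmessages: 3\n"

# safe_load only builds plain Python types (dict, list, str, int, ...).
data = yaml.safe_load(text)

# Equivalent explicit form; naming a Loader also silences the warning.
same = yaml.load(text, Loader=yaml.SafeLoader)
```

Both calls should be safe for plain configuration or metadata files; they would only differ from the old default if the file relied on Python-specific YAML tags.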

@nllz, if you're around, do you have the exact arguments you passed that triggered this infinite recursion? That might speed up our debugging.

nllz commented 3 years ago

The problem is that most notebooks now depend on the CSVs, so we kinda need them :)
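As a possible stopgap, assuming the collected mail is available as an mbox file, a CSV can be produced with the standard library alone. This is a rough sketch: the function name `mbox_to_csv` and the column set are assumptions and may not match what the notebooks expect.

```python
import csv
import mailbox


def mbox_to_csv(mbox_path, csv_path):
    """Write one CSV row per message and return the number of rows."""
    # Assumed columns; adjust to whatever the notebooks actually read.
    fields = ["Message-ID", "From", "Date", "Subject"]
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(fields)
            for message in box:
                # Missing headers become empty strings.
                writer.writerow([str(message.get(name, "")) for name in fields])
                count += 1
    finally:
        box.close()
    return count
```

Message bodies are left out here to keep the sketch short; they would need decoding per content type.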

I get the error when I collect any mailing list, so both this:

python3 bin/collect_mail.py -f examples/url_collections/mm.ietf.org.txt

and this:

python3 bin/collect_mail.py -u https://www.ietf.org/mail-archive/text/ietf/

produce an error. Same with ICANN mailing lists. All with a fresh install.