datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Use `collect-mail --url` for W3C/3GPP/IEEE/etc #597

Open laurenmarietta opened 1 year ago

laurenmarietta commented 1 year ago

The documentation for how to scrape datasets shows that you can use either `collect-mail --url` or `collect-mail --file` when scraping IETF mailing lists, but only `collect-mail --file` when scraping W3C/3GPP/IEEE/etc. mailing lists.

From my (admittedly limited) poking around in the code, it seems like `mailman.collect_archive_from_url` could be rewritten fairly simply, using the code already in the documentation (linked above), so that the `--url` option works for all of the different mailing list types. I imagine that would be useful for people who are coming to this package without necessarily wanting to download hundreds of mailing lists in one go.

(Please forgive me if there is an existing issue about this, or if I've wildly misunderstood the code in `mailman.py`. I've just been getting acquainted with the package! 😅)
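
For illustration only, one way the `--url` option could decide which list type it is dealing with is to look at the archive hostname. This is a minimal sketch, not bigbang's actual code: `detect_list_type`, the `HOST_TO_LIST_TYPE` table, and the hostnames in it are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical mapping from archive hostname to a list-type label.
# The hostnames are assumptions for illustration, not taken from bigbang.
HOST_TO_LIST_TYPE = {
    "mailarchive.ietf.org": "ietf",
    "lists.w3.org": "w3c",
    "list.etsi.org": "3gpp",
    "listserv.ieee.org": "ieee",
}


def detect_list_type(url: str) -> str:
    """Return a list-type label for `url`, falling back to "mailman"."""
    # urlparse().hostname is already lower-cased; it is None for bare strings.
    host = urlparse(url).hostname or ""
    return HOST_TO_LIST_TYPE.get(host, "mailman")
```

A caller could then pick the matching scraper from the returned label, so that a single `--url` entry point covers every supported archive.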

sbenthall commented 1 year ago

Thanks for this. It's right on. It's related to an issue that's just come up: it's much easier to download mbox files from the new IETF mailing list archive interface, so we will need a mailman ingest from files very soon.

Streamlining the CLI so that it automatically recognizes whether something is a URL or a file name is a nice idea.
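
A minimal sketch of how that detection might look, assuming only `http` and `https` URLs need to be recognized and everything else is treated as a file name. The function name `classify_source` is hypothetical and not part of bigbang.

```python
from urllib.parse import urlparse


def classify_source(arg: str) -> str:
    """Return "url" or "file" for a single command-line argument."""
    parsed = urlparse(arg)
    # Require both a web scheme and a host, so that Windows paths such as
    # "C:\\data\\lists.txt" (parsed scheme "c") are not mistaken for URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "file"
```

With something like this in place, the CLI could accept a single positional argument and route it to the URL or file code path, instead of asking the user to choose between `--url` and `--file`.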