Open laurenmarietta opened 1 year ago
Thanks for this. It's right on. It's related to an issue that's just come up, which is that it's much easier to download mbox files from the new IETF mailing list archive interface. So we will need a mailman ingest from files very soon.
Streamlining the CLI so that it automatically recognizes whether something is a URL or a file name is a nice idea.
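As a rough illustration of that auto-detection idea, here is a minimal sketch using only the standard library. The function name classify_source is hypothetical and not part of the package; the real CLI may need different rules (for example, for a text file that itself contains a list of URLs).

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for http(s)/ftp locations, otherwise "file".

    Hypothetical helper for illustration; not an existing package function.
    """
    parsed = urlparse(source)
    # Require both a known scheme and a host, so that Windows paths such as
    # "C:\\archives\\list.mbox" (parsed scheme "c") are not treated as URLs.
    if parsed.scheme in {"http", "https", "ftp"} and parsed.netloc:
        return "url"
    return "file"
```

With something like this, a single positional argument could replace the separate --url and --file options, while the explicit flags stay available as overrides.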
The documentation for how to scrape datasets shows that you can use either collect-mail --url or collect-mail --file when scraping IETF mailing lists, but only collect-mail --file when scraping W3C/3GPP/IEEE/etc mailing lists.

From my (admittedly limited) poking around in the code, it seems like mailman.collect_archive_from_url could be pretty simply rewritten using the code already in the documentation (linked above) to allow the --url option to work for all of the different mailing list types. Which I imagine might be useful for those who are coming to this package without necessarily wanting to download hundreds of mailing lists in one go?

(Please forgive if there is an existing issue about this or if I've wildly misunderstood the code in mailman.py, I've just been getting acquainted with the package! 😅)
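To make the proposal concrete, here is a rough sketch of how a single URL entry point might pick a collector based on the archive host. Everything in it is an assumption for illustration: the collector functions are stand-ins that only return a label, the host table has a single example entry, and none of this reflects how mailman.py is actually organised.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


# Hypothetical stand-in collectors. Real ones would download and parse
# the archive; these only return a label so the dispatch logic is visible.
def collect_w3c(url: str) -> str:
    return f"w3c:{url}"


def collect_generic_mailman(url: str) -> str:
    return f"mailman:{url}"


# Illustrative host-to-collector table (an assumption, not the package's).
# Further archive hosts (3GPP, IEEE, ...) would be registered here.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "lists.w3.org": collect_w3c,
}


def collect_from_url(url: str) -> str:
    """Dispatch to a collector based on the URL's host name."""
    host = urlparse(url).hostname or ""
    # Fall back to the generic Mailman collector for unknown hosts.
    handler = HANDLERS.get(host, collect_generic_mailman)
    return handler(url)
```

Keeping the mapping in one table would mean that supporting another archive type only requires registering one more entry, rather than adding another branch to the CLI.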