Closed cdolfi closed 4 years ago
@cdolfi can you use the "start" and "end" parameters in the download URL? Something like this, to get all the emails from October 2019 to today?
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/export/devel@lists.fedoraproject.org.mbox.gz?start=2019-10-01&end=2020-11-01
Checked it out and it works perfectly. I will use this to get roughly 2 years of data at a time and parse it in chunks of that size. Thank you for the advice!
:partying_face: Glad that works! Sounds good.
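For anyone picking this up later, here is a rough sketch of how the 2-year chunking could look. It only builds the export URLs and does not download anything. The function name `export_urls` and the `years_per_chunk` parameter are made up for illustration; the URL pattern and the `start`/`end` query parameters are copied from the link above. Whether `end` is inclusive or exclusive on the server side is not confirmed here, so adjacent chunks share a boundary date and may need de-duplication (for example by Message-ID) after parsing.

```python
from datetime import date
from urllib.parse import urlencode

# URL pattern taken from the export link posted in this thread.
BASE = "https://lists.fedoraproject.org/archives/list/{list}/export/{list}.mbox.gz"


def export_urls(list_address, start, end, years_per_chunk=2):
    """Yield one export URL per date chunk between start and end.

    Hypothetical helper: consecutive chunks share their boundary date.
    """
    chunk_start = start
    while chunk_start < end:
        target_year = chunk_start.year + years_per_chunk
        try:
            candidate = chunk_start.replace(year=target_year)
        except ValueError:
            # chunk_start is Feb 29 and the target year is not a leap year
            candidate = chunk_start.replace(year=target_year, day=28)
        chunk_end = min(candidate, end)
        query = urlencode(
            {"start": chunk_start.isoformat(), "end": chunk_end.isoformat()}
        )
        yield f"{BASE.format(list=list_address)}?{query}"
        chunk_start = chunk_end


if __name__ == "__main__":
    for url in export_urls(
        "devel@lists.fedoraproject.org", date(2003, 1, 1), date(2008, 1, 1)
    ):
        print(url)
```

Each URL can then be fetched and parsed on its own, which keeps any single download small.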
@cdolfi have we figured out a way to extract all files using the start and end parameters? If so, is your note in the parsing_mbox.ipynb notebook, "Update later with retrieval from buckets when it is discovered how to get all emails from the archieve", out of date? If it is, can we update the notebook to fetch and store data from the S3 bucket?
Is this issue closed? The current notebook 'parsing_mbox.ipynb' doesn't scrape the data from the HyperKitty archives at the moment. That should be implemented, either in that notebook or in another one, before this issue can be considered closed, I think.
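As a starting point for that missing step, a minimal sketch using only the Python standard library might look like the following. The helper names `download_export` and `read_gzipped_mbox` are hypothetical and are not taken from the notebook; storing the result in the S3 bucket is not covered here. It assumes the export URL returns a gzip-compressed mbox file, as the `.mbox.gz` suffix suggests.

```python
import gzip
import mailbox
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_export(url, dest):
    """Download one gzipped mbox export to dest (a local path chosen by the caller)."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return Path(dest)


def read_gzipped_mbox(gz_path):
    """Decompress a .mbox.gz file and return a list of (subject, date) header pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = Path(tmp) / "export.mbox"
        # mailbox.mbox needs a plain file on disk, so decompress first.
        with gzip.open(gz_path, "rb") as src, open(mbox_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(str(mbox_path), create=False)
        try:
            return [(msg["Subject"], msg["Date"]) for msg in box]
        finally:
            box.close()
```

Typical use would be to call `download_export` once per date-range URL and then pass each downloaded file to `read_gzipped_mbox`, or to replace the header extraction with whatever parsing the notebook already does.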
The Fedora mailing list has an option to download the entire archive, but it only retrieves a portion of the emails (2003-2008ish). A new way to retrieve all of the mbox files must be created:
Acceptance criteria: