Create scraping tool to access all of the emails in the hyperkitty archive

aicoe-aiops / fedora-mailing-list-analysis

This will be the repo for the Fedora mailing list sentiment analysis project

Create scraping tool to access all of the emails in the hyperkitty archive #2

Closed cdolfi closed 4 years ago

cdolfi commented 4 years ago

The Fedora mailing list has an option to download the entire archive, but it only retrieves a portion of the emails (roughly 2003-2008). A new way to retrieve all of the mbox files must be created:

Acceptance criteria:

[x] Create a notebook that retrieves all of the mbox files and saves them to the bucket OR
[ ] Discover how to get the "entire archive" download option to retrieve the entire archive

MichaelClifford commented 4 years ago

@cdolfi can you use the "start" and "end" parameters in the download URL? Would something like this get all the emails from October 2019 to today?

https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/export/devel@lists.fedoraproject.org.mbox.gz?start=2019-10-01&end=2020-11-01
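A minimal sketch of how such a URL could be assembled for any date range. The helper name `build_export_url` and the `BASE` constant are hypothetical; the path layout and the `start`/`end` query parameters are simply copied from the URL above, not taken from HyperKitty documentation.

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: base path copied from the example URL in this thread.
BASE = "https://lists.fedoraproject.org/archives/list"


def build_export_url(list_address: str, start: date, end: date) -> str:
    """Build an mbox export URL for one list and one date range (hypothetical helper)."""
    # Dates are sent as ISO strings (YYYY-MM-DD), matching the example URL.
    query = urlencode({"start": start.isoformat(), "end": end.isoformat()})
    return f"{BASE}/{list_address}/export/{list_address}.mbox.gz?{query}"


url = build_export_url(
    "devel@lists.fedoraproject.org", date(2019, 10, 1), date(2020, 11, 1)
)
print(url)
```

Called with the dates above, this reproduces the example URL exactly.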

cdolfi commented 4 years ago

Checked it out and it works perfectly. I will use this to get about two years of data at a time and parse it in chunks of that size. Thank you for the advice!
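A rough sketch of the two-year chunking, assuming the archive starts around 2003 (the exact first date is a guess based on the range mentioned earlier). The generator name `date_windows` is hypothetical. Each (start, end) pair would be passed as the `start` and `end` values of one export request.

```python
from datetime import date
from typing import Iterator, Tuple


def date_windows(first: date, last: date, years: int = 2) -> Iterator[Tuple[date, date]]:
    """Yield consecutive (start, end) windows of at most `years` years from first to last."""
    start = first
    while start < last:
        try:
            end = start.replace(year=start.year + years)
        except ValueError:
            # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
            end = start.replace(year=start.year + years, day=28)
        end = min(end, last)  # the final window is clipped to the overall end date
        yield start, end
        start = end


# Assumption: 2003-01-01 as the first date; adjust to the real start of the archive.
windows = list(date_windows(date(2003, 1, 1), date(2020, 11, 1)))
for start, end in windows:
    print(start.isoformat(), end.isoformat())
```

Adjacent windows share a boundary date. Whether the export treats `end` as inclusive is not confirmed here, so it may be worth de-duplicating messages (for example by Message-ID) after parsing.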

MichaelClifford commented 4 years ago

:partying_face: Glad that works! Sounds good.

oindrillac commented 4 years ago

@cdolfi have we figured out a way to extract all files using the start and end parameters? If so, is your note in the parsing_mbox.ipynb notebook, "Update later with retrieval from buckets when it is discovered how to get all emails from the archive", out of date? And can we update the notebook to fetch and store data from the S3 bucket?

MichaelClifford commented 4 years ago

Is this issue closed? The current notebook 'parsing_mbox.ipynb' doesn't scrape the data from the hyperkitty archives at the moment. That should be implemented either in that notebook or in another one before this issue can be considered closed, I think.