Closed: cdolfi closed this issue 3 years ago
Would something like this help for requirements 1 and 2?
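import os
import subprocess

import wget

# Date range for the export; months are zero-padded strings.
start_month = "10"
start_year = "2020"
end_month = "11"
end_year = "2020"

# Export endpoint that returns the list archive as a gzipped mbox.
url = f"https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/export/devel@lists.fedoraproject.org.mbox.gz?start={start_year}-{start_month}-01&end={end_year}-{end_month}-01"
os.makedirs("../data", exist_ok=True)
wget.download(url, out="../data/")

# gzip -d replaces the .gz file with the plain .mbox; check=True raises on failure.
subprocess.run(["gzip", "-d", "../data/devel@lists.fedoraproject.org.mbox.gz"], check=True)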
@cdolfi did the above solution work for you?
@oindrillac Some parts, yes. I will post a WIP soon with updates and to get feedback/suggestions.
@MichaelClifford @oindrillac A new problem has come up from this: when I write back to the mbox file, it does not come out properly formatted. If I try to open it as an .mbox file, I get a formatting error. Any suggestions on things to try?
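One thing that might be worth trying, offered as a guess since the thread does not show how the file is being written: if the mbox is assembled by hand, body lines that begin with "From " can be mistaken for message separators when the file is read back. A minimal sketch using the standard library `mailbox` module, which handles the separator lines and escaping itself (`write_mbox` is a hypothetical helper name, not something from the notebooks):

```python
import mailbox
from email.message import EmailMessage


def write_mbox(path, messages):
    """Write email messages to an mbox file via the stdlib mailbox module.

    mailbox.mbox adds the "From " separator line for each message and
    escapes body lines that start with "From ", so the result can be
    parsed again. `write_mbox` is a hypothetical helper for illustration.
    """
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for msg in messages:
            box.add(msg)
        box.flush()  # persist pending changes to disk
    finally:
        box.close()


def make_message(sender, subject, body):
    """Build a simple plain-text message (illustrative data only)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

Reading the file back with `mailbox.mbox(path, create=False)` and checking `len(box)` against the number of messages written is a quick way to confirm the format survived the round trip.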
@cdolfi can you add a link to the notebook/PR where you are running into this problem?
@cdolfi can you elaborate on the issue that you are seeing as you try to convert from .gz to .mbox and share an example snippet? When I tried to run your notebook, I did not notice a difference in formatting between the file fetched manually and that generated from the notebook.
@cdolfi yeah, can you add some more details about the problem you are having? I just ran both notebooks to download and then parse some data and I did not get an error. It seems right to me. Here is an image of my output from parsing_mbox.ipynb.
It allows me to save it, but when I try to open the mbox from the bucket this is the error that I get:
@cdolfi I was also able to run both your notebooks (read the saved mbox from retrieve_mbox into parsing_mbox).
I think the unzipped mbox saved from retrieve_mbox is formatted correctly. I suspect an error in the way you are reading it from S3 in parsing_mbox. Can we discuss that in #5? Can you add a separate PR (from a different branch) for updates to parsing_mbox tackling issue #5?
The current data was manually downloaded from the site and saved. Now I want to go back and write a notebook to do this automatically. The issues I have run into come from trying to use a get, request, or similar call to retrieve the file in a form that Python can unzip from .gz to get to the mbox.
Acceptance requirements:
[x] discover how to correctly access the url to solve gz issues
[x] gz to mbox
[x] save mbox to the bucket
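For the "gz to mbox" requirement, here is a rough sketch of how the decompression could be done in pure Python instead of shelling out to `gzip`, with a quick sanity check that the result parses as an mbox. The function names `gunzip_mbox` and `count_messages` are hypothetical, and the sketch assumes the `.mbox.gz` file has already been downloaded; it does not cover the download or the bucket upload.

```python
import gzip
import mailbox
import shutil
from pathlib import Path


def gunzip_mbox(gz_path, mbox_path):
    """Decompress a .mbox.gz archive into a plain .mbox file.

    Streams the data in chunks so large archives are not loaded
    into memory at once. Hypothetical helper for illustration.
    """
    with gzip.open(gz_path, "rb") as src, open(mbox_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(mbox_path)


def count_messages(mbox_path):
    """Open the file with the stdlib mailbox module and count its messages."""
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Unlike `gzip -d`, this leaves the original `.gz` file in place, which may or may not be what you want. A message count of zero on a non-empty file would suggest the file is not valid mbox.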