aicoe-aiops / fedora-mailing-list-analysis

This will be the repo for the Fedora mailing list sentiment analysis project

Discover how to download and save mbox files via notebook #7

Closed cdolfi closed 3 years ago

cdolfi commented 3 years ago

The current data was manually downloaded from the site and saved. Now I want to go back and write a notebook that does this automatically. The issues I have run into come from trying to retrieve the file with a get, request, or similar call and getting it into a form that Python can unzip, so that the .gz layer is removed and the mbox is accessible.

Acceptance requirements:

[x] discover how to correctly access the url to solve gz issues
[x] gz to mbox
[x] save mbox to the bucket
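
For the third requirement, a minimal sketch of the upload step might look like the following. It assumes the bucket is reached through a boto3-style S3 client (anything exposing `upload_file(Filename, Bucket, Key)`); the function name `save_mbox_to_bucket`, the bucket name, and the key are hypothetical and not taken from the project.

```python
from pathlib import Path


def save_mbox_to_bucket(s3_client, mbox_path, bucket, key):
    """Upload a local mbox file to an S3-compatible bucket.

    s3_client is assumed to behave like boto3.client("s3").
    """
    path = Path(mbox_path)
    # Refuse to upload a missing or empty file so a failed download
    # does not silently overwrite good data in the bucket.
    if not path.is_file() or path.stat().st_size == 0:
        raise ValueError(f"{path} is missing or empty; nothing to upload")
    # upload_file reads the file from disk as bytes, so the mbox content
    # is not decoded or re-encoded on the way to the bucket.
    s3_client.upload_file(str(path), bucket, key)
    return f"s3://{bucket}/{key}"
```

With boto3 installed, a call could look like `save_mbox_to_bucket(boto3.client("s3"), "../data/devel@lists.fedoraproject.org.mbox", "my-bucket", "raw/devel.mbox")`, where the bucket and key are placeholders.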

MichaelClifford commented 3 years ago

Would something like this help for requirements 1 and 2?

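import os
import subprocess

import wget

# Date range to export; the year and month are interpolated into the URL below.
start_month = "10"
start_year = "2020"
end_month = "11"
end_year = "2020"

# Export endpoint for the devel list; it returns a gzipped mbox.
url = f"https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/export/devel@lists.fedoraproject.org.mbox.gz?start={start_year}-{start_month}-01&end={end_year}-{end_month}-01"

# Make sure the target directory exists, then download the archive into it.
os.makedirs("../data/", exist_ok=True)
wget.download(url, out="../data/")

# gzip -d replaces the .gz file with the decompressed mbox; check=True raises if gzip fails.
subprocess.run(["gzip", "-d", "../data/devel@lists.fedoraproject.org.mbox.gz"], check=True)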
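
If shelling out to `gzip` or installing `wget` is inconvenient inside a notebook, the same two steps can be sketched with the standard library only. This is an alternative under assumptions, not the approach used in the project: the helper names `build_export_url`, `gunzip_file`, and `fetch_mbox` are made up for illustration, and the URL pattern is copied from the snippet above.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URL pattern taken from the snippet above; {name} is the list address.
BASE = "https://lists.fedoraproject.org/archives/list/{name}/export/{name}.mbox.gz"


def build_export_url(list_name, start, end):
    """Build the export URL for a list between two YYYY-MM-DD dates."""
    return BASE.format(name=list_name) + f"?start={start}&end={end}"


def gunzip_file(gz_path, mbox_path):
    """Decompress gz_path into mbox_path, streaming in binary mode."""
    # Binary mode on both sides keeps the mbox bytes exactly as exported.
    with gzip.open(gz_path, "rb") as src, open(mbox_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(mbox_path)


def fetch_mbox(list_name, start, end, out_dir="../data"):
    """Download the gzipped export and return the path of the unzipped mbox."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    gz_path = out / f"{list_name}.mbox.gz"
    urllib.request.urlretrieve(build_export_url(list_name, start, end), str(gz_path))
    return gunzip_file(gz_path, out / f"{list_name}.mbox")
```

Calling `fetch_mbox("devel@lists.fedoraproject.org", "2020-10-01", "2020-11-01")` would cover the same range as the snippet above. Unlike `gzip -d`, this version leaves the `.gz` file in place next to the decompressed mbox.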

oindrillac commented 3 years ago

@cdolfi did the above solution work for you?

cdolfi commented 3 years ago

@oindrillac Some parts yes. I will be posting a WIP soon to update and look for feedback/suggestions

cdolfi commented 3 years ago

@MichaelClifford @oindrillac A new problem has developed from this: the mbox file I write back is, for some reason, not coming out properly formatted. If I try to open it as an .mbox I get a formatting error. Any suggestions on things to try?

MichaelClifford commented 3 years ago

@cdolfi can you add a link to the notebook/PR where you are running into this problem?

cdolfi commented 3 years ago

10 in the retrieve_mbox file. The problem comes from going from the unzipped file to the .mbox. I assume the issue is in the gunzip function.

oindrillac commented 3 years ago

@cdolfi can you elaborate on the issue that you are seeing as you try to convert from .gz to .mbox and share an example snippet? When I tried to run your notebook, I did not notice a difference in formatting between the file fetched manually and that generated from the notebook.
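
One way to make such a comparison concrete is to summarize each file and compare the results. The sketch below uses only the standard library; `summarize_mbox` is a hypothetical helper, and the three checks (first bytes, message count, checksum) are an assumption about what would be useful here, not something taken from the notebooks.

```python
import hashlib
import mailbox


def summarize_mbox(path):
    """Return basic facts that make formatting differences visible."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # A well-formed mbox file begins with a "From " separator line.
        head = f.read(5)
        digest.update(head)
        # Hash the rest in chunks so large archives are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    box = mailbox.mbox(path, create=False)
    try:
        count = len(box)
    finally:
        box.close()
    return {
        "starts_with_from": head == b"From ",
        "messages": count,
        "sha256": digest.hexdigest(),
    }
```

If the summaries of the manually downloaded file and the notebook-generated file are equal, the files are byte-identical and the decompression step is unlikely to be the cause of a formatting error.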

MichaelClifford commented 3 years ago

@cdolfi yeah, can you add some more details to the problem you are having? I just ran both notebooks to download and then parse some data and I did not get an error. It seems right to me. Here is an image of my output from parsing_mbox.ipynb.

cdolfi commented 3 years ago

It allows me to save it, but when I try to open the mbox from the bucket, this is the error that I get: Screenshot from 2020-11-23 14-28-31

oindrillac commented 3 years ago

@cdolfi I was also able to run both your notebooks (read the saved mbox from retrieve_mbox into parsing_mbox).

I think the unzipped mbox saved from retrieve_mbox is formatted correctly. I suspect an error in the way you are reading it from S3 in parsing_mbox. Can we discuss that in #5? Can you add a separate PR (from a different branch) for updates to parsing_mbox tackling issue #5?
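
To help rule the read path in or out, one option is to copy the object from the bucket to a local temporary file and parse that file with the standard `mailbox` module, so the bytes are never decoded as text in between. This is only a sketch under assumptions: it presumes a boto3-style client exposing `download_file(Bucket, Key, Filename)`, and `read_mbox_from_bucket` is a hypothetical name, not a function from parsing_mbox.

```python
import mailbox
import os
import tempfile


def read_mbox_from_bucket(s3_client, bucket, key):
    """Download an mbox object to a temporary file and return its subjects."""
    fd, tmp_path = tempfile.mkstemp(suffix=".mbox")
    os.close(fd)
    try:
        # Write the object to disk unchanged, then let mailbox parse it.
        s3_client.download_file(bucket, key, tmp_path)
        box = mailbox.mbox(tmp_path, create=False)
        try:
            return [msg["subject"] for msg in box]
        finally:
            box.close()
    finally:
        # Clean up the temporary copy whether or not parsing succeeded.
        os.remove(tmp_path)
```

If this returns the expected subjects while the notebook's own read fails, the difference between the two read paths is the place to look.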