datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Enhancement: Reading & analysing mailing lists in the LISTSERV16.5 format #404

Closed Christovis closed 3 years ago

Christovis commented 3 years ago

This PR suggests extending BigBang to read and analyze mailing lists in the LISTSERV 16.5 format. Examples that use this format are 3GPP and IEEE. The files are converted from LISTSERV 16.5 to the GNU Mailman format, and Python's standard library mailbox module does the rest (thus building on top of the analysis routines that are already present in BigBang).
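For readers unfamiliar with the standard-library side of this, here is a minimal sketch of writing and reading an mbox file with Python's mailbox module. It is not the code from this PR: the helper names (make_message, write_mbox, read_subjects) and the example.org addresses are made up for illustration, and the LISTSERV conversion step itself is not shown.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(sender, subject, body):
    """Build a small synthetic email (hypothetical helper, not BigBang API)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def write_mbox(path, messages):
    """Append the given messages to an mbox file, creating it if needed."""
    box = mailbox.mbox(path)
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.close()


def read_subjects(path):
    """Return the Subject header of every message in an existing mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mbox_path = os.path.join(tmp, "demo.mbox")
        write_mbox(
            mbox_path,
            [
                make_message("alice@example.org", "Hello", "First body."),
                make_message("bob@example.org", "Re: Hello", "Second body."),
            ],
        )
        print(read_subjects(mbox_path))
```

Once the data is in mbox form, anything that already consumes mbox files can in principle be reused, which is the motivation for converting rather than writing a separate analysis path.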

Notable changes: The Archive class now has classmethods to initialize it from one or more files or from a dataframe. Below is an example that initializes it from multiple files within one directory.

Archive.from_files(
    dir_paths=["../../archives/3GPP_TSG_SA_WG2_UPCON/"],
    file_names=["3GPP_TSG_SA_WG2_UPCON*"],
    email_list_software="LISTSERV",
)
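To illustrate the general pattern of classmethods as alternative constructors, here is a small self-contained sketch. MiniArchive is a hypothetical stand-in and not BigBang's Archive class; its from_files only collects matching paths via glob and does not parse any mail, and from_records is an invented name for this example.

```python
import glob
import os
import tempfile


class MiniArchive:
    """Hypothetical stand-in used only to show the constructor pattern."""

    def __init__(self, records):
        self.records = list(records)

    @classmethod
    def from_records(cls, records):
        """Alternative constructor: wrap already-loaded records."""
        return cls(records)

    @classmethod
    def from_files(cls, dir_paths, file_names):
        """Alternative constructor: match glob patterns in each directory."""
        paths = []
        for dir_path in dir_paths:
            for pattern in file_names:
                matches = glob.glob(os.path.join(dir_path, pattern))
                paths.extend(sorted(matches))
        return cls({"path": path} for path in paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create empty placeholder files; the names are invented.
        for name in ("DEMO_LIST.LOG0101", "DEMO_LIST.LOG0102", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        archive = MiniArchive.from_files([tmp], ["DEMO_LIST*"])
        print([os.path.basename(r["path"]) for r in archive.records])
```

Keeping __init__ minimal and putting each input format behind its own classmethod means a new source can be added without changing existing call sites.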

Please note that this PR also contains the pre-commit routine I suggested in a separate PR. Therefore there are also changes in other files.

sbenthall commented 3 years ago

Let's merge #403 and then come back to this. I think it will then be easier to see the changes specific to this PR. What I can't identify yet without a deeper review: are there automated tests for the new functionality? Ideally these would use some dummy data in LISTSERV format. It's a bit exotic, so I wouldn't even know what that looks like.
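As an illustration of what such a test could look like, here is a rough sketch using fully synthetic data. It assumes that a LISTSERV log separates messages with a line of 73 '=' characters followed by RFC 822-style headers; that layout, the split_listserv_log helper, and the example.org senders are assumptions for illustration and are not taken from this PR's listserv.py.

```python
import email

# Assumed message delimiter in a LISTSERV log file (not verified against 16.5).
SEPARATOR = "=" * 73

# Entirely synthetic content: no real names, addresses, or message text.
DUMMY_LOG = "\n".join(
    [
        SEPARATOR,
        "Date:         Mon, 1 Jan 2001 10:00:00 +0000",
        "From:         Alice Example <alice@example.org>",
        "Subject:      First dummy message",
        "",
        "Body of the first message.",
        SEPARATOR,
        "Date:         Mon, 1 Jan 2001 11:00:00 +0000",
        "From:         Bob Example <bob@example.org>",
        "Subject:      Re: First dummy message",
        "",
        "Body of the second message.",
        "",
    ]
)


def split_listserv_log(text):
    """Split a log into email messages at each separator line (hypothetical)."""
    messages = []
    current = []

    def flush():
        # Skip chunks that contain nothing but whitespace.
        if any(line.strip() for line in current):
            messages.append(email.message_from_string("\n".join(current)))

    for line in text.splitlines():
        if line.strip() == SEPARATOR:
            flush()
            current = []
        else:
            current.append(line)
    flush()
    return messages


def test_split_listserv_log():
    msgs = split_listserv_log(DUMMY_LOG)
    assert len(msgs) == 2
    assert msgs[0]["Subject"] == "First dummy message"
    assert msgs[1]["From"] == "Bob Example <bob@example.org>"
    assert msgs[0].get_payload().strip() == "Body of the first message."


if __name__ == "__main__":
    test_split_listserv_log()
    print("ok")
```

Because the fixture is generated in the test module itself, no archive containing real people's names or message content has to be committed to the repository.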

Christovis commented 3 years ago

Yep, I totally agree :+1: I was just too impatient and wanted to show Niels how things are atm. :-P Tests for the new listserv.py file have now been added.

sbenthall commented 3 years ago

Please see the comments on the code. Also, the code formatting PR #408 looks like it has conflicts with this PR? Please resolve the merge conflicts; when that's done, I'll test locally.

Christovis commented 3 years ago

I got a LISTSERV 16.5 scraper working in my local repo, which avoids the need to keep data containing people's names, email addresses, and message content for test purposes. Therefore I think it is best to cancel this PR. I hope to open a new PR asap with a unified routine to scrape, read, and save the various data formats of LISTSERV 16.5 mailing lists.

sbenthall commented 3 years ago

Ok, on your recommendation I will close this PR.

I look forward to reviewing the new one when it's ready! This is an exciting feature and I'm sure it will lead to some great work going forward.