datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

Mailman.py refactor #435

Closed Christovis closed 2 years ago

Christovis commented 3 years ago

At the moment, mailman.py contains functions that are either not specific to GNU Mailman or would be better placed in archive.py. Furthermore, certain arguments are ambiguous. This file should therefore be refactored, and along the way the mail collection/provenance code should be separated from the Mailman-specific crawling and parsing.

sbenthall commented 3 years ago

That sounds good. There is certainly a lot of room for improvement in the code.

I would recommend breaking this down into small incremental changes that can be more easily discussed, approved, and merged.

sbenthall commented 2 years ago

Ok, I've been reviewing these files.

I think the changes needed are:

Am I correct in thinking that the archive.py methods are all general to archives, whether they are retrieved using mailman.py, listserv.py, or other ingress-related modules?

Christovis commented 2 years ago

"the archive.py methods are all general to archives": this is what we want it to be. Currently, however, I have developed the Listserv environment independently of archive.py, and I will work on merging the two. This means merging the class ListservArchive into archive.py.

The methods in ListservArchive that mimic load_data and open_list_archives are:

I think one major difference between the approach of open_list_archives and the one taken in ListservArchive (using the from_* methods) is that the latter assumes the user knows the format of the source. The function therefore doesn't need to detect it, which makes arguments such as url less ambiguous (in open_list_archives, url can be three different things).
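
To make the contrast concrete, here is a minimal sketch of the explicit-constructor idea. It is not BigBang's actual code: the class name ListArchiveSketch, the method names from_url and from_directory, and the "one .txt file per message" layout are all assumptions made for illustration. The point is only that each constructor names its input type, so nothing has to be guessed from a single overloaded url argument.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ListArchiveSketch:
    """Hypothetical stand-in for ListservArchive; not BigBang's actual class."""

    name: str
    source: str  # records provenance: "url" or "directory"
    messages: List[str] = field(default_factory=list)

    @classmethod
    def from_url(cls, name: str, url: str) -> "ListArchiveSketch":
        # The caller states that `url` is a web address, so no format
        # detection is needed; anything else is rejected up front.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"expected an http(s) URL, got {url!r}")
        # Actual crawling is out of scope for this sketch.
        return cls(name=name, source="url")

    @classmethod
    def from_directory(cls, name: str, directory: str) -> "ListArchiveSketch":
        # The caller states that this is a local directory; the sketch
        # assumes one plain-text file per message (an illustrative layout).
        paths = sorted(Path(directory).glob("*.txt"))
        return cls(
            name=name,
            source="directory",
            messages=[p.read_text() for p in paths],
        )
```

With this shape, passing a local path to from_url fails immediately with a ValueError instead of being silently reinterpreted.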

sbenthall commented 2 years ago

I like this design, @Christovis, and maybe it's a solution to the problem raised by @npdoty here:

https://github.com/datactive/bigbang/pull/500#discussion_r753826294

Maybe it would make sense for Archive to be a more abstract class and for it to be subclassed for data sources like ListservArchive, MailmanArchive, etc.
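
A rough sketch of what that subclassing could look like, under stated assumptions: AbstractArchive, MailmanArchiveSketch, ListservArchiveSketch, parse, from_text, and senders are hypothetical names, and the two raw-text formats below are toy formats chosen for illustration, not the real Mailman or LISTSERV archive formats. Source-specific parsing lives in the subclasses, while analysis methods live once in the base class.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


def _parse_headers(block: str) -> Dict[str, str]:
    """Parse 'Key: value' lines into a dict with lowercased keys."""
    headers = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:
            headers[key.strip().lower()] = value.strip()
    return headers


class AbstractArchive(ABC):
    """Hypothetical base class: holds messages regardless of their source."""

    def __init__(self, messages: List[Dict[str, str]]):
        self.messages = messages

    @classmethod
    @abstractmethod
    def parse(cls, raw: str) -> List[Dict[str, str]]:
        """Turn source-specific raw text into a list of message dicts."""

    @classmethod
    def from_text(cls, raw: str) -> "AbstractArchive":
        return cls(cls.parse(raw))

    def senders(self) -> List[str]:
        # Source-independent analysis, written once for all subclasses.
        return sorted({m["from"] for m in self.messages if "from" in m})


class MailmanArchiveSketch(AbstractArchive):
    @classmethod
    def parse(cls, raw: str) -> List[Dict[str, str]]:
        # Toy assumption: messages are separated by a blank line.
        return [_parse_headers(b) for b in raw.split("\n\n") if b.strip()]


class ListservArchiveSketch(AbstractArchive):
    SEPARATOR = "=" * 73  # toy assumption: a line of '=' between messages

    @classmethod
    def parse(cls, raw: str) -> List[Dict[str, str]]:
        return [_parse_headers(b) for b in raw.split(cls.SEPARATOR) if b.strip()]
```

Because parse is abstract, AbstractArchive itself cannot be instantiated; each data source has to supply its own parser, and everything downstream only relies on the shared interface.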