Closed: Christovis closed this 2 years ago
That sounds good. There is certainly a lot of room for improvement in the code.
I would recommend breaking this down into small incremental changes that can be more easily discussed, approved, and merged.
Ok, I've been reviewing these files.
I think the changes needed are:
- Move load_data and open_list_archives to archive.py, and make sure their arguments are consistent with respect to using the name of an archive (possibly reducing a URL, if given, to a name).
- [...] collect_from_url on open_list_archives
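To make the "reduce a URL to a name" idea concrete, here is a minimal sketch. The helper name archive_name and the example URL are hypothetical and not part of the existing code; the assumption is that the list name is the last path segment of the archive URL.

```python
from urllib.parse import urlparse


def archive_name(name_or_url: str) -> str:
    """Hypothetical helper: return an archive name for a name or a URL.

    Assumption: for http(s) URLs, the list name is the last path segment.
    """
    parsed = urlparse(name_or_url)
    if parsed.scheme not in ("http", "https"):
        # Not a web URL, so treat the input as a plain archive name.
        return name_or_url.strip("/")
    # Drop any trailing slash, then keep the last path segment.
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    # Fall back to the host name if the URL has no path.
    return last_segment or parsed.netloc
```

With something like this, load_data and open_list_archives could both normalize their argument first, so callers may pass either form.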
Am I correct in thinking that the archive.py methods are all general to archives, whether they are retrieved using mailman.py, listserv.py, or other ingress-related modules?
"the archive.py methods are all general to archives": this is what we want it to be. Currently, however, I have developed the Listserv environment independently of archive.py, and I will work on merging the two. This means merging the class ListservArchive into archive.py.
The methods in ListservArchive that mimic load_data and open_list_archives are:
I think one major difference between the approach of open_list_archives and the one taken in ListservArchive (using the from_* methods) is that the latter assumes the user knows what format the source is in. The function therefore doesn't need to find that out, and arguments such as url have a less ambiguous meaning (in open_list_archives, url can be three different things).
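A rough sketch of the explicit-constructor idea, for illustration only. The class ListArchive, its fields, and the method signatures are hypothetical stand-ins, not the actual ListservArchive API, and nothing is downloaded or parsed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListArchive:
    """Hypothetical stand-in for an archive class (not BigBang's real API)."""

    name: str
    source: str
    messages: List[str] = field(default_factory=list)

    @classmethod
    def from_url(cls, name: str, url: str) -> "ListArchive":
        # The caller states that `url` is a remote location,
        # so no format detection is needed here.
        return cls(name=name, source=f"url:{url}")

    @classmethod
    def from_mbox(cls, name: str, filepath: str) -> "ListArchive":
        # The caller states that `filepath` points to a local mbox file.
        return cls(name=name, source=f"mbox:{filepath}")
```

Each constructor accepts exactly one kind of input, so an argument never has to be inspected to guess whether it is a URL, a file, or a directory.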
I like this design, @Christovis, and maybe it's a solution to the problem raised by @npdoty here:
https://github.com/datactive/bigbang/pull/500#discussion_r753826294
Maybe it would make sense for Archive to be a more abstract class and for it to be subclassed for data sources like ListservArchive, MailmanArchive, etc.
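One way the subclassing could look, as a hedged sketch only. The method names from_url and _fetch_messages are assumptions for illustration, and the fetch bodies are placeholders rather than real crawling code.

```python
from abc import ABC, abstractmethod
from typing import List


class Archive(ABC):
    """Source-agnostic base class (hypothetical sketch, not the current API)."""

    def __init__(self, name: str, messages: List[str]):
        self.name = name
        self.messages = messages

    @classmethod
    def from_url(cls, name: str, url: str) -> "Archive":
        # Shared flow: subclasses supply only the source-specific fetch.
        return cls(name, cls._fetch_messages(url))

    @staticmethod
    @abstractmethod
    def _fetch_messages(url: str) -> List[str]:
        """Crawl and parse one kind of list server."""

    def __len__(self) -> int:
        # Generic behaviour lives on the base class.
        return len(self.messages)


class ListservArchive(Archive):
    @staticmethod
    def _fetch_messages(url: str) -> List[str]:
        # Placeholder: a real version would crawl LISTSERV pages.
        return [f"listserv message from {url}"]


class MailmanArchive(Archive):
    @staticmethod
    def _fetch_messages(url: str) -> List[str]:
        # Placeholder: a real version would crawl Mailman pages.
        return [f"mailman message from {url}"]
```

Analysis code could then depend only on Archive, while the crawling and parsing details stay in the source-specific subclasses.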
At the moment, mailman.py contains functions that are either not GNU Mailman specific or better placed in archive.py. Furthermore, certain arguments are ambiguous. This file should therefore be refactored and, along the way, the mail collecting/provenance code should be separated from the Mailman-specific crawling/parsing.