datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

combine load_data and open_list_archives ? #512

Open sbenthall opened 2 years ago

sbenthall commented 2 years ago

https://github.com/datactive/bigbang/pull/500/files#r753826294

sbenthall commented 2 years ago

Reading over this code again, it seems like load_data and open_list_archives actually perform the same basic task, just that load_data only opens CSV files and open_list_archives looks for and opens MBOX files with various filenames. The mbox parameter in both doesn't seem to do much (for load_data it switches to open_list_archives and for open_list_archives it's just a flag for interpreting the path parameter as a single file name).

Could we have a single function that loads the already-collected mailing list archives for a particular list and returns it as a dataframe? And then internally bigbang can detect whether it's stored as a csv or a bunch of mbox files?
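A minimal sketch of what such a single entry point could look like, using only the standard library. The name `load_list_archive`, the helper `_message_to_record`, the column names, and the rule "a directory means a bunch of `.mbox` files" are all assumptions made for illustration; none of them is existing bigbang API. The returned list of dicts could be passed to `pandas.DataFrame(...)` to get the dataframe.

```python
import csv
import mailbox
from pathlib import Path


def _message_to_record(msg):
    """Flatten one message into a dict (hypothetical column names)."""
    payload = msg.get_payload()
    return {
        "From": msg["From"],
        "Subject": msg["Subject"],
        "Date": msg["Date"],
        "Message-ID": msg["Message-ID"],
        # Multipart payloads are lists; this sketch only keeps plain-text bodies.
        "Body": payload if isinstance(payload, str) else "",
    }


def load_list_archive(path):
    """Load an already-collected archive, detecting CSV vs. mbox storage.

    Hypothetical helper: `path` is assumed to be a .csv file, a single
    .mbox file, or a directory containing .mbox files.
    """
    path = Path(path)
    if path.is_dir():
        mbox_files = sorted(path.glob("*.mbox"))
        if not mbox_files:
            raise ValueError(f"no .mbox files found in {path}")
    elif path.suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    elif path.suffix == ".mbox":
        mbox_files = [path]
    else:
        raise ValueError(f"unrecognised archive format: {path}")

    records = []
    for mbox_file in mbox_files:
        box = mailbox.mbox(str(mbox_file), create=False)
        try:
            records.extend(_message_to_record(msg) for msg in box)
        finally:
            box.close()
    return records
```

With this shape the caller never passes an `mbox` flag; the storage format is an internal detail of the loader.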

How is the LISTSERV data stored locally after it has been collected, @Christovis ?

Does your LISTSERV code use either of these methods, or does it use its own alternative version of them?

Does the W3C scraper use mbox or csv format, finally?

Maybe we should settle on a canonical data format for email archives of all kinds.

Or, otherwise, we should maybe separate the storage of "raw" email data, when available, from the nicely preprocessed, schematized CSV format that we support deeper analysis on.

Christovis commented 2 years ago

> How is the LISTSERV data stored locally after it has been collected, @Christovis ?

After a mailing archive or list has been scraped and the individual messages are of type mboxMessage, they can be transformed and stored using the following functions: to_dict(), to_pandas_dataframe(), to_mbox(). Thus they can be stored as pickle, hdf, csv, and mbox files.
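As a rough illustration of these conversions, here is a standard-library sketch that works on `mailbox.mboxMessage` objects. The function bodies and column names are assumptions for illustration and are not the actual bigbang implementations; the column-wise dict returned by `to_dict()` is the kind of structure that `pandas.DataFrame(...)` accepts directly, which is roughly what a `to_pandas_dataframe()` would wrap.

```python
import mailbox


def to_dict(messages):
    """Collect selected fields column-wise: {column: [value, ...]}.

    Assumes non-multipart messages, so get_payload() returns a string.
    """
    columns = {"from": [], "subject": [], "date": [], "body": []}
    for msg in messages:
        columns["from"].append(msg["From"])
        columns["subject"].append(msg["Subject"])
        columns["date"].append(msg["Date"])
        columns["body"].append(msg.get_payload())
    return columns


def to_mbox(messages, filepath):
    """Append the messages to an mbox file, creating it if needed."""
    box = mailbox.mbox(filepath)
    try:
        for msg in messages:
            box.add(msg)
    finally:
        box.close()  # close() flushes pending changes to disk


def from_mbox(filepath):
    """Read every message of an existing mbox file into a list."""
    box = mailbox.mbox(filepath, create=False)
    try:
        return list(box)
    finally:
        box.close()
```

Writing with `to_mbox()` and reading back with `from_mbox()` round-trips the headers, which is what makes mbox usable as the "raw" storage format next to a derived CSV.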

> Does your LISTSERV code use either of these methods, or does it use its own alternative version of them?

The listserv implementation uses neither load_data nor open_list_archives, as those implementations have issues.

> Does the W3C scraper use mbox or csv format, finally?

I have started refactoring the w3crawler.py class to adopt the class and function structure of listserv. This will ultimately make it possible to create an abstract class that avoids code duplication.

sbenthall commented 2 years ago

Thanks @Christovis . This all makes a lot of sense. Another question: How does the LISTSERV functionality load the data from csv or mbox?

If you think the design you've used in LISTSERV is good we could move everything over to that way.

Christovis commented 2 years ago

> How does the LISTSERV functionality load the data from csv or mbox?

It is assumed that users can identify for themselves the file format in which the data is stored and use the appropriate function to load the data into memory. Thus, if the data is stored in a format like the one @nllz obtained from someone at 3GPP, then the from_mailing_lists() or from_listserv_directory() functions can be used (actually, I think these function names are ill-suited and I should find more adequate ones). If the data is inside .mbox files, one can also use from_mbox(). Otherwise, it is assumed that other Python packages are used to load the data. Thus, if it is in .csv, use pandas.read_csv() to read the data and then, e.g., ListservArchive.from_pandas_dataframe() to create a class instance from it.

> If you think the design you've used in LISTSERV is good we could move everything over to that way.

I have tried to break the code up into small pieces so that it is easier to understand, easier to unit test, and therefore less likely to break. In that sense I think it is good to adopt this scheme. However, I also know that while refactoring W3C into the same format I will definitely find new ways to improve the code in certain (hopefully minor) respects.

sbenthall commented 2 years ago

How much of the ListservArchive functionality is specific to Listserv-originating data, and how much of it could be used on any email data stored in csv or mbox?

Christovis commented 2 years ago

At the moment I believe we could have:

  1. MessageParser(ABC) that can contain functions like from_url(), create_email_message(), get_datetime(), to_dict(), to_pandas_dataframe(), to_mbox() such that they don't need to be duplicated. On the other hand, functions like _get_headerl() or _get_body() are too specific to the mail archive managing tool and can't be easily abstracted out.

  2. List(ABC) can have functions like from_url(), from_messages(), from_mbox(), from_listserv_files(), from_listserv_directories(), to_dict(), to_pandas_dataframe(), to_mbox(), __len__(), __iter__(), __getitem__(). But anything else, like get_message_urls(), is too specific and needs to be implemented for 3GPP and W3C separately.

  3. Archive(ABC) which is similar to List(ABC).

This idea is implemented in PR #534.
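For readers who have not looked at the PR, the split between shared and tool-specific methods might look roughly like the sketch below. It is only an illustration of the abstract-base-class pattern described above: the class names `AbstractList` and `ExampleList`, the constructor signature, and the URL pattern in the subclass are hypothetical and are not taken from #534.

```python
import mailbox
from abc import ABC, abstractmethod


class AbstractList(ABC):
    """Shared behaviour for one mailing list, whatever tool serves it."""

    def __init__(self, name, messages):
        self.name = name
        self.messages = list(messages)

    @classmethod
    def from_messages(cls, name, messages):
        return cls(name, messages)

    @classmethod
    def from_mbox(cls, name, filepath):
        box = mailbox.mbox(filepath, create=False)
        try:
            return cls(name, list(box))
        finally:
            box.close()

    def to_mbox(self, filepath):
        box = mailbox.mbox(filepath)
        try:
            for msg in self.messages:
                box.add(msg)
        finally:
            box.close()

    def __len__(self):
        return len(self.messages)

    def __iter__(self):
        return iter(self.messages)

    def __getitem__(self, index):
        return self.messages[index]

    @abstractmethod
    def get_message_urls(self, list_url):
        """Tool-specific: every archive tool lays out its pages differently."""


class ExampleList(AbstractList):
    """Hypothetical subclass; the URL scheme below is made up."""

    def get_message_urls(self, list_url):
        return [f"{list_url}/msg{i}.html" for i in range(len(self))]
```

The container methods and the mbox round-trip live once in the base class, while each concrete crawler only has to supply the parts that depend on the archive tool.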

Christovis commented 2 years ago

@sbenthall, given the new code structure that is emerging (at least for 3GPP, W3C, IEEE), is this issue still relevant or does it need to be rephrased?

sbenthall commented 2 years ago

Well, I think it would be best if the Mailman code were refactored to fit into the new code structure from #534.

I'm not sure what that means for load_data or open_list_archives yet. It will require a deep dive.