datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

generalize get_list_name ? #501

Open sbenthall opened 2 years ago

sbenthall commented 2 years ago

There's a small function that strips a full Mailman archive URL down to its last part, which is then used as the mailing list 'name':

https://github.com/datactive/bigbang/blob/main/bigbang/mailman.py#L185-L198
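For readers who don't follow the link, the behaviour described above can be sketched roughly as follows. This is not the code in mailman.py; the function name `short_name_from_url` and the example URL are hypothetical, and the only assumption is that the list name is the last non-empty path segment.

```python
from urllib.parse import urlparse


def short_name_from_url(url: str) -> str:
    """Hypothetical helper: return the last non-empty path segment of a URL."""
    path = urlparse(url).path
    # Drop empty segments so a trailing slash does not yield an empty name.
    segments = [segment for segment in path.split("/") if segment]
    # Fall back to the input when the URL has no path segments at all.
    return segments[-1] if segments else url


print(short_name_from_url("https://lists.example.org/pipermail/some-list/"))
```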

This gets used somewhat widely; as of #500 there's a reference to it in archive.py

That raises the question of whether this function should be generalized to cover other ingress methods, such as W3C and Listserv.

npdoty commented 2 years ago

Yeah, I think it would make sense to generalize the name function, and use it for those other lists as well. (Maybe we need a list of regexps that work for different email archive systems? Or, one day, a way for a new ingress system to register a function that recognizes if a URL is likely to be one of their mailing lists and return various metadata about it.)
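A minimal sketch of the "list of regexps" idea, to make it concrete. Everything here is hypothetical: `NAME_PATTERNS` and `guess_list_name` are not existing BigBang names, and the two URL shapes (a `/pipermail/` path for Mailman, an `/Archives/Public/` path for W3C) are assumptions about typical archive layouts rather than anything BigBang guarantees.

```python
import re
from typing import Optional, Tuple

# Hypothetical table of (archive system, pattern). The URL shapes are assumptions.
NAME_PATTERNS = [
    ("mailman", re.compile(r"/pipermail/(?P<name>[^/?#]+)")),
    ("w3c", re.compile(r"/Archives/Public/(?P<name>[^/?#]+)")),
]


def guess_list_name(url: str) -> Optional[Tuple[str, str]]:
    """Return (system, list name) for the first matching pattern, else None."""
    for system, pattern in NAME_PATTERNS:
        match = pattern.search(url)
        if match:
            return system, match.group("name")
    return None


print(guess_list_name("https://lists.w3.org/Archives/Public/public-privacy/"))
```

A new ingress system would add one entry to the table. A registry of recognizer functions would follow the same loop, with callables in place of compiled patterns.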

Typically a short-name is handy because we might want to save the files to a certain directory and re-load them from there, be able to refer to a list in your code without typing the full URL, etc.

But I can also see how there might eventually be problems: these short names are not going to be globally unique, whereas list archive URLs or list email addresses would be less ambiguous.

Christovis commented 2 years ago

Yep, agreed, we should find a way to generalise this method and maybe place it in utils.py. For Listserv mailing lists I have a function, ListservList.get_name_from_url(mlist_url), that gets the list name from a URL. But the way that is done may currently be unique to Listserv, as the URL structure is always like https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2 or https://list.etsi.org/scripts/wa.exe?A0=3GPP_CT89_E_MEETING.
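Given that URL structure, the name can be read from the `A0` query parameter. The sketch below is not the body of ListservList.get_name_from_url; `listserv_name_from_url` is a hypothetical stand-in that only assumes the name is always carried in `A0`.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def listserv_name_from_url(url: str) -> Optional[str]:
    """Hypothetical helper: read the list name from the A0 query parameter."""
    values = parse_qs(urlparse(url).query).get("A0")
    # parse_qs returns a list per key; take the first value if the key exists.
    return values[0] if values else None


print(listserv_name_from_url("https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_SA_WG2"))
```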

sbenthall commented 2 years ago

What if:

On the other hand, I feel like the notebook workflow that this 'short name' stuff was intended to support is increasingly old-fashioned and not how BigBang is currently being used.

I wouldn't mind officially deprecating a lot of the old notebooks and trying to come up with a better workflow.

npdoty commented 2 years ago

I'm not sure I want subdirectories based on method/source as that isn't always consistent across a project or an SDO even.

Could we use the email address of the list as the directory name? Does ietf@ietf.org cause any problems as a directory name? Can archive ingest code always determine the email address of the list?

The email address should generally be unique and descriptive. And list archive URLs can vary over time, but the mailing list email address itself is unlikely to.
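On the directory-name question, a quick way to check is to try it in a temporary directory. In the sketch below, `list_directory` is a hypothetical helper, not part of BigBang. It assumes the only characters that need rejecting are path separators, since `@` and `.` are ordinary filename characters on common filesystems.

```python
import tempfile
from pathlib import Path


def list_directory(base: Path, list_address: str) -> Path:
    """Hypothetical helper: create and return base/<list email address>."""
    # Reject values that would escape or alias the base directory.
    if list_address in {"", ".", ".."} or "/" in list_address or "\\" in list_address:
        raise ValueError(f"unsuitable directory name: {list_address!r}")
    directory = base / list_address
    directory.mkdir(parents=True, exist_ok=True)
    return directory


with tempfile.TemporaryDirectory() as tmp:
    created = list_directory(Path(tmp), "ietf@ietf.org")
    print(created.name, created.is_dir())
```

This only covers the filesystem side. Whether the ingest code can always determine the list's address is a separate question that the sketch does not address.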

I agree that it's fine to deprecate some older notebooks or styles.

sbenthall commented 2 years ago

I like the idea of using the email address of a list as its directory name. At this time I don't know the answers to your other questions.