datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

domains dataset, used in multiple notebooks #528

Closed sbenthall closed 2 years ago

sbenthall commented 2 years ago

This PR models a solution to #509 by putting the known email domain categorization metadata in a datasets section of the bigbang module. The metadata can now potentially be documented in the autodocs published to the Read the Docs (RTD) website. The same datasets submodule can then be used across multiple notebooks.

For this PR, some remaining work:

The idea is that this models what can be done with other data we package with BigBang, such as the organizational metadata collected by @nllz.

sbenthall commented 2 years ago

See also #437

sbenthall commented 2 years ago

This PR now has docs and notebooks with a respectable sample of the IETF's archives for demonstration. It's ready to be reviewed and, in my view, merged.

Christovis commented 2 years ago

The new file datasets/domains/all_email_provider_domains.txt contains all email domains found in the IETF datatracker, correct? I guess I should scan over them and add any that appear in 3GPP but are currently missing?

sbenthall commented 2 years ago

> The new file datasets/domains/all_email_provider_domains.txt contains all email domains found in the IETF datatracker, correct? I guess I should scan over them and add any that appear in 3GPP but are currently missing?

No, it does not contain all those domains. It contains only domains that have been identified as generic email address domains. These get combined with another list of domains of known categories (which is much shorter).
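To make the combination step concrete, here is a minimal sketch of how a list of generic email provider domains might be merged with a shorter mapping of domains to known categories. This is not BigBang's actual API: the function names `combine_domain_lists` and `categorize`, the `"generic"` label, the sample domains, and the rule that an explicit category overrides the generic label are all assumptions made for illustration.

```python
def combine_domain_lists(generic_domains, categorized_domains, generic_label="generic"):
    """Merge generic provider domains with a domain -> category mapping.

    Hypothetical helper, not part of BigBang. Domains are normalized to
    lowercase with surrounding whitespace removed; blank entries are skipped.
    """
    combined = {}
    for domain in generic_domains:
        domain = domain.strip().lower()
        if domain:
            combined[domain] = generic_label
    # Assumption: an explicitly categorized domain overrides the generic label.
    for domain, category in categorized_domains.items():
        combined[domain.strip().lower()] = category
    return combined


def categorize(address, table):
    """Return the category for an email address, or None if its domain is unknown."""
    domain = address.rsplit("@", 1)[-1].strip().lower()
    return table.get(domain)


# Illustrative sample data only; the real lists ship with the PR.
generic = ["gmail.com", " Yahoo.com ", ""]
known = {"example.edu": "academic"}

table = combine_domain_lists(generic, known)
print(categorize("Alice@Gmail.com", table))
print(categorize("bob@unknown.org", table))
```

An address whose domain appears in neither list yields `None`, which keeps "not yet categorized" distinct from "generic provider".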

It's a good point that I should document this more carefully. I'll do that before merging.