datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

consistent data storage and access #509

Closed sbenthall closed 2 years ago

sbenthall commented 2 years ago

BigBang now includes data as part of its repository: tables of labeled organizations, labeled email domains, lists of mailing list URLs, etc.

(I'm not talking about PII and email data here. I'm talking about the other stuff.)

Currently, this material is scattered ad hoc through the examples/ directory, with each piece in a subdirectory associated with a particular project or notebook.

But as we integrate these data sets, the directory distinctions within examples/ stop making sense.

Moreover, it is best practice in these cases to provide a programmatic interface to the data (not just a flat CSV file) so that it can be more easily documented.

Consider this way of handling data within a git repository: https://github.com/econ-ark/HARK/tree/master/HARK/datasets

In that project, datasets is a submodule of the Python package. It contains further submodules for different types of data, along with methods for programmatic access. These Python methods can all be documented with docstrings, and that documentation then shows up on ReadTheDocs as API documentation.
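To make the idea concrete, here is a minimal sketch of what such an accessor could look like if it lived in a datasets submodule. Everything named here is an assumption for illustration: the file name `organization_categories.csv`, the function names `read_table` and `load_organization_categories`, and the column names in the usage example are hypothetical and are not taken from BigBang or HARK.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List, Optional

# Hypothetical file name, used only for illustration. A real package would
# more likely resolve this relative to the datasets submodule itself.
DEFAULT_CSV = Path("organization_categories.csv")


def read_table(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries.

    Parameters
    ----------
    lines : iterable of str
        An open text file, or any other iterable yielding CSV lines.
    """
    return [dict(row) for row in csv.DictReader(lines)]


def load_organization_categories(
    path: Optional[Path] = None,
) -> List[Dict[str, str]]:
    """Return the labeled organizations table as a list of dictionaries.

    Parameters
    ----------
    path : Path, optional
        Location of the CSV file. Defaults to the hypothetical DEFAULT_CSV.

    Because callers go through this function instead of opening the CSV
    directly, the docstring is the single place where the data's origin
    and columns are described, and it can be picked up by API docs.
    """
    csv_path = Path(path) if path is not None else DEFAULT_CSV
    with csv_path.open(newline="", encoding="utf-8") as handle:
        return read_table(handle)
```

A notebook would then call the loader instead of hard-coding a path into examples/. The parsing step can be tried without any file on disk:

```python
rows = read_table(["name,category\n", "Example Org,hypothetical label\n"])
print(rows[0]["category"])
```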

sbenthall commented 2 years ago

#528 models how to package a dataset as part of the BigBang repository.

I think it would be best to do something similar with the organization categories. Related to #532.