NSAPH-Projects / space

SpaCE, the Spatial Confounding Environment, loads benchmark datasets for causal inference methods tackling spatial confounding.
https://nsaph-projects.github.io/space/
MIT License

Create masterfile for available datasets #59

Closed · mauriciogtec closed this 1 year ago

mauriciogtec commented 1 year ago

Adds a datamaster class that lists the available datasets and provides access to information about each one (URL, metadata, etc.). The main functionality is in datasets/datamaster.py. There are two key files: datasets/masterfile.csv and datasets/collectionts.csv. Fixes #58
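
For orientation, a minimal sketch of what such a class might look like; the `name` index column, the default path, and the method names are illustrative assumptions, not necessarily the PR's actual schema:

```python
import pandas as pd


class DataMaster:
    """Lists available datasets and exposes per-dataset metadata.

    Sketch only: column names and paths are assumptions, not the PR's schema.
    """

    def __init__(self, masterfile: str = "datasets/masterfile.csv"):
        # Index the masterfile by dataset name so lookups are by label.
        self.master = pd.read_csv(masterfile, index_col="name")

    def list_datasets(self) -> list[str]:
        """Names of all datasets listed in the masterfile."""
        return self.master.index.tolist()

    def __getitem__(self, name: str) -> pd.Series:
        """Metadata row (e.g. url) for one dataset."""
        return self.master.loc[name]
```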

codecov-commenter commented 1 year ago

Codecov Report

Merging #59 (8fa5f37) into dev (50a0c7d) will increase coverage by 7.52%. The diff coverage is 96.55%.


```diff
@@            Coverage Diff             @@
##              dev      #59      +/-   ##
==========================================
+ Coverage   47.50%   55.02%   +7.52%
==========================================
  Files           4        6       +2
  Lines         160      189      +29
==========================================
+ Hits           76      104      +28
- Misses         84       85       +1
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 55.02% <96.55%> (+7.52%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| spacebench/datasets/datasets.py | 42.37% <ø> (ø) |
| tests/test_datamaster.py | 93.75% <93.75%> (ø) |
| spacebench/datasets/datamaster.py | 100.00% <100.00%> (ø) |

atrisovic commented 1 year ago

Maybe the masterfile CSVs can also be on DV (Dataverse), so we maintain datasets in a single location instead of multiple locations.
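
A rough sketch of that idea: read the masterfile from a hosted copy at runtime instead of shipping it with the package. The URL below is a placeholder, not the project's real Dataverse location:

```python
import pandas as pd

# Placeholder: a hosted copy of the masterfile (e.g. a Dataverse file
# download link), NOT the project's actual location.
MASTERFILE_URL = "https://dataverse.example.org/api/access/datafile/12345"


def load_masterfile(url: str = MASTERFILE_URL) -> pd.DataFrame:
    """Read the dataset index from its single hosted location at runtime."""
    return pd.read_csv(url)
```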

mauriciogtec commented 1 year ago

@atrisovic I am fine with the data being on DV later on, as long as we can write code that accesses it and checks against it every time. But I don't want to hard-code logic about which dataset to download anymore (as it is now) based on the patterns of existing files, since that won't scale.

These lines are an example: https://github.com/NSAPH-Projects/space/blob/7b402aba31edf3c78eb4accdc8f56c46afda0a4b/spacebench/api/dataverse.py#L42-L43
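
The alternative being argued for, roughly: download logic consults the masterfile rather than inferring behavior from file-name patterns. A sketch under the assumption that the masterfile is indexed by dataset name and has a `url` column (both assumptions, not confirmed by the PR):

```python
import pandas as pd


def resolve_download_url(master: pd.DataFrame, name: str) -> str:
    """Look the dataset up in the masterfile instead of pattern-matching
    on existing file names; the 'url' column is an assumption."""
    if name not in master.index:
        raise KeyError(f"dataset {name!r} is not listed in the masterfile")
    return str(master.loc[name, "url"])
```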