API for loading or fetching data for development, benchmarking and testing

evanroyrees commented 4 years ago

Intended behavior

Fetch or load data through an API modeled on scikit-learn's dataset utilities. This would provide utilities to load, fetch, and/or download toy and real-world datasets. Each dataset loaded via the API should first pass a file integrity check that compares checksums.
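The integrity check could be sketched roughly as follows. This is only an illustration, assuming a SHA-256 digest is recorded for each dataset file; the names `file_checksum` and `verify_checksum` are hypothetical and not part of Autometa.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large metagenomes are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Return the path if its digest matches `expected`, else raise ValueError."""
    observed = file_checksum(path, algorithm=algorithm)
    if observed != expected.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {observed}"
        )
    return Path(path)
```

A loader would call `verify_checksum` right after downloading and before parsing, so a truncated or corrupted download fails early with a clear message.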

Some example load/fetch/download functions may include:

| function | dataset | paper |
| --- | --- | --- |
| `load_random(...)` | load random number of genomes fetched from NCBI | N/A |
| `load_simulated(...)` | load Autometa simulated community | DOI: 10.1093/nar/gkz148 |
| `load_sharon(...)` | load Sharon dataset | DOI: 10.1101/gr.142315.112 |
| `load_synthetic(...)` | load Autometa synthetic community | DOI: 10.1093/nar/gkz148 |
| `load_environmental(...)` | load Autometa environmental community | DOI: 10.1093/nar/gkz148 |
| `load_cami(...)` | load CAMI challenge dataset | DOI: 10.1186/s12859-020-03667-3 |

Example API usage

Example usage could look something like this:

# import load function
from autometa.validation import datasets
# load into a data object
data = datasets.load_synthetic("MIX51")

The `data` object in this case should also contain some metadata (similar to the sklearn API).
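One possible shape for such an object is sketched below. The `Dataset` class, its field names, and the placeholder file path are all assumptions made for illustration; only the community name `MIX51` and the DOI come from this thread.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class Dataset:
    """Hypothetical container returned by a load_* function."""

    name: str                # community identifier, e.g. "MIX51"
    metagenome: Path         # location of the assembly on disk
    description: str = ""    # human-readable summary of the dataset
    doi: str = ""            # publication describing the dataset, if any
    checksums: Dict[str, str] = field(default_factory=dict)  # filename -> digest


# Placeholder path: where files actually live is up to the implementation.
data = Dataset(
    name="MIX51",
    metagenome=Path("MIX51/metagenome.fna.gz"),
    description="Autometa synthetic community",
    doi="10.1093/nar/gkz148",
)
print(data.name, data.doi)
```

Keeping the metadata on the returned object means downstream benchmarking code can report which dataset and publication a result refers to without a separate lookup.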

evanroyrees commented 3 years ago

We have uploaded some test datasets to a publicly shared Google Drive. The links are below:

Simulated communities: drive folder
Synthetic communities: drive folder

evanroyrees commented 3 years ago

A simple helper script could be written that uses gdown to download the files to a user-provided output directory.

A simple example:

# contents of Autometa/autometa/datasets.py
import gdown
# ... 
# ...
# NOTE: dictionary of communities with their respective metagenome.fna.gz file IDs
simulated = {
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y"
}
file_id = ...
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, args.output)

I've skipped a few details above, as this may be nice practice for someone.

👀 @ajlail98

When this is finished, we can set up an entry point along the lines of:

autometa-download-dataset --community 78 --output <some/directory/where/I/want/to/put/these/files>
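For anyone picking this up, the helper and command above might fit together roughly like this. It is a sketch under assumptions, not a finished implementation: `build_url`, `parse_args`, and `main` are hypothetical names, and only community `78` is registered because that is the only file ID given in this thread.

```python
import argparse

# Google Drive file IDs keyed by community; only "78" appears in this thread.
SIMULATED = {"78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y"}


def build_url(community):
    """Return the Google Drive download URL for a registered community."""
    try:
        file_id = SIMULATED[community]
    except KeyError:
        raise ValueError(
            f"Unknown community {community!r}; choices: {sorted(SIMULATED)}"
        ) from None
    return f"https://drive.google.com/uc?id={file_id}"


def parse_args(argv=None):
    """Parse the --community and --output options shown in the command above."""
    parser = argparse.ArgumentParser(
        prog="autometa-download-dataset",
        description="Download an Autometa test dataset from Google Drive.",
    )
    parser.add_argument("--community", required=True, choices=sorted(SIMULATED))
    parser.add_argument("--output", required=True, help="Where to write the download")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Third-party dependency, imported here so the helpers above work without it.
    import gdown

    # --output is handed to gdown unchanged, as in the snippet earlier in the
    # thread; how gdown treats a directory versus a filename is not checked here.
    gdown.download(build_url(args.community), args.output)
```

The command name could then be registered as a `console_scripts` entry point that targets `main`; the exact module path depends on where the script ends up in the package.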
