KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io

API for loading or fetching data for development, benchmarking and testing #110

Open evanroyrees opened 4 years ago

evanroyrees commented 4 years ago

Intended behavior

Fetch or load data through an API modeled on scikit-learn's dataset API. This would provide utilities to load, fetch and/or download toy and real-world datasets. Each dataset loaded via the API should first pass a file integrity check that compares checksums.
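The integrity check could be sketched roughly as below. This is only an illustration: the choice of SHA-256 and the helper names `sha256sum` and `verify_checksum` are assumptions, not something already in Autometa.

```python
import hashlib
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large metagenome files are never held fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Return `path` as a Path if its digest matches `expected`, else raise ValueError."""
    actual = sha256sum(path)
    if actual != expected.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
    return Path(path)
```

A loader would call `verify_checksum` on each downloaded file before handing anything back to the user, with the expected digests stored alongside the dataset definitions.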

Some example load/fetch/download functions may include:

| function | dataset | paper |
| --- | --- | --- |
| load_random(...) | load random number of genomes fetched from NCBI | N/A |
| load_simulated(...) | load Autometa simulated community | DOI: 10.1093/nar/gkz148 |
| load_sharon(...) | load Sharon dataset | DOI: 10.1101/gr.142315.112 |
| load_synthetic(...) | load Autometa synthetic community | DOI: 10.1093/nar/gkz148 |
| load_environmental(...) | load Autometa environmental community | DOI: 10.1093/nar/gkz148 |
| load_cami(...) | load CAMI challenge dataset | DOI: 10.1186/s12859-020-03667-3 |

Example API usage

Example usage could look something like this:

# import load function
from autometa.validation import datasets
# load into a data object
data = datasets.load_synthetic("MIX51")

The data object in this case should also contain some metadata (similar to the sklearn API).
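One possible shape for such a data object is sketched below. Everything here is hypothetical: the `Dataset` class, its fields, and the on-disk layout under `data_dir` are assumptions for illustration. Only the function name `load_synthetic`, the `metagenome.fna.gz` file name, and the DOI come from this thread.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class Dataset:
    """Hypothetical container pairing dataset files with descriptive metadata."""

    name: str
    files: Dict[str, Path]
    metadata: Dict[str, str] = field(default_factory=dict)


def load_synthetic(community, data_dir="."):
    """Describe a synthetic community stored under an assumed directory layout."""
    # Assumed layout: <data_dir>/synthetic/<community>/metagenome.fna.gz
    root = Path(data_dir) / "synthetic" / community
    return Dataset(
        name=community,
        files={"metagenome": root / "metagenome.fna.gz"},
        metadata={
            "description": "Autometa synthetic community",
            "doi": "10.1093/nar/gkz148",
        },
    )
```

With this sketch, `datasets.load_synthetic("MIX51").metadata["doi"]` would give the citation for the dataset, much like the `DESCR` attribute on scikit-learn's returned objects. The sketch does not check that the files exist.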

evanroyrees commented 3 years ago

We have uploaded some test datasets to a publicly shared Google Drive. The links are below:

simulated communities: drive folder
synthetic communities: drive folder

evanroyrees commented 3 years ago

A simple helper script could be written that uses gdown to download to the user-provided output directory.

A simple example:

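# contents of Autometa/autometa/datasets.py
import gdown
# ...
# ...
# NOTE: dictionary of communities with their respective metagenome.fna.gz file IDs
simulated = {
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y"
}
# look up the file ID for the requested community (intentionally left out here)
file_id = ...
url = f"https://drive.google.com/uc?id={file_id}"
# args.output is the user-provided output path; argument parsing is omitted
gdown.download(url, args.output)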

I've skipped a few details above, as this may be nice practice for someone.
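For anyone picking this up, one way the skipped lookup and download steps might fit together is sketched below. The helper names `build_url` and `download_community` and the fixed output file name are assumptions. The file ID and URL pattern are the ones from the snippet above, and `gdown` is imported lazily so the lookup can be exercised without it installed.

```python
import os

# Communities mapped to their metagenome.fna.gz file IDs (only "78" is known from this thread).
SIMULATED = {
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y",
}


def build_url(community):
    """Return the Google Drive download URL for a simulated community."""
    try:
        file_id = SIMULATED[community]
    except KeyError:
        raise ValueError(
            f"Unknown community {community!r}; choose from {sorted(SIMULATED)}"
        ) from None
    return f"https://drive.google.com/uc?id={file_id}"


def download_community(community, outdir):
    """Download the community's metagenome into `outdir` and return the file path."""
    import gdown  # third-party dependency, imported only when a download is requested

    url = build_url(community)  # validate the community before touching the filesystem
    os.makedirs(outdir, exist_ok=True)
    output = os.path.join(outdir, "metagenome.fna.gz")  # assumed output file name
    gdown.download(url, output, quiet=False)
    return output
```

Keeping the URL construction separate from the network call makes the mapping easy to unit test without hitting Google Drive.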

👀 @ajlail98

When this is finished, we can set up an entry point along the lines of:

autometa-download-dataset --community 78 --output <some/directory/where/I/want/to/put/these/files>
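A rough sketch of the argument parsing behind such an entry point, using only the two flags shown in the command above. The function names `build_parser` and `main` are hypothetical, and the download call itself is left as a placeholder.

```python
import argparse


def build_parser():
    """Build the command-line parser for the proposed autometa-download-dataset command."""
    parser = argparse.ArgumentParser(
        prog="autometa-download-dataset",
        description="Download an Autometa test dataset to a local directory.",
    )
    parser.add_argument(
        "--community",
        required=True,
        help="Identifier of the community to download, e.g. 78",
    )
    parser.add_argument(
        "--output",
        required=True,
        help="Directory where the downloaded files should be written",
    )
    return parser


def main(argv=None):
    """Parse arguments; the actual download step is intentionally not implemented here."""
    args = build_parser().parse_args(argv)
    # Placeholder: a real implementation would call the gdown-based download helper here.
    print(f"Would download community {args.community} to {args.output}")
    return args
```

The command name could then be mapped to `main` through the package's console script configuration, so that installing Autometa makes `autometa-download-dataset` available on the PATH.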