evanroyrees opened this issue 4 years ago
We have uploaded some test datasets to a publicly shared Google Drive. The links are below:

simulated communities: drive folder
synthetic communities: drive folder

A simple helper script could be written that uses gdown to download these to a user-provided output directory.
A simple example:
```python
# contents of Autometa/autometa/datasets.py
import argparse

import gdown

# NOTE: dictionary of communities with their respective metagenome.fna.gz file IDs
simulated = {
    "78": "15CB8rmQaHTGy7gWtZedfBJkrwr51bb2y",
}

parser = argparse.ArgumentParser(description="Download Autometa test datasets")
parser.add_argument("--community", choices=list(simulated), required=True)
parser.add_argument("--output", required=True, help="where to write the downloaded file")
args = parser.parse_args()

# Look up the community's Google Drive file ID and download it
file_id = simulated[args.community]
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, args.output)
```
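Run as a script, the sketch above could be exercised like so (the output filename here is just an example):

```shell
python autometa/datasets.py --community 78 --output 78.metagenome.fna.gz
```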
I've skipped a few details above, as filling them in may be nice practice for someone.
👀 @ajlail98
When this is finished, we can set up an entrypoint, something along the lines of:

```
autometa-download-dataset --community 78 --output <some/directory/where/I/want/to/put/these/files>
```
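For reference, one way to register such an entrypoint, assuming the download logic above is wrapped in a `main()` function in `autometa/datasets.py` and that we use setuptools console scripts, would be:

```python
# setup.py (sketch) -- assumes autometa/datasets.py wraps its logic in main()
from setuptools import setup

setup(
    name="autometa",
    # ...
    entry_points={
        "console_scripts": [
            "autometa-download-dataset = autometa.datasets:main",
        ],
    },
)
```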
Intended behavior
Fetch or load data via an API similar to scikit-learn's. This would provide a number of utilities to load, fetch, and/or download toy and real-world datasets. Each dataset loaded via the API should first pass a file integrity check by comparing checksums.
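A minimal sketch of that integrity check, assuming each dataset ships with a known MD5 checksum (where the checksums are published is not decided here):

```python
import hashlib

def checksum_matches(fpath, expected_md5):
    """Return True if fpath's MD5 digest matches the expected checksum."""
    md5 = hashlib.md5()
    with open(fpath, "rb") as fh:
        # Read in chunks so large metagenomes don't need to fit in memory
        for chunk in iter(lambda: fh.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5
```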
Some example load/fetch/download functions may include (one is sketched after the list):
load_random(...)
load_simulated(...)
load_sharon(...)
load_synthetic(...)
load_environmental(...)
load_cami(...)
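As a sketch of one such loader, patterned on the sklearn.datasets fetchers and building on the `simulated` dict and `checksum_matches()` helper sketched above; the cache directory, the `checksums` dict, and the returned metadata dict are all assumptions:

```python
import os

import gdown

def load_simulated(community="78", data_home="~/autometa_data"):
    """Fetch a simulated community, downloading and verifying it if necessary."""
    data_home = os.path.expanduser(data_home)
    os.makedirs(data_home, exist_ok=True)
    fpath = os.path.join(data_home, f"{community}.metagenome.fna.gz")
    if not os.path.exists(fpath):
        file_id = simulated[community]  # dict of Google Drive file IDs from above
        gdown.download(f"https://drive.google.com/uc?id={file_id}", fpath)
    if not checksum_matches(fpath, checksums[community]):  # hypothetical checksums dict
        raise OSError(f"Checksum mismatch for {fpath}")
    # Return the data alongside some metadata, similar to sklearn's Bunch objects
    return {"filepath": fpath, "community": community, "kind": "simulated"}
```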
Example API usage
Example usage could look something like this:
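For example (reusing the hypothetical `load_simulated()` sketched above):

```python
data = load_simulated(community="78")
print(data["filepath"])   # local path to the verified metagenome.fna.gz
print(data["community"])  # dataset metadata travels with the data object
```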
The data object in this case should also contain some metadata (similar to the sklearn API).