DeepChainBio / bio-datasets

Free collection of Bio datasets and embeddings
Apache License 2.0

PyPI License Python 3.7 Code style: black Dependencies

Bio-datasets

Open-source collection of biology datasets and pre-trained embeddings. :dna: :closed_book:

Description

bio-datasets is a collaborative framework for fetching publicly available, sequence-based protein datasets. Pre-trained contextual embeddings are also available for these datasets.

Installation

Install the package and its dependencies with pip install bio-datasets.
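
To confirm that the installation worked, one option is to check that the package can be found by the interpreter. The sketch below uses only the standard library. The import name biodatasets is taken from the usage example in this README; the helper is_importable is a hypothetical name introduced here for illustration.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it.

    Hypothetical helper, not part of bio-datasets.
    """
    return importlib.util.find_spec(module_name) is not None


# "biodatasets" is the import name used in this README's example.
print("biodatasets importable:", is_importable("biodatasets"))
```

If this prints False, the package is most likely installed in a different environment than the one running the script.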

How it works

from biodatasets import list_datasets, load_dataset

# List the names of the available datasets
print(list_datasets())

# Load your dataset
pathogen = load_dataset("pathogen")

# Display the available columns and embeddings
print(pathogen)

# Get data from your dataset
X, y = pathogen.to_npy_arrays(input_names=["sequence"], target_names=["class"])
embeddings = pathogen.get_embeddings("sequence", "protbert", "cls")

# Get a full description of your dataset
pathogen.display_description()
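
Once the inputs, targets and embeddings are loaded, a common next step is to hold out part of the data for evaluation. The sketch below is one way to do that with the standard library only. It is not part of bio-datasets: split_indices is a hypothetical helper, and the toy sequences and labels are placeholders standing in for the loaded data. Because this README does not state the exact shape of what to_npy_arrays and get_embeddings return, the example works on sample indices, which can then be applied to whichever arrays you have.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) from a reproducible shuffle.

    Hypothetical helper, not part of bio-datasets.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to split")
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    # Hold out at least one sample, but always leave one for training.
    n_test = min(n_samples - 1, max(1, round(n_samples * test_fraction)))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Placeholder data standing in for the loaded sequences and classes.
toy_sequences = ["MKT", "GAV", "LLS", "PQR", "TTE", "WYF", "CCH", "DEN", "IKM", "SVA"]
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

train_idx, test_idx = split_indices(len(toy_sequences), test_fraction=0.2, seed=0)
train_sequences = [toy_sequences[i] for i in train_idx]
test_sequences = [toy_sequences[i] for i in test_idx]
train_labels = [toy_labels[i] for i in train_idx]
test_labels = [toy_labels[i] for i in test_idx]
```

Splitting by index rather than by value means the same train_idx and test_idx can be reused for the raw sequences, the targets and the embeddings, so the three stay aligned.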

How to contribute

See CONTRIBUTING.md to learn how to set up the project or add a public dataset.