deepchem / deepchem

Democratizing Deep-Learning for Drug Discovery, Quantum Chemistry, Materials Science and Biology
https://deepchem.io/
MIT License

Indexing into Datasets for Random Access #1850

Open rbharath opened 4 years ago

rbharath commented 4 years ago

At present, there's no good API for random access to a single element of a dataset. For example, suppose I have a large DiskDataset with 100 million entries and a shard size of 10,000, and I'd like to access the 50 millionth entry. I'd currently have to walk through the dataset with iterbatches() until I reached this element. Here's a proposed alternative API that would be much nicer.

>>> dataset = dc.data.DiskDataset.create_dataset(...)
>>> len(dataset)
100000000
>>> dataset[50000000] # Zero-based, so this is actually the 50,000,001st entry
(array([....]), 1.0, 1.0, "50 millionth element label")

This can be extended to support slices as well.

>>> dataset[50000000:51000000]
(array([...]), array([...]), array([...]), array([...]))
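
For illustration, the core of such a lookup is mapping a global index to a shard and an offset within it. The following is only a sketch, not DeepChem code: `locate` and `shard_sizes` are hypothetical names, and it assumes the per-shard sample counts are known up front (for example, read from dataset metadata). It uses a cumulative sum plus bisection so that shards of unequal size also work.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, shard_sizes):
    """Map a global sample index to (shard_number, offset_within_shard).

    Hypothetical helper: `shard_sizes` is a list of per-shard sample counts.
    A real implementation would cache the cumulative sums instead of
    recomputing them on every call.
    """
    ends = list(accumulate(shard_sizes))  # cumulative end offset of each shard
    n_samples = ends[-1] if ends else 0
    if index < 0:  # Python-style negative indexing
        index += n_samples
    if not 0 <= index < n_samples:
        raise IndexError("dataset index out of range")
    shard = bisect_right(ends, index)  # first shard whose end is past index
    start = ends[shard - 1] if shard else 0
    return shard, index - start


# 100 million entries in shards of 10,000: entry 50,000,000 is the
# first sample of shard 5000, so only that one shard needs loading.
print(locate(50_000_000, [10_000] * 10_000))  # (5000, 0)
```

With equal-sized shards this reduces to `divmod(index, shard_size)`; the bisection only matters if some shards are shorter than others.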

One design question here is whether slicing should return a new Dataset or numpy arrays. This could potentially be controlled by a user-set flag.
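
To make the two options concrete, here is a minimal in-memory sketch. `ToyDataset` and its `slices_as_dataset` flag are hypothetical and exist only for illustration; they are not part of DeepChem, and a DiskDataset version would additionally need to load shards from disk.

```python
import numpy as np


class ToyDataset:
    """Hypothetical in-memory stand-in for a dataset; not a DeepChem class."""

    def __init__(self, X, y, w, ids, slices_as_dataset=False):
        self.X, self.y, self.w, self.ids = (
            np.asarray(a) for a in (X, y, w, ids))
        # Assumed flag: if True, slicing returns a new ToyDataset,
        # otherwise a tuple of numpy arrays.
        self.slices_as_dataset = slices_as_dataset

    def __len__(self):
        return len(self.X)

    def itersamples(self):
        # Yields one (X, y, w, id) tuple per sample.
        for i in range(len(self)):
            yield self.X[i], self.y[i], self.w[i], self.ids[i]

    def __getitem__(self, key):
        parts = (self.X[key], self.y[key], self.w[key], self.ids[key])
        if isinstance(key, slice) and self.slices_as_dataset:
            return ToyDataset(*parts, slices_as_dataset=True)
        # Integer key: one (X, y, w, id) sample. Slice key: four arrays.
        return parts


arrays = ToyDataset(np.zeros((5, 2)), np.arange(5.0), np.ones(5), list("abcde"))
X, y, w, ids = arrays[1:3]   # tuple of four numpy arrays
sub = ToyDataset(np.zeros((5, 2)), np.arange(5.0), np.ones(5), list("abcde"),
                 slices_as_dataset=True)[1:3]   # a new two-sample ToyDataset
```

Returning arrays is convenient for quick inspection, while returning a dataset keeps very large slices from having to be materialized in memory, which is presumably why a flag is attractive.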

I'd love some feedback on this design/idea. CC @peastman @vsomnath

peastman commented 4 years ago

I think this is a good idea. I suggest the return values should exactly match what you get from itersamples().
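
As a sketch of what "exactly match" could mean in practice: the helper below fetches the sample at a given position by walking `itersamples()`, which yields `(X, y, w, id)` tuples. `nth_sample` is a hypothetical name, not an existing DeepChem function. It is linear in the index, so it is only useful as a reference to test a fast `dataset[i]` against, not as the implementation.

```python
from itertools import islice


def nth_sample(dataset, index):
    """Return the index-th tuple yielded by dataset.itersamples().

    Hypothetical reference helper: works on any object exposing
    itersamples(), and takes O(index) time.
    """
    if index < 0:
        raise IndexError("negative indices are not supported in this sketch")
    try:
        # Skip the first `index` samples, then take the next one.
        return next(islice(dataset.itersamples(), index, None))
    except StopIteration:
        raise IndexError("dataset index out of range") from None
```

A unit test for the proposed API could then assert that `dataset[i]` equals `nth_sample(dataset, i)` element by element for a handful of indices.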