Indexing into Datasets for Random Access

At present, there's no good API for accessing an arbitrary element of a dataset by its index. For example, suppose I have a large DiskDataset with 100 million entries and a shard size of 10,000, and I'd like to access the 50 millionth entry. Currently I'd have to walk through the dataset with iterbatches() until I reached that element. Here's a proposed alternative API that would be much nicer.

>>> import deepchem as dc
>>> dataset = dc.data.DiskDataset.create_dataset(...)
>>> len(dataset)
100000000
>>> dataset[50000000]  # Zero-indexed, so this is actually the 50,000,001st entry
(array([...]), 1.0, 1.0, "50 millionth element label")
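
Here's a rough sketch of why this lookup could be cheap. It assumes every shard except possibly the last holds exactly `shard_size` rows; if shards are uneven, a table of cumulative shard sizes would be needed instead. `locate` is a hypothetical helper for illustration, not an existing DeepChem function.

```python
def locate(index, shard_size, n_total):
    """Map a global row index to (shard number, offset within that shard).

    Hypothetical helper; assumes uniformly sized shards.
    """
    if index < 0:
        index += n_total  # allow negative indices, like Python sequences
    if not 0 <= index < n_total:
        raise IndexError("dataset index out of range")
    return divmod(index, shard_size)


shard, offset = locate(50_000_000, shard_size=10_000, n_total=100_000_000)
print(shard, offset)  # 5000 0
```

Under that assumption only shard 5000 would need to be loaded, rather than the 5,000 shards that come before it.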

This can be extended to support slices as well.

>>> dataset[50000000:51000000]
(array([...]), array([...]), array([...]), array([...]))

One design question here is whether slicing should return a new Dataset or numpy arrays. This could potentially be controlled by a user-set flag.
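
To make the two options concrete, here's a toy sketch of `__getitem__` over shards. Everything in it is an assumption for illustration: `ToyShardedDataset` is not a DeepChem class, `slices_return_dataset` is a made-up flag name, plain Python lists stand in for numpy arrays and on-disk shards, and only step-1 slices are handled.

```python
class ToyShardedDataset:
    """Toy in-memory stand-in for a sharded dataset (hypothetical)."""

    def __init__(self, rows, shard_size, slices_return_dataset=False):
        rows = list(rows)
        self.shard_size = shard_size
        # Hypothetical flag: True -> slices give a new dataset, False -> raw rows.
        self.slices_return_dataset = slices_return_dataset
        self.shards = [rows[i:i + shard_size]
                       for i in range(0, len(rows), shard_size)]
        self._n = len(rows)

    def __len__(self):
        return self._n

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise NotImplementedError("this sketch handles step=1 only")
            rows = self._read_range(start, stop)
            if self.slices_return_dataset:
                return ToyShardedDataset(rows, self.shard_size, True)
            return rows
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("dataset index out of range")
        shard, offset = divmod(key, self.shard_size)
        return self.shards[shard][offset]

    def _read_range(self, start, stop):
        """Collect rows in [start, stop), touching only the shards that overlap."""
        rows = []
        if stop > start:
            first = start // self.shard_size
            last = (stop - 1) // self.shard_size
            for s in range(first, last + 1):
                base = s * self.shard_size
                lo = max(start - base, 0)
                hi = min(stop - base, self.shard_size)
                rows.extend(self.shards[s][lo:hi])
        return rows


ds = ToyShardedDataset(range(25), shard_size=10)
print(ds[12])         # 12
print(len(ds[7:23]))  # 16
```

A flag fixed at construction time keeps `dataset[a:b]` syntax unchanged, but it means the return type of a slice depends on how the dataset was created, which may be surprising. A separate method for one of the two behaviours would be another option.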

I'd love some feedback on this design/idea. CC @peastman @vsomnath

deepchem/deepchem issue #1850