At present, there's no good API to access an arbitrary element of a dataset by index. For example, suppose I have a large DiskDataset with 100 million entries and a shard size of 10,000, and I'd like to access the 50 millionth entry. I'd currently have to walk through the full dataset with iterbatches() until I reached this element. Here's a proposed alternative API that would be much nicer:
>>> import deepchem as dc
>>> dataset = dc.data.DiskDataset.create_dataset(...)
>>> len(dataset)
100000000
>>> dataset[50000000] # Yes this is actually the 50,000,001-th entry
(array([....]), 1.0, 1.0, "50 millionth element label")
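To make the cost argument concrete, here is a minimal sketch of how an integer index could be mapped to a shard without iterating. It is not DeepChem code: `locate` is a hypothetical helper, and it assumes every shard except possibly the last holds exactly `shard_size` rows, which a real DiskDataset may not guarantee.

```python
def locate(index, n_total, shard_size):
    """Map a global row index to (shard_number, offset_within_shard).

    Hypothetical helper; assumes uniformly sized shards.
    """
    if index < 0:
        # Support Python-style negative indexing.
        index += n_total
    if not 0 <= index < n_total:
        raise IndexError("dataset index out of range")
    return divmod(index, shard_size)


# The example from above: 100 million rows, 10,000 rows per shard.
shard, offset = locate(50_000_000, n_total=100_000_000, shard_size=10_000)
print(shard, offset)  # 5000 0
```

Under that assumption, only one shard would need to be loaded from disk to serve `dataset[50000000]`, instead of the first 5,000.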
This can be extended to support slices as well.
One design question here is whether slicing should return a new Dataset or numpy arrays. This could potentially be controlled by a user-set flag.
I'd love some feedback on this design/idea. CC @peastman @vsomnath
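To help the discussion of the slicing question, here is a rough sketch of the two return behaviours behind a flag. Everything in it is hypothetical: `ToyDataset` and `slice_returns_arrays` are made-up names, the data lives in memory, and plain Python lists stand in for numpy arrays so the snippet runs without dependencies.

```python
from dataclasses import dataclass


@dataclass
class ToyDataset:
    """In-memory stand-in for a dataset with X, y, w, ids columns."""
    X: list
    y: list
    w: list
    ids: list
    # Hypothetical user-set flag: if True, slices return raw columns.
    slice_returns_arrays: bool = False

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, key):
        columns = (self.X, self.y, self.w, self.ids)
        picked = tuple(col[key] for col in columns)
        if isinstance(key, slice) and not self.slice_returns_arrays:
            # Option 1: wrap the sliced columns in a new dataset.
            return ToyDataset(*picked)
        # Integer index, or Option 2: return the (X, y, w, ids) tuple.
        return picked


ds = ToyDataset(X=[[0.0], [1.0], [2.0], [3.0]], y=[0, 1, 0, 1],
                w=[1.0, 1.0, 1.0, 1.0], ids=["a", "b", "c", "d"])
print(ds[2])        # ([2.0], 0, 1.0, 'c')
print(len(ds[1:3])) # 2
```

One trade-off this makes visible: returning a new dataset keeps slices composable (`ds[1:3][0]` still works), while returning arrays is more convenient for feeding data straight into numerical code.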