ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes

Unable to slice torch dataset(s) #73

Closed (ayushkarnawat closed this 4 years ago)

ayushkarnawat commented 4 years ago

When attempting to slice a dataset the pythonic way (e.g. `dataset[0:5]`) to obtain a batch of data, we get the following error:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-158-1c61cf67eba7> in <module>
----> 1 dataset[0:5]

~/Documents/dev/python_workspace/profit/profit/utils/data_utils/datasets.py in __getitem__(self, idx)
    336 
    337         with self.db.begin() as txn, txn.cursor() as cursor:
--> 338             example = pkl.loads(cursor.get(self.keys[idx]))
    339             return {key: torch.FloatTensor(arr) for key,arr in example.items()}
    340 

TypeError: a bytes-like object is required, not 'list'
```
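For context, the failure happens because `self.keys[idx]` returns a *list* of keys when `idx` is a slice, while `cursor.get` expects a single bytes key. A minimal slice-aware sketch of `__getitem__`, assuming (as the traceback suggests) that `self.db` is an open LMDB environment and `self.keys` holds bytes keys:

```python
import pickle as pkl
import torch

def __getitem__(self, idx):
    # A slice makes `self.keys[idx]` a list, but `cursor.get` needs one
    # bytes key, hence the TypeError. Normalize both cases into a list.
    keys = self.keys[idx] if isinstance(idx, slice) else [self.keys[idx]]
    with self.db.begin() as txn, txn.cursor() as cursor:
        examples = [pkl.loads(cursor.get(key)) for key in keys]
    batch = [{name: torch.FloatTensor(arr) for name, arr in ex.items()}
             for ex in examples]
    # Keep the old behavior for integer indexing (a single dict) and
    # return a list of dicts for slices.
    return batch if isinstance(idx, slice) else batch[0]
```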
ayushkarnawat commented 4 years ago

Should we revert to the old style of serializing and loading the dataset as full tensors rather than as individual examples? This would help performance, since loading a batch would simply mean loading each array and slicing it at the specified indices. NOTE: This is just a conjecture and needs to be verified more fully. See `TensorDataset` for information on how to load tensor datasets.
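As a rough illustration of why full-tensor storage makes slicing trivial: `torch.utils.data.TensorDataset` applies the same index to every stored tensor, so slices work out of the box (the tensors below are placeholders, not the real dataset):

```python
import torch
from torch.utils.data import TensorDataset

# Placeholder arrays standing in for the deserialized full tensors.
inputs = torch.randn(100, 8)
targets = torch.randn(100, 1)

dataset = TensorDataset(inputs, targets)
batch = dataset[0:5]  # tuple (inputs[0:5], targets[0:5]); no per-key lookups
```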

However, if we save the dataset in shards/batches (see #35), it becomes hard to save each example individually. Instead, we have to concatenate each new example's arrays into their respective named arrays, which gets awkward once arrays with the same names have already been saved. Note that this also requires checking that the new example's arrays have the same size/shape as the arrays already stored in the previous shard.
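A minimal sketch of that bookkeeping, assuming each shard is a dict of named numpy arrays (the helper name and layout here are hypothetical):

```python
import numpy as np

def append_to_shard(shard: dict, example: dict) -> dict:
    """Concatenate a new example's arrays into the shard's arrays by name,
    verifying shapes against what is already stored."""
    for name, arr in example.items():
        arr = np.asarray(arr)[None, ...]  # add a leading batch axis
        if name not in shard:
            shard[name] = arr
        else:
            # The new example must match the per-example shape of the
            # arrays already saved under this name.
            if arr.shape[1:] != shard[name].shape[1:]:
                raise ValueError(
                    f"Shape mismatch for '{name}': {arr.shape[1:]} "
                    f"vs {shard[name].shape[1:]}")
            shard[name] = np.concatenate([shard[name], arr], axis=0)
    return shard
```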