ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes

Unable to slice torch dataset(s) #73

Closed (ayushkarnawat closed this 4 years ago)

ayushkarnawat commented 4 years ago

When attempting to slice a dataset the pythonic way (e.g. `dataset[0:5]`) to obtain a batch of data, we get the following error:

```
TypeError                                 Traceback (most recent call last)
<ipython-input-158-1c61cf67eba7> in <module>
----> 1 dataset[0:5]

~/Documents/dev/python_workspace/profit/profit/utils/data_utils/datasets.py in __getitem__(self, idx)
    336 
    337         with self.db.begin() as txn, txn.cursor() as cursor:
--> 338             example = pkl.loads(cursor.get(self.keys[idx]))
    339             return {key: torch.FloatTensor(arr) for key,arr in example.items()}
    340 

TypeError: a bytes-like object is required, not 'list'
```
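For context, the failure happens because `self.keys[idx]` returns a *list* of keys when `idx` is a slice, while `cursor.get` expects a single bytes key. A minimal slice-aware sketch of `__getitem__`, assuming (as the traceback suggests) that `self.db` is an open LMDB environment and `self.keys` holds bytes keys:

```python
import pickle as pkl
import torch

def __getitem__(self, idx):
    # A slice makes `self.keys[idx]` a list, but `cursor.get` needs one
    # bytes key, hence the TypeError. Normalize both cases into a list.
    keys = self.keys[idx] if isinstance(idx, slice) else [self.keys[idx]]
    with self.db.begin() as txn, txn.cursor() as cursor:
        examples = [pkl.loads(cursor.get(key)) for key in keys]
    batch = [{name: torch.FloatTensor(arr) for name, arr in ex.items()}
             for ex in examples]
    # Keep the old behavior for integer indexing (a single dict) and
    # return a list of dicts for slices.
    return batch if isinstance(idx, slice) else batch[0]
```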
ayushkarnawat commented 4 years ago

Should we revert to the old style of serializing and loading the dataset as full tensors rather than as individual examples? This would help performance, since loading a batch would simply mean loading each array and slicing it at the specified indices. NOTE: This is just a conjecture and needs to be verified more fully. See `TensorDataset` for information on how to load tensor datasets.
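As a rough illustration of why full-tensor storage makes slicing trivial: `torch.utils.data.TensorDataset` applies the same index to every stored tensor, so slices work out of the box (the tensors below are placeholders, not the real dataset):

```python
import torch
from torch.utils.data import TensorDataset

# Placeholder arrays standing in for the deserialized full tensors.
inputs = torch.randn(100, 8)
targets = torch.randn(100, 1)

dataset = TensorDataset(inputs, targets)
batch = dataset[0:5]  # tuple (inputs[0:5], targets[0:5]); no per-key lookups
```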

However, if we save the dataset in shards/batches (see #35), it becomes hard to save each example individually. Instead, we have to concatenate each new example's arrays into their respective named arrays, which gets awkward once arrays with the same names have already been saved. Note that this also requires checking that the new example's arrays have the same size/shape as the arrays already stored in the previous shard.
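A minimal sketch of that bookkeeping, assuming each shard is a dict of named numpy arrays (the helper name and layout here are hypothetical):

```python
import numpy as np

def append_to_shard(shard: dict, example: dict) -> dict:
    """Concatenate a new example's arrays into the shard's arrays by name,
    verifying shapes against what is already stored."""
    for name, arr in example.items():
        arr = np.asarray(arr)[None, ...]  # add a leading batch axis
        if name not in shard:
            shard[name] = arr
        else:
            # The new example must match the per-example shape of the
            # arrays already saved under this name.
            if arr.shape[1:] != shard[name].shape[1:]:
                raise ValueError(
                    f"Shape mismatch for '{name}': {arr.shape[1:]} "
                    f"vs {shard[name].shape[1:]}")
            shard[name] = np.concatenate([shard[name], arr], axis=0)
    return shard
```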