ayushkarnawat / profit

Exploring evolutionary protein fitness landscapes
MIT License

Unable to load TFRecords file into torch dataset #81

Closed ayushkarnawat closed 4 years ago

ayushkarnawat commented 4 years ago

When attempting to load any TFRecords file into a torch dataset, iterating over the dataset fails. This is similar to #76, but instead of RuntimeError: Failed to read the record., we now get ValueError: buffer size must be a multiple of element size (full traceback under "Current behavior"). To reproduce:

from torch.utils.data import DataLoader
from profit.utils.data_utils.datasets import TorchTFRecordsDataset

data = TorchTFRecordsDataset("data/3gb1/processed/transformer_fitness/primary.tfrecords")
loader = DataLoader(data, batch_size=64, num_workers=2)
for batch in loader:
    print([arr.shape for arr in batch.values()])

Current behavior

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/Users/ayushkarnawat/miniconda3/envs/chem/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 28, in fetch
    data.append(next(self.dataset_iter))
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/profit/profit/utils/data_utils/datasets.py", line 299, in __iter__
    for record in records:
  File "/Users/ayushkarnawat/Documents/dev/python_workspace/profit/profit/utils/data_utils/tfreader.py", line 133, in tfrecord_loader
    value = np.frombuffer(value[0])
ValueError: buffer size must be a multiple of element size
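
The last frame of the traceback shows np.frombuffer being called without a dtype. As a minimal sketch (not the actual tfreader.py code), the same ValueError can be reproduced with NumPy alone: np.frombuffer defaults to float64, which is 8 bytes per element, so any byte string whose length is not a multiple of 8 is rejected. The payload, the `decode` helper, and the float32 dtype below are assumptions chosen for illustration; the issue does not state how the records were serialized.

```python
import numpy as np

# Hypothetical payload: three float32 values (12 bytes), standing in for
# the raw bytes stored in one TFRecords feature.
raw = np.array([1.0, 2.0, 3.0], dtype=np.float32).tobytes()


def decode(buf, dtype=None):
    """Decode raw bytes; with dtype=None, np.frombuffer assumes float64."""
    if dtype is None:
        return np.frombuffer(buf)
    return np.frombuffer(buf, dtype=dtype)


try:
    # 12 bytes is not a multiple of 8 (the float64 item size), so this raises.
    decode(raw)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Passing the dtype the bytes were written with decodes them correctly.
values = decode(raw, dtype=np.float32)
print(values.shape, values.dtype)
```

If the features were written with a different dtype (for example an integer type for encoded sequences), that dtype would be the one to pass instead; the right choice depends on how the TFRecords file was produced.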

Expected behavior

[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([64, 56]), torch.Size([64, 1])]
[torch.Size([58, 56]), torch.Size([58, 1])]