calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0
411 stars 126 forks

Load Akita TFRecords Data #123

Open stasys-hub opened 2 years ago

stasys-hub commented 2 years ago

Hi! I am toying around with your awesome model and trying to recreate it in PyTorch for educational purposes. I downloaded the Akita training data and tried to load some records using the TFRecordDataset method. Then I tried to load a tf.train.Example so I could see how the records are formatted. Unfortunately, I am getting errors while loading: the data loader always tells me that the data is corrupted.

Could you elaborate on how to load the data into vanilla tensorflow or pytorch?

Here's how I tried to load it:

import json
import tensorflow as tf
from google.protobuf.json_format import MessageToJson  # needed for MessageToJson below

akita_train_0 = tf.data.TFRecordDataset("test-0.tfr")

for d in akita_train_0:
    ex = tf.train.Example()
    ex.ParseFromString(d.numpy())
    m = json.loads(MessageToJson(ex))
    print(m['features']['feature'].keys())
---------------------------------------------------------------------------
DataLossError                             Traceback (most recent call last)
/tmp/ipykernel_7908/2481927967.py in <module>
      1 import json
      2 
----> 3 for d in akita_train_0:
      4     ex = tf.train.Example()
      5     ex.ParseFromString(d.numpy())

/usr/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
    764   def __next__(self):
    765     try:
--> 766       return self._next_internal()
    767     except errors.OutOfRangeError:
    768       raise StopIteration

/usr/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
    747     # to communicate that there is no more data to iterate over.
    748     with context.execution_mode(context.SYNC):
--> 749       ret = gen_dataset_ops.iterator_get_next(
    750           self._iterator_resource,
    751           output_types=self._flat_output_types,

/usr/lib/python3.10/site-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next(iterator, output_types, output_shapes, name)
   3015       return _result
...
-> 7164   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   7165 
   7166 

DataLossError: corrupted record at 0 [Op:IteratorGetNext]

Thank you in advance!

davek44 commented 2 years ago

You can see how examples are parsed out of the tfrecords here: https://github.com/calico/basenji/blob/master/basenji/dataset.py#L104
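For anyone hitting the same DataLossError: the linked loader opens the files with `compression_type='ZLIB'`, which is the likely cause of the "corrupted record" error when reading them uncompressed. The sketch below writes a tiny synthetic ZLIB-compressed TFRecord and reads it back; the feature names (`sequence`, `target`), dtypes, and shapes here are illustrative assumptions, so verify them against the linked dataset.py for the real Akita data.

```python
import numpy as np
import tensorflow as tf

# --- Write a tiny synthetic ZLIB-compressed TFRecord for demonstration ---
# (shapes and feature names are made up for this example)
seq = np.random.randint(0, 2, size=(16, 4)).astype(np.uint8)
tgt = np.random.rand(8).astype(np.float16)
example = tf.train.Example(features=tf.train.Features(feature={
    'sequence': tf.train.Feature(bytes_list=tf.train.BytesList(value=[seq.tobytes()])),
    'target': tf.train.Feature(bytes_list=tf.train.BytesList(value=[tgt.tobytes()])),
}))
with tf.io.TFRecordWriter('demo.tfr', options='ZLIB') as writer:
    writer.write(example.SerializeToString())

# --- Read it back: compression_type='ZLIB' is the crucial argument ---
dataset = tf.data.TFRecordDataset('demo.tfr', compression_type='ZLIB')

def parse(raw):
    # Each feature is stored as a raw byte string and decoded to its dtype.
    feats = tf.io.parse_single_example(raw, {
        'sequence': tf.io.FixedLenFeature([], tf.string),
        'target': tf.io.FixedLenFeature([], tf.string),
    })
    sequence = tf.reshape(tf.io.decode_raw(feats['sequence'], tf.uint8), (16, 4))
    target = tf.io.decode_raw(feats['target'], tf.float16)
    return sequence, target

for sequence, target in dataset.map(parse):
    print(sequence.shape, target.shape)  # (16, 4) (8,)
```

The same `compression_type='ZLIB'` argument should make the original `tf.train.Example` / `MessageToJson` inspection snippet work on the downloaded files.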

stasys-hub commented 2 years ago

Perfect, thank you very much!