bshillingford / python-torchfile

Deserialize Lua torch-serialized objects from Python
BSD 3-Clause "New" or "Revised" License

torchfile.load() returns empty array on torch.ByteTensor #9

Closed by jakedailey1 7 years ago

jakedailey1 commented 7 years ago

Disclaimer: new to trying to work with serialized data, so this may be user error.

I've been working through implementing Professor de Freitas's Machine Learning course materials in TensorFlow (here), and I'm looking to use the data here to write the LSTM from Practical 6. I was delighted to find that you'd already gone to the trouble of writing this package!

torchfile.load() works fine for vocab.t7, but returns an empty array when I try running torchfile.load('train.t7'). When I dig into the code deeper, it seems that T7Reader correctly finds that typeidx == 4 (Torch) after calling reader.read_obj() a first time.

The code then dispatches through type_handlers, but on the second call to read_obj(), inside read_tensor_generic, it finds that typeidx == 0. As a result, storage is None and the code returns an empty array.

For reference, the class_name found is torch.ByteTensor and the version_number returned is 1. I'm running Python 3.5 via Conda on Windows x64.

Thank you! Jake

bshillingford commented 7 years ago

Have you tried on Linux? I suspect an issue with the size of long ints on Windows. Technically, the only supported setup is for files saved and loaded on the same system, both for Torch and for this library. Perhaps if Lua Torch loads it fine on Windows, they're remapping long to int64_t rather than using Windows' longs.
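To illustrate how a long-size mismatch could produce the typeidx == 0 symptom described above, here is a minimal sketch. It assumes a simplified, little-endian header for a one-dimensional tensor (dimension count, size, stride, storage offset, then the typeidx of the nested storage object); the field order and the helper names `build_header` and `parse_header` are illustrative assumptions, not code taken from torchfile.

```python
import struct


def build_header(n_elements):
    """Pack a simplified 1-D tensor header using 8-byte longs (assumed layout)."""
    return b"".join([
        struct.pack("<i", 1),           # number of dimensions (4-byte int)
        struct.pack("<q", n_elements),  # size[0] as an 8-byte long
        struct.pack("<q", 1),           # stride[0] as an 8-byte long
        struct.pack("<q", 1),           # storage offset as an 8-byte long
        struct.pack("<i", 4),           # typeidx of the nested storage object
    ])


def parse_header(buf, long_fmt):
    """Parse the header, reading every long with the given struct format."""
    pos = 0

    def read(fmt):
        nonlocal pos
        (value,) = struct.unpack_from(fmt, buf, pos)
        pos += struct.calcsize(fmt)
        return value

    ndim = read("<i")
    size = [read(long_fmt) for _ in range(ndim)]
    stride = [read(long_fmt) for _ in range(ndim)]
    storage_offset = read(long_fmt)
    typeidx = read("<i")
    return {
        "size": size,
        "stride": stride,
        "storage_offset": storage_offset,
        "typeidx": typeidx,
    }


header = build_header(1000)
print(parse_header(header, "<q"))  # longs read as 8 bytes, matching the writer
print(parse_header(header, "<l"))  # longs read as 4 bytes, as with a Windows long
```

With 4-byte reads, the parser consumes only half of each 8-byte field, so the value it takes for typeidx is really the zeroed upper half of the stride, which is consistent with the reader falling through to an empty result.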

jakedailey1 commented 7 years ago

I think you're right. I was able to open the file on a Linux box and reformat to something I can work with on Windows.

I don't think there's a direct workaround. Per this Python bug post, it seems this is bound to happen if the original serialization was done using the machine's native formatting.
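As a rough illustration of "native formatting", Python's struct module distinguishes native sizes (the `@` prefix, which follows the platform's C compiler) from standard sizes (for example the `<` prefix, which is fixed across platforms). The sketch below only reports sizes; the helper name `long_sizes` is made up for this example.

```python
import struct


def long_sizes():
    """Report the width in bytes of C integer types under different struct modes."""
    return {
        # Native mode: depends on the platform's C `long`
        # (typically 8 on 64-bit Linux, 4 on 64-bit Windows).
        "native_long": struct.calcsize("@l"),
        # Standard mode: fixed sizes regardless of platform.
        "standard_long": struct.calcsize("<l"),
        "standard_long_long": struct.calcsize("<q"),
    }


print(long_sizes())
```

A file written with native 8-byte longs can only be read back correctly by a reader that also uses 8 bytes per long, whether through native mode on a matching platform or through an explicit format such as `<q`.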

I'm sure typical users will not face this issue, but do you know if t7 files have anything explicit that we can use to warn users about these sorts of system/type mismatches (or is this just implicit in the binary)?

bshillingford commented 7 years ago

Unfortunately, binary t7 files were never meant to be portable, so no. The "fix" would be to have changeable settings for the size of each datatype when loading a file, as the t7 files don't record that themselves. I'd suggest HDF5 or ASCII t7 files as a portable alternative that may suit most of your needs.
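A changeable setting of this kind might look roughly like the sketch below. The `SizedReader` class and its `long_size` parameter are hypothetical and are not part of torchfile's API; the sketch also assumes little-endian data and covers only ints and longs.

```python
import io
import struct


class SizedReader:
    """Hypothetical binary reader whose C `long` width is chosen by the caller."""

    _LONG_FORMATS = {4: "<l", 8: "<q"}

    def __init__(self, stream, long_size=8):
        if long_size not in self._LONG_FORMATS:
            raise ValueError("long_size must be 4 or 8")
        self._stream = stream
        self._long_fmt = self._LONG_FORMATS[long_size]

    def _read(self, fmt):
        # Read exactly as many bytes as the format needs, or fail loudly.
        size = struct.calcsize(fmt)
        data = self._stream.read(size)
        if len(data) != size:
            raise EOFError("unexpected end of stream")
        return struct.unpack(fmt, data)[0]

    def read_int(self):
        return self._read("<i")

    def read_long(self):
        return self._read(self._long_fmt)


# One 8-byte long (5) followed by one 4-byte int (4), as a 64-bit Linux writer might emit.
payload = struct.pack("<qi", 5, 4)

matching = SizedReader(io.BytesIO(payload), long_size=8)
print(matching.read_long(), matching.read_int())

mismatched = SizedReader(io.BytesIO(payload), long_size=4)
print(mismatched.read_long(), mismatched.read_int())
```

Because the file itself does not say which width was used, the caller has to know where the file was written and pick `long_size` accordingly.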

jakedailey1 commented 7 years ago

I see. Thanks for your help with this @bshillingford, I'll close this out now.