bshillingford / python-torchfile

Deserialize Lua torch-serialized objects from Python
BSD 3-Clause "New" or "Revised" License

Conversion from lua table -> python OrderedDict? #10

Closed tastyminerals closed 7 years ago

tastyminerals commented 7 years ago

I have a table stored in .t7 file with the following structure:

{
  i2w -- {idx: word}
  tensor -- FloatTensor - size: vocabsize x 300
  w2i -- {word: idx}
}

The tensor rows correspond to the indices in the i2w table {idx: word}. I've checked the dict returned by torchfile.load('test.t7'), and unfortunately it looks like it does not keep the original tensor structure, where rows correspond to table indices. Can you confirm? So, basically, in order to convert the tensor to a Python dict correctly, do I need to convert the tensor to a Lua table of tensors?

bshillingford commented 7 years ago

Such a file should work perfectly fine. Note, of course, that Lua's data structures do not directly correspond to Python ones, so there are some heuristics you can turn off if you don't like them. What is the problem you're facing? Please be more specific.

tastyminerals commented 7 years ago

So, I have a .t7 file as noted above. It contains a torch tensor [6000x300] and the i2w and w2i tables. The tables contain {index: word} and {word: index} data respectively. The index value from both tables corresponds to a row in the tensor. This is needed to extract the corresponding vector for each word.

Here is the problem:

th> model = torch.load('test.t7')
th> t = model['tensor']
th> w2i = model['w2i']
th> i2w = model['i2w']
th> i2w[w2i['house']]
house

Doing the same in IPython:

import torchfile
model = torchfile.load('test.t7')
t = model['tensor']
w2i = model['w2i']
i2w = model['i2w']
i2w[w2i['house']]
issues

But it should be the word "house" if everything is converted correctly, because w2i is a "word" -> "index" map and i2w is an "index" -> "word" map.

If we compare the rows of the Lua tensor to the rows of the converted Python array, they do not correspond: lua tensor[1] != python tensor[0], and in my case it is critical that the NumPy array indices match the torch tensor rows. Hence my question: to what extent does torchfile replicate the .t7 structure? Because if it can't convert two matching "index" -> "word" and "word" -> "index" maps, I unfortunately can't use it.

bshillingford commented 7 years ago

Turn off use_list_heuristic; see the docstrings for details. In short, Lua is 1-indexed and Python is 0-indexed. I use a basic heuristic to distinguish between tables used as lists and tables used as dicts.

As I said before, there's no 1:1 correspondence between the two languages' data structures.
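
To make the off-by-one concrete, here is a minimal, self-contained sketch that reproduces the behaviour reported above without reading a real .t7 file. The three-word vocabulary, the row values, and the helper apply_list_heuristic are all made up for illustration; the helper only imitates the idea of a list heuristic and is not torchfile's actual implementation. With the real library, the equivalent of the "heuristic off" case would presumably be torchfile.load('test.t7', use_list_heuristic=False); check the docstrings for the exact behaviour.

```python
# Stand-in for what the Lua side stored (1-based indices). Illustrative data only.
lua_i2w = {1: "the", 2: "house", 3: "issues"}
lua_w2i = {word: idx for idx, word in lua_i2w.items()}
lua_rows = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}  # Lua tensor rows 1..3


def apply_list_heuristic(table):
    """Hypothetical imitation of a list heuristic (not torchfile's code):
    a table whose keys are exactly 1..n becomes a 0-indexed Python list,
    anything else stays a dict."""
    n = len(table)
    if n > 0 and all(k in table for k in range(1, n + 1)):
        return [table[k] for k in range(1, n + 1)]
    return dict(table)


i2w = apply_list_heuristic(lua_i2w)  # keys 1..3, so this becomes a list
w2i = apply_list_heuristic(lua_w2i)  # string keys, so this stays a dict
rows = [lua_rows[k] for k in sorted(lua_rows)]  # array rows are 0-indexed in Python

idx = w2i["house"]   # still the 1-based Lua index
naive = i2w[idx]     # off by one: picks the word after "house"
fixed = i2w[idx - 1]  # shift by one to index the 0-based list
vector = rows[idx - 1]  # the same shift selects the matching row

# With the heuristic off, i2w stays a dict keyed by the Lua indices,
# so the original lookup works unchanged.
no_heuristic = dict(lua_i2w)[idx]

print(naive, fixed, vector, no_heuristic)
```

Note that turning the heuristic off only affects how tables are converted. The tensor itself becomes a 0-indexed array either way, so row lookups still need the idx - 1 shift.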