facebookresearch / gtn

Automatic differentiation with weighted finite-state transducers.
MIT License

set_weights has some numerical instability around zero values #46

Closed brianyan918 closed 3 years ago

brianyan918 commented 3 years ago

Hi @awni!

I followed a similar implementation to the TransducerLossFunction in gtn_applications. I'm wondering whether you also experienced some instability when using these lines:

cpu_data = transition_params.cpu().contiguous()
transitions.set_weights(cpu_data.data_ptr())

I ran into a problem where a zero value in a tensor could become a very large value, a very small value, or even NaN after passing through the cycle of GPU tensor --> CPU tensor --> GTN --> NumPy.

I reproduced this behavior here: https://colab.research.google.com/drive/1AuZJlukSEwadrH6cokrqEjABBWuHJ8wQ?usp=sharing

For example, two of the zero values in a GPU tensor shifted while the rest did not. This does not happen if the tensor lives only on the CPU.

Do you have any suggestions on how to prevent this?

Copying @siddalmia @sw005320 as well.

awni commented 3 years ago

Hi @brianyan918,

The problem comes from constructing the CPU tensor and taking its pointer in the same line:

graph.set_weights(weights.cpu().contiguous().data_ptr())

The temporary tensor is destructed and its memory reclaimed because nothing holds a reference to it, so the pointer you pass to set_weights dangles. You can fix this by splitting it into two lines:

cpu_weights = weights.cpu().contiguous()
graph.set_weights(cpu_weights.data_ptr())

Admittedly this is a subtle issue; we should try to find some guardrails for this mistake, as we ran into it ourselves at least once!
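To make the lifetime issue concrete, here is a minimal sketch using only the standard library. `CpuTensor` is a hypothetical stand-in for the result of `weights.cpu().contiguous()` (not PyTorch's actual class); in CPython, a temporary with no remaining reference is reclaimed immediately, which is exactly when a raw pointer taken from it starts to dangle:

```python
import ctypes

freed = []

class CpuTensor:
    """Hypothetical stand-in for a CPU tensor owning a raw float buffer."""
    def __init__(self, values):
        self._arr = (ctypes.c_float * len(values))(*values)
    def data_ptr(self):
        # Raw address of the buffer, analogous to torch.Tensor.data_ptr()
        return ctypes.addressof(self._arr)
    def __del__(self):
        freed.append(True)  # the buffer is handed back to the allocator here

# One-liner: the temporary dies right after data_ptr() returns.
ptr = CpuTensor([0.0, 1.0]).data_ptr()
print(freed)  # [True] -- the buffer behind `ptr` is already gone

# Two-line fix: holding `cpu_weights` keeps the buffer alive.
freed.clear()
cpu_weights = CpuTensor([0.0, 1.0])
ptr = cpu_weights.data_ptr()
print(freed)  # [] -- now `ptr` is safe to hand to graph.set_weights(...)
```

The same reasoning explains why the bug looks "numerically unstable": set_weights reads whatever happens to occupy the freed memory, so the zeros can come back as huge values, tiny values, or NaN.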

brianyan918 commented 3 years ago

Wow! That is not at all what I was expecting to be the issue. Thanks for the fast response!

awni commented 3 years ago

@brianyan918 FWIW this example in the docs has a detailed comment about this issue. Might be worth going over as you are building a PyTorch integration.

Also if you have suggestions for how to help users of GTN avoid this problem in the future please let me know!
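One possible guardrail, sketched below with a stdlib-only stand-in (FakeGraph and set_weights_pinned are hypothetical, not part of GTN's actual API): a wrapper that pins the CPU buffer to the graph object, so the pointer can never outlive the memory it points into.

```python
import ctypes

class FakeGraph:
    """Hypothetical stand-in for gtn.Graph, reading weights back via ctypes."""
    def set_weights(self, ptr, n):
        self._ptr, self._n = ptr, n
    def weights(self):
        return list((ctypes.c_float * self._n).from_address(self._ptr))

def set_weights_pinned(graph, buf):
    # Keep a reference on the graph itself: the buffer now lives at least
    # as long as the graph does, so the raw pointer cannot dangle.
    graph._weights_keepalive = buf
    graph.set_weights(ctypes.addressof(buf), len(buf))

g = FakeGraph()
set_weights_pinned(g, (ctypes.c_float * 3)(0.0, 1.5, -2.0))
print(g.weights())  # [0.0, 1.5, -2.0]
```

A real integration would do the same with the tensor returned by `.cpu().contiguous()`: stash it on the graph (or in a parallel list) before calling set_weights with its data_ptr().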

brianyan918 commented 3 years ago

Thanks for the useful link. The documentation describes the issue very clearly, so it was my mistake not to have checked it first.