LLNL / GPLaSDI

http://software.llnl.gov/GPLaSDI/
MIT License

Modularizing encoder/decoder in Autoencoder class #10

Closed. dreamer2368 closed this issue 1 week ago.

dreamer2368 commented 3 weeks ago

Using DistributedDataParallel for data parallelism requires access to the encoder and decoder as torch.nn.Module objects. The current Autoencoder class exposes the encoder and decoder only as member functions, but DistributedDataParallel cannot call custom member functions other than forward.
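For illustration, a modularized layout along these lines might look like the sketch below. The class names, layer sizes, and DDP wrapping are hypothetical, not the actual GPLaSDI code:

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical encoder submodule; layer sizes are illustrative only."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class Decoder(nn.Module):
    """Hypothetical decoder submodule; layer sizes are illustrative only."""
    def __init__(self, latent_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(), nn.Linear(64, out_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class Autoencoder(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = Encoder(in_dim, latent_dim)  # a Module attribute, not a method
        self.decoder = Decoder(latent_dim, in_dim)  # a Module attribute, not a method

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


# DistributedDataParallel dispatches only through forward(), so once the encoder
# and decoder are proper Modules, each can be wrapped on its own, e.g.
#   ddp_encoder = nn.parallel.DistributedDataParallel(model.encoder)
#   ddp_decoder = nn.parallel.DistributedDataParallel(model.decoder)
# (after torch.distributed.init_process_group has been called).
```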

Per @punkduckable, we should implement multihead attention properly as a layer rather than as an activation function. Since this PR simply translates the current implementation, that change is tracked as issue #13.
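As a rough sketch of what "attention as a layer" could mean here, using torch's nn.MultiheadAttention directly; the block name, the (batch, sequence, embed_dim) input shape, and the self-attention wiring are assumptions, not part of the current code:

```python
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """Hypothetical example: multihead attention used as a layer, not an activation."""
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: the same tensor supplies queries, keys, and values;
        # the layer returns (output, attention weights) and we keep the output.
        out, _ = self.attn(x, x, x)
        return out
```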

dreamer2368 commented 1 week ago

Overall, the code looks quite good (I went through it in detail yesterday and agree with pretty much all of the changes). There is one thing I would like to see changed, however: multiheaded attention. This isn't really an activation function, even though torch's API classifies it as such. It implements the attention layer of a transformer, which is designed for mapping finite sequences to finite sequences, and I do not think it makes any sense in this context. To make this point even clearer, notice that the apply_attention function uses the input matrix, x, for the keys, queries, and values, which is a little strange to say the least. I think this should be removed from the MLP class. That would mean removing "multihead" from the activation dictionary, removing num_heads as an argument to the MLP initializer, and removing the apply_attention method. Below is a copy of latent_space.py with these changes implemented:
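(The copy of latent_space.py attached to the original comment is not reproduced here. The sketch below is only an assumed illustration of the described changes, with "multihead" dropped from the activation dictionary, num_heads dropped from the initializer, and apply_attention removed; the class layout and defaults are guesses, not the actual source.)

```python
import torch
import torch.nn as nn

# Activation dictionary with "multihead" removed: multihead attention is a layer,
# not an element-wise activation. The remaining entries are illustrative.
activation_dict = {
    'tanh': nn.Tanh,
    'relu': nn.ReLU,
    'sigmoid': nn.Sigmoid,
}


class MLP(nn.Module):
    # num_heads is no longer an initializer argument in this sketch.
    def __init__(self, layer_sizes, activation: str = 'tanh'):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(layer_sizes[i], layer_sizes[i + 1])
             for i in range(len(layer_sizes) - 1)]
        )
        self.activation = activation_dict[activation]()

    # apply_attention has been removed; forward is a plain feed-forward pass.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers[:-1]:
            x = self.activation(layer(x))
        return self.layers[-1](x)
```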

@punkduckable, thanks for implementing a new version of this. As I noted in the PR, this PR simply translates the current implementation. While we could add the new change right here, I suggest making it a future PR in order to avoid code conflicts with the upcoming PRs #11 through #16. For the record, I have also filed this as issue #13.

dreamer2368 commented 1 week ago

@punkduckable, I see that you have already implemented this in the later PR #15. It doesn't make sense to duplicate here a feature that will be merged again later anyway.