RRisto / seq2seq

upgrade on pytorch seq2seq tutorial

Customise word embedding #3

Open ChristopherLu opened 5 years ago

ChristopherLu commented 5 years ago

Hi, nice repo!

I have a seq2seq problem where I try to 'translate' continuous input sequences (e.g., 1.23, -0.56, 0.12) into integer output sequences (e.g., 0, 2, 5, etc.).

For example:
Input sequence: [1.21, 0.62, 0.37, -0.61, -0.66, 1.89, 0.25, 0.68]
Output sequence: [3, 6, 5, 1]

Could you advise how to modify your code (especially the word embedding) to handle the above task?

RRisto commented 5 years ago

Thanks! What are the output integers (a continuous variable, or just some tokens/IDs)? For the input you don't have to provide pretrained embedding vectors (they will be initialized and learned during training). If your output integers are just ID-like tokens, you could turn them into strings for the model (and later convert the output back to integers). But if they are some kind of continuous variable, we should change the loss function (maybe to RMSE). If the output is continuous, it raises the question of whether seq2seq is the preferred model type.
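
The string-token route would look roughly like this (just a sketch):

```python
# sketch: treat the output integers as string tokens for the model
targets = [3, 6, 5, 1]
target_tokens = [str(i) for i in targets]            # ['3', '6', '5', '1'] can go through the normal vocab pipeline
back_to_ints = [int(tok) for tok in target_tokens]   # and model output can be converted back afterwards
```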

ChristopherLu commented 5 years ago

Thanks.

The outputs are integers ranging from 1 to 9 (excluding tokens for padding, sos and eos).
I do not think an embedding is necessary if the inputs are real-valued (continuous) sequences, as there is no lookup table for them. But I guess a fully connected layer could help here to project the input dimension to the hidden size, roughly as in the sketch below.
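
Something like this is what I have in mind (just a sketch; `hidden_size` and the names are placeholders):

```python
import torch
import torch.nn as nn

# sketch: a linear projection in place of the nn.Embedding lookup
input_dim, hidden_size = 6, 256              # 6-dim sensor vectors; hidden size is a placeholder
project = nn.Linear(input_dim, hidden_size)

x = torch.randn(8, input_dim)                # one input sequence of length 8
h = project(x)                               # (8, hidden_size), same shape an embedding lookup would give
```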

If the output is continuous, do you think seq2seq is suitable here? Or should I go for models like WaveNet?

RRisto commented 5 years ago

I am afraid that continuous output needs more customization (loss function, probably optimizer). Currently this seq2seq works with nominal output (it returns a sequence of words). For audio data WaveNet looks more suitable. A Generative Adversarial Network might also help.
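
If the output did turn out to be continuous, the loss change would be roughly this (an untested sketch, not specific to this repo):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
pred = torch.randn(4, 10)              # dummy continuous predictions
target = torch.randn(4, 10)            # dummy continuous targets
loss = torch.sqrt(mse(pred, target))   # RMSE instead of the current CrossEntropyLoss
```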

ChristopherLu commented 5 years ago

But what about continuous inputs with integer (token-like) outputs? Any suggestions for modifying your repo to support this? I see the seq2seq dataset Python code is very customised to NLP cases...

ChristopherLu commented 5 years ago

To be more specific, my inputs are T * 6 sequences (i.e., each time step is a 6-dim vector) with variable sequence lengths. The output is more standard: integer sequences of variable length.

How should I change the _to_padded_target_tensor() function in data_manager.py if I want to use batch processing here? For now, this padding function can only deal with T * 1 sequences of integers (token ids).
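
For context, what I need the batching to produce is roughly this (a sketch using `pad_sequence`; the lengths are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three input sequences of lengths 8, 5 and 3, each time step a 6-dim vector
batch = [torch.randn(8, 6), torch.randn(5, 6), torch.randn(3, 6)]
padded = pad_sequence(batch, batch_first=True)   # shape (3, 8, 6), zero-padded to the longest sequence
```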

RRisto commented 5 years ago

As I understand it, the input is similar to a sequence of words (like the sentence 'this is input', where each word would be represented by a 6-dim vector)? Are the 6-dim vectors acting like embeddings (which are n-length vectors)?

ChristopherLu commented 5 years ago

No, it is not an NLP task.

The input here is a T * 6 motion signal (6 is determined by the sensor used) and the outputs are digit tokens (PIN-like). That is why I said the embedding layer for the input encoder is not needed here.

RRisto commented 5 years ago

To be honest, I don't have much experience with that kind of data. One way would be to remove the embedding layer from the encoder and decoder and somehow feed the data into them (in the end the data is just a tensor). The tensor dimensions should match the ones the model expects.
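
Roughly what I mean, as an untested sketch (the hidden size is an assumption):

```python
import torch
import torch.nn as nn

hidden_size = 256
gru = nn.GRU(input_size=6, hidden_size=hidden_size, batch_first=True)  # no embedding layer in front

x = torch.randn(4, 10, 6)    # (batch, seq_len, 6-dim signal) fed to the encoder directly
outputs, hidden = gru(x)     # outputs: (4, 10, hidden_size), hidden: (1, 4, hidden_size)
```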

ChristopherLu commented 5 years ago

Yes, that's exactly what I am doing now: removing the word embedding layer in the encoder. However, I somehow suffer from memory leakage... Do you have any idea which part of your code (perhaps the data manager?) could cause this issue? My training data is relatively big (>> 33K sequences)...

RRisto commented 5 years ago

I guess you don't need to use my data manager, because its main purpose is to deal with language-specific things (removing sequences that are too long/short, replacing tokens with too low a frequency, etc.). As it keeps everything in memory, it is not very memory-efficient. If your data is already in tensor/matrix format, you could use DataLoaders (and pass a custom collate_fn as an argument if the tensors need some processing before training). An example of how a simple DataLoader can be made: https://pytorch.org/tutorials/beginner/nn_tutorial.html. Then you need to refactor the dataloader part in the learner: https://github.com/RRisto/seq2seq/blob/master/seq2seq/model/seq2seq_learner.py#L109
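
A minimal version could look something like this (a sketch only; it assumes inputs are T * 6 float tensors and targets are 1-D tensors of token ids):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

class SignalDataset(Dataset):
    """Hypothetical wrapper around lists of (T, 6) input tensors and 1-D target tensors."""
    def __init__(self, inputs, targets):
        self.inputs, self.targets = inputs, targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

def collate(batch):
    # pad variable-length sequences per batch
    xs, ys = zip(*batch)
    return (pad_sequence(xs, batch_first=True),                   # (batch, T_max, 6)
            pad_sequence(ys, batch_first=True, padding_value=0))  # (batch, L_max)

inputs = [torch.randn(t, 6) for t in (8, 5, 12)]            # dummy variable-length signals
targets = [torch.randint(1, 10, (n,)) for n in (4, 3, 6)]   # dummy token-id sequences
loader = DataLoader(SignalDataset(inputs, targets), batch_size=2,
                    shuffle=True, collate_fn=collate)
```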