Arturus / kaggle-web-traffic

1st place solution
MIT License

Is this actually "encoder-decoder" vs. standard many-to-many? #22

Closed wasd12345 closed 6 years ago

wasd12345 commented 6 years ago

Thanks for sharing your code, very helpful.

From your computational graph and model code, it looks like the "decoder" at each timestep takes 2 inputs: 1) the previous hidden state from the decoder (the hidden state of the decoder GRU cell at the previous timestep), and 2) a concatenated vector of inputs = [previous prediction, features, attention], where attention is optional.

The first decoder timestep gets the "encoded state" as the last hidden state of the encoder, but later decoder timesteps do NOT get this encoded representation again. So the computational graph does not look like the one in the original RNN encoder-decoder paper, or like the seq2seq encoder-decoder section of the Deep Learning book. I.e. it seems this model architecture is more like a standard many-to-many RNN than an encoder-decoder, right? I.e. you do not feed in the encoded state "c" again?
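The reading above can be sketched in a few lines of numpy. Everything here is illustrative, not the repo's actual code: a plain tanh cell stands in for the GRU, and all names (`decode_as_is`, `w_in`, `w_out`) are made up for the sketch. The point is just where the encoded state enters:

```python
import numpy as np

def decode_as_is(encoded_state, features, w_in, w_out):
    """Decoder as described above: the encoder's final state
    initialises h ONCE; later steps see only
    [previous prediction, per-step features]."""
    h = encoded_state.copy()              # context enters here, once
    prev_y, preds = 0.0, []
    for feat_t in features:               # features: (horizon, n_features)
        x = np.concatenate([[prev_y], feat_t, h])
        h = np.tanh(w_in @ x)             # stand-in recurrent cell update
        prev_y = float(w_out @ h)         # next-step prediction, fed back
        preds.append(prev_y)
    return np.array(preds)
```

Nothing in the loop reintroduces `encoded_state` after the first assignment, which is exactly the question being asked.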

Thanks

Cho encoder-decoder: https://arxiv.org/pdf/1406.1078.pdf Fig. 1 on p. 2

*(image: Cho encoder-decoder figure)*

Deep Learning Book: http://www.deeplearningbook.org/contents/rnn.html Section 10.4 Encoder-Decoder Sequence-to-Sequence Architectures pg. 391 Fig 10.12

*(image: Deep Learning book encoder-decoder figure)*

Your model: https://github.com/Arturus/kaggle-web-traffic/blob/master/how_it_works.md#model-core

*(image: Kaggle model computational graph)*
wasd12345 commented 6 years ago

See this fork for the same model with the more typical encoder-decoder option: https://github.com/wasd12345/kaggle-web-traffic

Technically, the above linked fork is not quite like the Deep Learning book either. It does not provide the context vector C when predicting y; instead it only provides the context to the decoder states, but at every timestep, not just the first decoder timestep.
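A minimal sketch of this variant, again with a plain tanh cell standing in for the GRU and with all names illustrative. The only change from the original architecture is that the fixed context vector `c` joins the input at every step:

```python
import numpy as np

def decode_with_context(encoded_state, features, w_in, w_out):
    """Fork's variant as described above (illustrative sketch):
    the encoder's final state c is concatenated into the decoder
    INPUT at every step, feeding the decoder states rather than
    the output y directly."""
    c = encoded_state                     # fixed context, re-fed each step
    h = encoded_state.copy()              # still also the initial state
    prev_y, preds = 0.0, []
    for feat_t in features:               # features: (horizon, n_features)
        x = np.concatenate([[prev_y], feat_t, c, h])
        h = np.tanh(w_in @ x)             # stand-in recurrent cell update
        prev_y = float(w_out @ h)
        preds.append(prev_y)
    return np.array(preds)
```

Compared with the as-is decoder, `w_in` gains extra columns for `c`, so the decoder can keep consulting the encoded summary at every horizon step instead of letting it wash out of the hidden state.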

After a very basic comparison [just running both modes on my data for 10 epochs] with vs. without providing the encoded-state context vector C, it seems it does help: notice the lower SMAPE for the encoder-decoder option:

NO encoder-decoder (as is):

*(screenshot: training SMAPE without encoder-decoder context)*

vs.

WITH more typical encoder-decoder (in the linked fork):

*(screenshot: training SMAPE with encoder-decoder context)*

Walk-forward validation SMAPEs printed out are consistently 1% to 2% better now. [**Note: these are the validation SMAPEs printed during the training phase, which uses an epsilon-smoothed SMAPE on log1p-transformed data. So the absolute numbers are irrelevant; the point is that the encoder-decoder context is better on a relative basis.]
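For reference, one common epsilon-smoothed form of the training-time metric described above; this is a sketch of the idea, not necessarily the exact formula in the repo:

```python
import numpy as np

def smoothed_smape(true, pred, eps=0.1):
    """Epsilon-smoothed SMAPE computed on log1p-transformed values.
    eps keeps the denominator away from zero, which is why the
    absolute numbers are not comparable to a true SMAPE; only
    relative comparisons between runs are meaningful."""
    t, p = np.log1p(true), np.log1p(pred)
    return float(np.mean(2.0 * np.abs(p - t) / (np.abs(t) + np.abs(p) + eps)))
```

Because of `eps` and the log1p transform, a value printed during training is a relative score for comparing two runs, not a percentage error on the raw series.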

The wasd12345 fork uses a computational graph like this:

*(image: wasd12345 fork computational graph)*
yyyyyyhm commented 6 years ago

How can I show the value of the loss?

wasd12345 commented 6 years ago

@yuhaomin By default in Arturus's original branch, I think it runs with forward-split evaluation mode, so it should print out the SMAPE loss for both the train and validation stages. [You'll need to install the Python package tqdm for the progress-bar status printouts.]

You can also use the --side_split option, although I agree with what Arturus said about the side_split numbers being less useful. On my own data, I observed side_split SMAPEs an order of magnitude lower than the forward-eval ones [which are comparable to the test sets I'm using, so are believable], so the side_split numbers are not credible. It's probably best to run only with forward eval.

yyyyyyhm commented 6 years ago

Thank you very much for your answer. But my results don't show the SMAPE loss, MAE loss, etc.; every value is NaN: [Best top SMAPE=nan (), frwd/side best MAE=nan/nan, SMAPE=nan/nan; avg MAE=nan/nan, SMAPE=nan/nan, 3 active models]. I'm not sure how to display the values.

wasd12345 commented 6 years ago

@yuhaomin In his readme, Arturus (the original branch) gave the training command as: python trainer.py --name s32 --hparam_set=s32 --n_models=3 --name s32 --no_eval --no_forward_split --asgd_decay=0.99 --max_steps=11500 --save_from_step=10500

That command does no evaluation, so: get rid of the --no_eval option to enable evaluation, and remove --no_forward_split to do forward-split evaluation, i.e. use something like this: python trainer.py --name=s32 --hparam_set=s32 --n_models=3 --asgd_decay=0.99 --max_steps=11500 --save_from_step=1000 --max_epoch=50

Are you using your own data? If so, make sure it contains no NaNs that corrupt the encoder state.
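A minimal numpy sketch of such a sanity check, on illustrative data (not tied to the repo's actual loaders):

```python
import numpy as np

# NaNs anywhere in the input series propagate through the encoder
# state and turn every printed metric into NaN.
series = np.array([[1.0, 2.0, np.nan],
                   [3.0, 4.0, 5.0]])       # rows = pages, cols = days
bad_rows = np.where(np.isnan(series).any(axis=1))[0]
print(bad_rows)                            # indices of series containing NaN

# One simple fix; interpolating or dropping may suit the data better.
series = np.nan_to_num(series, nan=0.0)
assert not np.isnan(series).any()
```

Running a check like this before training makes the all-NaN-metrics symptom easy to distinguish from a genuine model or version problem.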

yyyyyyhm commented 6 years ago

I appreciate your answer. In fact, I tried changing the parameters (python trainer.py --name=s32 --hparam_set=s32 --n_models=3 --asgd_decay=0.99 --max_steps=11500 --save_from_step=1000 --max_epoch=50), but it doesn't work. This problem has been bothering me for a few days: as soon as I change the parameters, the project stops working with this error: (ValueError: Variable m_0/gru_cell_1/w_ru does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=tf.AUTO_REUSE in VarScope?) Did you meet the same problem?

wasd12345 commented 6 years ago

No, I don't think I had that error, but I did have other issues where an all-NaN encoder state led to all-NaN metrics, related to the train_completeness_threshold. Are you using the same Python/TensorFlow versions? Are you using this Arturus branch or the wasd12345 fork? The Kaggle data or your own?

yyyyyyhm commented 6 years ago

Thank you! Maybe the error is caused by the TensorFlow version (TensorFlow 1.6). And I am also trying to use my own data with this project.

Arturus commented 6 years ago

@wasd12345 I did not think that providing the last encoder state at every decoder step would help, but as your results show, it really does improve the results. So yes, your "classic" decoder variant is better than mine, thank you for sharing this!

Btw, I updated the model code to work with TF 1.10.