thangk opened this issue 5 months ago
Hi @thangk thanks for the progress log.
OpenNMT only gives you translation metrics like ppl (perplexity), as seen in the image.
@jamil2388 please advise
@hosseinfani, @thangk for now, I am putting a doc link here. It contains almost all the arguments used in the onmt pipeline.
https://community.libretranslate.com/t/documentation-for-opennmt-py-parameters/927/
I think looking into this argument on the page might help us with prediction file dumping: --dump_preds
Also, I advise Kap to learn about the behavior of the translation metrics used in the current runs, because it will crucially help in understanding the model's train and test behavior, eventually telling us the direction of adjustments.
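For reference, ppl is just the exponential of the average per-token cross-entropy; a minimal sketch of that relationship (illustrative only, not OpenNMT's internal code):

import math

def perplexity(total_nll, num_target_tokens):
    # ppl = exp(average negative log-likelihood per target token);
    # lower ppl means the model assigns higher probability to the reference output
    return math.exp(total_nll / num_target_tokens)

# Example: a total NLL of 3000 nats over 1000 target tokens gives ppl = e^3 ≈ 20.1
print(perplexity(3000.0, 1000))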
Thanks!
@jamil2388 thanks.
@thangk one more thing. When exploring hyperparameters, also see how you can use OpenNMT for different types of translators, because we need to study the effect of translation for our work. These translators should be published in a paper so that we can cite them in our paper. I think the OpenNMT community keeps updating their codebase to include more and more new translators, which helps you with our task (this is like @jamil2388 using different gnn methods from pyg for team formation).
Hi @hosseinfani, I'll continue my question here if that's okay.
continuing the conversation about whether or not to average all the folds' eval metrics to get one set of data for each epoch setting (i.e., 500, 1000)
I was referring to these. Each fold produces its own eval metrics. There's one more, fold2, below fold1, which isn't visible in the screenshot. I think the right approach is to average the e500 and e1000 pairs across all 3 folds to put in the Excel sheet.
I saw some charts we've used in some papers, and I can see those papers use the average of the folds. I'll follow the same approach.
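For the averaging itself, here is a minimal sketch of what I have in mind, assuming pandas and per-fold CSV files named like f0.test.e500.csv (the file names and column layout are assumptions, not the actual OpeNTF output format):

import glob
import pandas as pd

# Load every fold's eval-metric file for one epoch setting (hypothetical names)
frames = [pd.read_csv(path, index_col=0) for path in sorted(glob.glob("f*.test.e500.csv"))]

# Element-wise mean across folds; assumes every file shares the same rows/columns
avg = pd.concat(frames).groupby(level=0).mean()
avg.to_csv("test.e500.avg.csv")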
Hi @thangk thanks for bringing the conversation here :)
now I see. There should be another file with no fold index, like test.epoch*, that includes the average over the folds.
but you're right about averaging over the folds
Yes, I see one outside the fold folders.
@thangk my preference is to keep the progress logs like this issue, rather than chats in Teams or elsewhere.
Yesterday, I ran three seq2seq-based models (Transformer, ConvS2S, RNN with attention) on the dblp (filtered) dataset, and out of the three, only two (ConvS2S and RNN with attention) ran successfully with the baseline configs I set.
Here are the first run results for ConvS2S (left) and RNN with attention (right)
It seems there are issues with the shape of the input in the transformer model. I'll dig into the issue.
This was the first run of all datasets using the ConvS2S model.
Hyperparameters:
word_vec_size: 128
cnn_size: 512
layers: 15
cnn_kernel_width: 3
encoder_type: cnn
decoder_type: cnn
optim: adam
learning_rate: 0.001
learning_rate_decay: 0.9
start_decay_steps: 50
decay_steps: 50
batch_size: 4
dropout: 0.5
@thangk can you put the result of pure bnn and fnn, 1-hot skills in the input?
I was thinking of putting the best results from Jamil's FNN and BNN. Do you want me to put the pure FNN and BNN from Rad et al's paper?
yes, I believe Jamil has reproduced the results already.
yeah, he has the results for imdb and dblp. I'm gathering them for these tables.
@hosseinfani
This is what I currently have for imdb. I am working on dblp now. The transformer model isn't working quite right as it needs some more debugging.
dblp
ConvS2S
t99375.s29661.m14214.etcnn.l512.wv256.lr0.0005.b16.e1000
RNN
t99375.s29661.m14214.etrnn.l512.wv256.lr0.0005.b16.e1000
imdb
ConvS2S
t32059.s23.m2011.etcnn.l512.wv256.lr0.0005.b16.e1000
RNN
t32059.s23.m2011.etrnn.l512.wv256.lr0.0005.b16.e1000
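For reference, my reading of the run-name convention above (the field meanings are my own interpretation from the dataset stats and hyperparameters, not an official format), as a small parser sketch:

import re

def parse_run_name(name):
    # e.g. "t32059.s23.m2011.etcnn.l512.wv256.lr0.0005.b16.e1000"
    # t = #teams, s = #skills, m = #members, et = encoder type, l = hidden size,
    # wv = word_vec_size, lr = learning rate, b = batch size, e = epochs (assumed meanings)
    pattern = (r"t(?P<teams>\d+)\.s(?P<skills>\d+)\.m(?P<members>\d+)"
               r"\.et(?P<encoder>[a-z]+)\.l(?P<hidden>\d+)\.wv(?P<word_vec>\d+)"
               r"\.lr(?P<lr>\d+\.\d+)\.b(?P<batch>\d+)\.e(?P<epochs>\d+)")
    m = re.match(pattern, name)
    return m.groupdict() if m else {}

print(parse_run_name("t32059.s23.m2011.etcnn.l512.wv256.lr0.0005.b16.e1000"))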
Hyperparameters for Run 3:
# ConvS2S
word_vec_size: 256
cnn_size: 512
layers: 10
cnn_kernel_width: 3
encoder_type: cnn
decoder_type: cnn
optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4
# RNN
word_vec_size: 256
rnn_size: 512
layers: 2
encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM
optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4
Edit: dblp results added.
How would I phrase this if I don't yet see a substantial performance improvement over the others? Since it's a first for team formation, I'm not sure I even have the best settings yet. I've tried a few, but they still aren't as good as the fnn or bnn values. Can I say it has potential to be a viable option for team formation tasks but needs further research?
Hi @thangk
here is my reply:
regarding the low performance of seq2seq, you need to know that these models map a sentence to another one, that is, the input space and the output space are each the size of a language's vocabulary (~100k tokens), while keeping the order between tokens. If they're not performing well, we need to find out why, and then how to change/customize them for our problem. For imdb, it makes sense because the input space is just 20-30 words that should be mapped to a large output space. So, we can point to the sparsity of the source sequence/language. What else?
I see. I've also added the pure bnn, bnn_emb and rrn from other papers as the baselines. After doing this, my results aren't too far off, some are even better than the baselines. So, this validates my statement in the abstract.
Still, I'm eager to find more optimized hyperparameters and will keep working on that. In the meantime, I'll keep these data and work more on the write-up. I'm also running gith and uspt on both convs2s and rnn with the same hyperparameters.
Results for gith and uspt with convs2s and rnn, using the same hyperparameters as the other two datasets:
I ran a bunch of tests today to see how the metrics respond to the hyperparameters.
Test 3 produced the best result so far, besides AUCROC (which I'll also still work on), and I'll run more tests from this result.
I noticed that we hardcoded the checkpoint interval to 500 in nmt.py, even though we have a field for it in the config file. I was wondering why my latest run with a large epoch count was using so much space. I've commented this out now, so it shouldn't have this big space issue anymore.
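A minimal sketch of the intended behavior, assuming nmt.py loads our YAML settings before building the OpenNMT-py arguments (the function and default below are illustrative, not the actual nmt.py code):

import yaml

def get_checkpoint_steps(cfg_path, default=5000):
    # Read save_checkpoint_steps from the YAML config instead of a hardcoded 500,
    # falling back to a default only when the field is missing.
    with open(cfg_path) as f:
        cfg = yaml.safe_load(f) or {}
    return int(cfg.get("save_checkpoint_steps", default))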
It was using a lot of space
I'll delete this right away as soon as it's finished training. I've calculated how much more it'll take, and we have enough space to complete this training.
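Back-of-the-envelope sketch of that calculation (the per-checkpoint size here is an assumed placeholder; plug in the actual size on disk):

def checkpoint_storage_gb(train_steps, save_checkpoint_steps, gb_per_checkpoint):
    # Total footprint is roughly (number of saved checkpoints) * (size per checkpoint)
    n_checkpoints = train_steps // save_checkpoint_steps
    return n_checkpoints * gb_per_checkpoint

# Example with assumed numbers: 200k steps, a checkpoint every 500 steps, ~0.5 GB each
print(checkpoint_storage_gb(200_000, 500, 0.5))  # 200.0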
I was able to run the Transformer model with the following settings:
# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000
# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]
# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]
beam_size: 5
length_penalty: 1.0
And here's the result compared to the others:
We need to make the comparison between the nmt models themselves and the bnn and fnn models. So, please do:
This way, we argue that although we run the nmt models with more layers or epochs, which may give them an advantage over bnn and fnn, the bnn and fnn cannot even take advantage of more epochs or layers for the same running time/memory.
@thangk let me know if you need more clarification.
@hosseinfani
Okay, I will redo the models with settings comparable to the FNN and BNN's. Apparently, the models I've posted in the tables are trained with steps instead of epochs. I'll find the epoch values used for Jamil's FNN and BNN numbers in the table, do the math, then rerun at the same epochs.
I'll update the table again shortly.
@hosseinfani
I was able to run the Transformer model as part of one of this week's tasks: finding one more architecture to include in the comparisons. The following results were run 2-3 days ago (before we had the discussion about making as many settings the same/similar as possible), which is why the settings aren't close. But it shows I was able to run one more model. I'll adjust the settings to be as close as I reasonably can for future comparisons.
Note: the epoch values also look strange because, apparently, OpenNMT-py uses "steps" instead of epochs to determine the training cycles. I realized this only after these tests, and I used the following formula to convert from steps to epochs, which is why the epoch values look odd. I'll address this better in future tests.
Formula for steps to epochs:
Steps per epoch = Sample size / Batch size
Number of epochs = Train steps / Steps per epoch
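A minimal sketch of that conversion (note that with batch_type: "tokens" the batch size counts tokens rather than examples, so this is only an approximation):

def steps_to_epochs(train_steps, num_samples, batch_size):
    # steps_per_epoch = samples / batch_size; epochs = train_steps / steps_per_epoch
    steps_per_epoch = num_samples / batch_size
    return train_steps / steps_per_epoch

# Example with assumed numbers: 200k train steps over ~32k samples, batch size 64
print(round(steps_to_epochs(200_000, 32_059, 64), 1))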
Transformer model, "gith1", on gith dataset
Transformer model, "imdb1", on imdb dataset
The CSV files are available in the OpeNTF - NMT folder on MS Teams.
The hyperparameter settings used for the above:
Transformer gith1
# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 4000
valid_steps: 5000
train_steps: 200000
# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]
# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]
beam_size: 5
length_penalty: 1.0
Transformer imdb1
# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000
# Batching
bucket_size: 5000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]
# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]
beam_size: 5
length_penalty: 1.0
These settings are slightly different to accommodate the differences in the datasets (i.e., a larger dataset requires more train steps).
Again, this is to show that a 3rd model is available for future comparisons. I'll continue tweaking the settings to bring them as close as possible to those of the baselines we'll be comparing against.
@thangk thanks for the update. Just a quick note: please put the results of different datasets in different tables.
Update
The following are the latest results for the 3 models (t-teamrec, c-teamrec, r-teamrec) at the same settings. I'm working on finding the best settings for each model on each dataset, to be finished within the coming weeks (I'll update with further information). I'll do the same for two new models which I'm also looking to add to the research.
@thangk thank you. Few notes:
Tested dataset
data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3
Input type
Sparse matrix
Command used
python -u main.py -data ../data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3 -domain dblp -model nmt
Observations
The script ran through all 3 folds and produced results without errors, but no predictions.
Next step(s)