fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

OpeNTF via NMT (OpenNMT) #243

Open thangk opened 5 months ago

thangk commented 5 months ago

Tested dataset

data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3

Input type

Sparse matrix

Command used

python -u main.py -data ../data/preprocessed/dblp/dblp.v12.json.filtered.mt75.ts3 -domain dblp -model nmt

Observations

The script ran through all 3 folds and produced results without errors, but it did not generate any predictions.

image

Next step(s)

hosseinfani commented 5 months ago

Hi @thangk, thanks for the progress log.

OpenNMT only gives you translation metrics like perplexity (ppl), as seen in the image.
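(For reference, and not specific to OpenNMT: perplexity is just the exponential of the average per-token negative log-likelihood on the held-out set,

$$ \mathrm{ppl} = \exp\Big(-\tfrac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid y_{<i}, x)\Big), $$

so lower ppl means the model assigns higher probability to the reference target tokens, here the member sequences, but it does not by itself give the ranked member predictions we need for the IR metrics.)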

@jamil2388 please advise

jamil2388 commented 5 months ago

@hosseinfani, @thangk for now, I am putting a doc link here. It contains almost all of the arguments used in the onmt pipeline.

https://community.libretranslate.com/t/documentation-for-opennmt-py-parameters/927/

I think looking into this argument on that page might help us with prediction file dumping: `--dump_preds`
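In case it helps while we look into that: a fallback I know of in plain OpenNMT-py (independent of our nmt.py wrapper; the checkpoint and file names below are placeholders) is the onmt_translate CLI, which writes one predicted target line per source line to the path given by -output, e.g.

onmt_translate -model ../output/nmt/run/model_step_5000.pt -src test.src -output test.pred.txt -beam_size 5 -gpu 0

That predictions file could then be scored with our own evaluation code rather than relying only on ppl.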

I would also advise Kap to learn about the behavior of the translation metrics used in the current runs, because that will be crucial for understanding the models' train and test behavior and will eventually tell us which direction to adjust.

Thanks!

hosseinfani commented 5 months ago

@jamil2388 thanks.

@thangk one more thing: when exploring hyperparameters, also look at how you can use OpenNMT for different types of translators, because we need to study the effect of the translator in our work. These translators should be published in a paper so that we can cite them in ours. I think the OpenNMT community keeps updating their codebase to include more and more new translators, which helps you with our task (this is like @jamil2388 using different gnn methods from pyg for team formation).

thangk commented 5 months ago

Hi @hosseinfani, I'll continue my question here if that's okay.

(continuing the conversation about whether or not to average all the folds' eval metrics to get one set of numbers for each epoch setting, i.e., 500 and 1000)

I was referring to these. Each fold produces its own eval metrics; there's one more, fold2, below fold1, which isn't visible in the screenshot. I'm thinking the right approach is to average the e500 and e1000 pairs across all 3 folds before putting them in the Excel sheet.

image
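For the actual averaging step, this is roughly what I have in mind (a sketch only; the per-fold file names are made up and would need to match our real output layout):

```python
# Sketch: average per-fold test metrics into one table per epoch setting (e500, e1000).
# File names like "f0.test.e500.csv" are hypothetical placeholders for our fold outputs.
import pandas as pd

folds = [0, 1, 2]
for epochs in (500, 1000):
    per_fold = [pd.read_csv(f"f{fold}.test.e{epochs}.csv", index_col=0) for fold in folds]
    mean_df = sum(per_fold) / len(per_fold)      # element-wise mean over the 3 folds
    mean_df.to_csv(f"test.e{epochs}.mean.csv")   # this is what goes into the Excel sheet
```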

thangk commented 5 months ago

I looked at some charts we've used in past papers, and those papers report the average over the folds. I'll follow the same approach.

hosseinfani commented 5 months ago

Hi @thangk, thanks for bringing the conversation here :)

Now I see. There should also be another file with no fold index, like test.epoch*, that includes the average over the folds.

But you're right about averaging over the folds.

thangk commented 5 months ago

> There should also be another file with no fold index, like test.epoch*, that includes the average over the folds.

Yes, I see one outside the fold folders.

image

hosseinfani commented 5 months ago

@thangk my preference is to keep progress logs like this in the issue, rather than in chats on Teams or elsewhere.

thangk commented 4 months ago

Yesterday, I ran three seq2seq-based models (Transformer, ConvS2S, RNN with attention) on the dblp (filtered) dataset; out of the three, only two (ConvS2S and RNN with attention) ran successfully with the baseline configs I'd set.

Here are the first run results for ConvS2S (left) and RNN with attention (right)

image

It seems there's an issue with the shape of the input to the Transformer model; I'll dig into it.

image

thangk commented 3 months ago

This was the first run of all datasets using the ConvS2S model.

image

Hyperparameters:

word_vec_size: 128
cnn_size: 512
layers: 15
cnn_kernel_width: 3

encoder_type: cnn
decoder_type: cnn

optim: adam
learning_rate: 0.001
learning_rate_decay: 0.9
start_decay_steps: 50
decay_steps: 50
batch_size: 4
dropout: 0.5
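(For completeness: these options sit in the usual OpenNMT-py config YAML alongside the data/vocab paths, and the run itself boils down to OpenNMT-py's standard two-step CLI, which our nmt.py wrapper presumably drives under the hood; the config file name below is a placeholder.)

onmt_build_vocab -config convs2s_dblp.yaml -n_sample -1
onmt_train -config convs2s_dblp.yaml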
hosseinfani commented 3 months ago

@thangk can you put the results of pure bnn and fnn, with 1-hot skills as the input?

thangk commented 3 months ago

> @thangk can you put the results of pure bnn and fnn, with 1-hot skills as the input?

I was thinking of putting the best results from Jamil's FNN and BNN. Do you want me to put the pure FNN and BNN from Rad et al.'s paper instead?

hosseinfani commented 3 months ago

Yes, I believe Jamil has already reproduced the results.

thangk commented 3 months ago

> Yes, I believe Jamil has already reproduced the results.

yeah, he has the results for imdb and dblp. I'm gathering them for these tables.

thangk commented 3 months ago

@hosseinfani

This is what I currently have for imdb; I'm working on dblp now. The Transformer model still isn't working quite right and needs more debugging.

dblp

- ConvS2S: t99375.s29661.m14214.etcnn.l512.wv256.lr0.0005.b16.e1000
- RNN: t99375.s29661.m14214.etrnn.l512.wv256.lr0.0005.b16.e1000

image

imdb

- ConvS2S: t32059.s23.m2011.etcnn.l512.wv256.lr0.0005.b16.e1000
- RNN: t32059.s23.m2011.etrnn.l512.wv256.lr0.0005.b16.e1000

image

hyperparameters for Run 3:

# ConvS2S
word_vec_size: 256
cnn_size: 512
layers: 10
cnn_kernel_width: 3

encoder_type: cnn
decoder_type: cnn

optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4

# RNN
word_vec_size: 256
rnn_size: 512
layers: 2

encoder_type: rnn
decoder_type: rnn
rnn_type: LSTM

optim: adam
learning_rate: 0.0005
learning_rate_decay: 0.95
start_decay_steps: 100
decay_steps: 100
batch_size: 16
dropout: 0.4

Edit: dblp results added.

hosseinfani commented 3 months ago

> How would I say this if I don't see a substantial performance improvement over the others yet? Since it's a first for team formation, I'm not sure I even have the best settings yet. I've tried a few, but they still aren't as good as the fnn or bnn values. Can I say it has potential to be a viable option for team formation tasks but needs further research?

Hi @thangk

Here is my reply:

Regarding the low performance of seq2seq, you need to know that these models map one sentence to another, i.e., the input and output spaces are both the size of a language's token vocabulary (~100k), while preserving the order of tokens. If they're not performing well, we need to find out why, and then how to change/customize them for our problem. For imdb, it makes sense, because the input space is just 20-30 words that must be mapped to a much larger output space, so we can point to the sparsity of the source sequence/language. What else?

thangk commented 3 months ago

> How would I say this if I don't see a substantial performance improvement over the others yet? Since it's a first for team formation, I'm not sure I even have the best settings yet. I've tried a few, but they still aren't as good as the fnn or bnn values. Can I say it has potential to be a viable option for team formation tasks but needs further research?
>
> Hi @thangk
>
> Here is my reply:
>
> Regarding the low performance of seq2seq, you need to know that these models map one sentence to another, i.e., the input and output spaces are both the size of a language's token vocabulary (~100k), while preserving the order of tokens. If they're not performing well, we need to find out why, and then how to change/customize them for our problem. For imdb, it makes sense, because the input space is just 20-30 words that must be mapped to a much larger output space, so we can point to the sparsity of the source sequence/language. What else?

I see. I've also added the pure bnn, bnn_emb, and rrn results from other papers as baselines. After doing this, my results aren't too far off; some are even better than the baselines. So this validates my statement in the abstract.

Still, I'm eager to find better-optimized hyperparameters and will keep at it. In the meantime, I'll keep these numbers and work more on the write-up. I'm also running gith and uspt on both ConvS2S and RNN with the same hyperparameters.

thangk commented 3 months ago

Results for gith and uspt with ConvS2S and RNN, using the same hyperparameters as for the other two datasets:

image

image

thangk commented 3 months ago

I ran a bunch of tests today to see how the metrics respond to the hyperparameters.

image

Run 3 produced the best results so far (except for AUCROC, which I'll keep working on), and I'll run more tests starting from this configuration.

thangk commented 3 months ago

I noticed that we hardcoded the checkpoint interval to 500 in nmt.py even though we have a field for it in the config file; that explains why my latest run with a large epoch count was using so much space. I've commented the hardcoded value out now, so it shouldn't have this space issue anymore.
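The cleaner fix, I think, is to read the interval from the config rather than hardcoding it. A minimal sketch (the key name save_checkpoint_steps matches OpenNMT-py's option, but the config path and how nmt.py builds its training arguments are assumptions on my part):

```python
# Sketch: take the checkpoint interval from the YAML config instead of a hardcoded 500.
# "config.yaml" and the fallback value are placeholders, not verified against nmt.py.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# fall back to a coarse interval so long runs don't flood the disk with checkpoints
interval = cfg.get("save_checkpoint_steps", 5000)
train_args = ["-save_checkpoint_steps", str(interval)]
print(train_args)  # e.g. ['-save_checkpoint_steps', '5000']
```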

image

It was using a lot of space

image

I'll delete these checkpoints as soon as training finishes. I've calculated how much more space it will take, and we have enough to complete this training.

thangk commented 3 months ago

I was able to run the Transformer model with the following settings:

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0
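(Side note on learning_rate: 1 looking odd: with decay_method: "noam", my understanding is that the configured rate only scales the standard Transformer schedule,

$$ \mathrm{lr}(t) = \mathrm{lr}_{\text{base}} \cdot d_{\text{model}}^{-0.5} \cdot \min\!\big(t^{-0.5},\; t \cdot \text{warmup\_steps}^{-1.5}\big), $$

so the effective rate is mainly governed by hidden_size and warmup_steps rather than by the nominal value; worth double-checking against the OpenNMT-py docs.)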

And here's the result compared to the others:

image

hosseinfani commented 3 months ago

We need to make the comparison both among the nmt models themselves and against the bnn and fnn models. So, please do the following:

This way, we can argue that although we run the nmt models with more layers or epochs, which may give them an advantage over the bnn and fnn, the bnn and fnn cannot even take advantage of more epochs or layers for the same running time/memory.

@thangk let me know if you need more clarification.

thangk commented 3 months ago

@hosseinfani

Okay, I will redo the models with settings comparable to the FNN and BNN's. Apparently, the runs I posted in the tables are measured in steps instead of epochs. I'll find the epoch values behind Jamil's FNN and BNN numbers that I used in the table, do the math, and then rerun at the same number of epochs.

I'll update the table again shortly.

thangk commented 3 months ago

@hosseinfani

I was able to run the Transformer model as part of this week's task of finding one more architecture to include in the comparisons. The following results were run 2-3 days ago (before our discussion about keeping as many settings as possible the same/similar), which is why the settings aren't close. But it shows I was able to run one more model. I'll adjust the settings to be as close as reasonably possible for future comparisons.

Note: the epoch values also look strange because OpenNMT-py apparently uses "steps" rather than epochs to control training. I only realized this after these tests and used the following formula to convert from steps to epochs, which is why the epoch values are odd. I'll handle this better in future tests.

Formula for steps to epochs:

Steps per epoch = number of training samples / batch size
Number of epochs = train steps / steps per epoch
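A quick sanity check of that conversion in code (the numbers are just an example pulled from the imdb run ID t32059 and the config below; with batch_type: "tokens" the batch size counts tokens rather than teams, so this is only a rough approximation):

```python
# Rough conversion between OpenNMT-py train_steps and epochs.
# Treats batch_size as "samples per step", which is only approximate with token-based batching.
def steps_to_epochs(train_steps: int, num_samples: int, batch_size: int) -> float:
    steps_per_epoch = num_samples / batch_size
    return train_steps / steps_per_epoch

# e.g., 200,000 train steps over ~32,059 imdb training teams with batch_size 64
print(round(steps_to_epochs(200_000, 32_059, 64), 1))  # ~399.3 "epochs"
```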

Transformer model, "gith1", on gith dataset

image

Transformer model, "imdb1", on imdb dataset

image

The CSV files are available in the OpeNTF - NMT folder on MS Teams.

The hyperparameter settings used for the above:

Transformer gith1

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 4000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 10000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0

Transformer imdb1

# General opts
save_model: ../output/nmt/run/transformer_model
save_checkpoint_steps: 5000
warmup_steps: 8000
valid_steps: 5000
train_steps: 200000

# Batching
bucket_size: 5000
world_size: 1
gpu_ranks: [0]
num_workers: 8
batch_type: "tokens"
batch_size: 64
valid_batch_size: 128
accum_count: [1]

# Optimization
model_dtype: "fp16"
optim: adam
weight_decay: 0.0001
learning_rate: 1
decay_method: "noam"
adam_beta2: 0.998
learning_rate_decay: 0.95
decay_steps: 10000
max_grad_norm: 5
label_smoothing: 0.05
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model hyperparameters
encoder_type: transformer
decoder_type: transformer
position_encoding: true
max_relative_positions: 10
enc_layers: 4
dec_layers: 4
heads: 4
hidden_size: 256
word_vec_size: 256
transformer_ff: 2048
dropout: [0.3]
attention_dropout: [0.3]

beam_size: 5
length_penalty: 1.0

These settings are slightly different to accommodate the differences between the datasets (i.e., a larger dataset requires more train steps).

Again, this is to show that a 3rd model is available for future comparisons. I'll continue tweaking the settings to bring them as close as possible to the baselines we'll be comparing against.

hosseinfani commented 3 months ago

@thangk thanks for the update. Just a quick note: please put the results for different datasets in separate tables.

thangk commented 2 weeks ago

Update

The following are the latest results for the 3 models (t-teamrec, c-teamrec, r-teamrec) at the same settings. I'm working on finding the best settings for each model on each dataset and expect to finish within the coming weeks (I'll update with further information). I'll do the same for two new models which I'm also looking to add to the research.

image image image image

hosseinfani commented 2 weeks ago

@thangk thank you. A few notes: