Open mxl1n opened 3 years ago
Hi,
I guess you are referring to the @@ characters? These characters were created by fastBPE when breaking tokens into subtokens. To undo the BPE, you can just call .replace("@@ ", "") on the strings you obtained. The function doing that should be called in evaluator.py if you run train.py.
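As a minimal illustration of that replacement (the helper name restore_bpe is hypothetical, not the actual function in the repo):

```python
def restore_bpe(text: str) -> str:
    """Undo fastBPE segmentation by joining subtokens marked with '@@ '."""
    return text.replace("@@ ", "")

# Subtokens 'GI@@ V@@ EN' are rejoined into 'GIVEN'; word boundaries survive.
print(restore_bpe("GI@@ V@@ EN NUMBER"))  # → GIVEN NUMBER
```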
Note: you should use either --reload_model 'TransCoder_model_1.pth,TransCoder_model_1.pth'
or --reload_model 'TransCoder_model_2.pth,TransCoder_model_2.pth'
. These models correspond to two different checkpoints, with better validation scores for different language pairs, and the --reload_model parameter takes a pair "encoder,decoder". So here you are reloading an encoder and a decoder coming from different checkpoints, which is likely to perform worse.
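A rough sketch of how an "encoder,decoder" argument of that form is typically split (hypothetical helper; the actual parsing lives inside TransCoder's argument handling):

```python
def parse_reload_model(arg: str):
    """Split a --reload_model value of the form 'encoder_path,decoder_path'."""
    encoder_path, decoder_path = arg.split(",")
    return encoder_path, decoder_path

# Reload encoder and decoder from the SAME checkpoint, as recommended above:
enc, dec = parse_reload_model("TransCoder_model_1.pth,TransCoder_model_1.pth")
```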
Thanks Baptiste! I realized the issue was that the function restore_fastBPE_segmentation in utils.py performs the replacement you describe using sed; however, the version of sed on Macs requires an additional argument when using the -i flag. (https://stackoverflow.com/questions/16745988/sed-command-with-i-option-in-place-editing-works-fine-on-ubuntu-but-not-mac)
(Anyhow I am using a linux machine now and do not have the issue anymore.)
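For anyone hitting the same thing on macOS: BSD sed requires a (possibly empty) backup suffix after -i, while GNU sed does not. A small sketch of the difference, plus a portable temp-file alternative (the file names here are just for illustration):

```shell
# GNU sed (Linux):   sed -i 's/@@ //g' file.txt
# BSD sed (macOS):   sed -i '' 's/@@ //g' file.txt
# Portable version — write to a temp file, then move it into place:
printf 'GI@@ V@@ EN NUMBER\n' > bpe_demo.txt
sed 's/@@ //g' bpe_demo.txt > bpe_demo.tmp && mv bpe_demo.tmp bpe_demo.txt
cat bpe_demo.txt   # GIVEN NUMBER
```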
@maxl1n I'm trying to run the same script as you, but I'm having some issues. Namely, it looks for the following files but can't find them:
test_dataset/train.java_sa.pth not found
test_dataset/valid.java_sa.pth not found
test_dataset/test.java_sa.pth not found
test_dataset/train.python_sa.pth not found
test_dataset/valid.python_sa.pth not found
test_dataset/test.python_sa.pth not found
test_dataset/train.java_sa-python_sa.java_sa.pth not found
test_dataset/train.java_sa-python_sa.python_sa.pth not found
and gets stuck after this log message:
0 - Number of nodes: 1
0 - Node ID : 0
0 - Local rank : 0
0 - Global rank : 0
0 - World size : 1
0 - GPUs per node : 1
0 - Master : True
0 - Multi-node : False
0 - Multi-GPU : False
0 - Hostname : <host>
Any ideas?
Hi, I was trying to reproduce the results in the TransCoder paper, but ran into some issues when computing the computational accuracy. It seems that when writing things such as IDs or output programs to files, or printing them in the log, the text has some issues.
For example, in ids.java_sa-python_sa.text.txt (created by create_referencefiles in evaluator.py), the lines look like “CHECK@@ WHE@@ THER@@ GI@@ V@@ EN@@ NUMBER@@ EV@@ EN@@ O@@ DD”.
The scripts output by the model (e.g. in eval_scripts/java_sa-python_sa.test/) that are used to compute the computational accuracy similarly contain erroneous characters and spaces (which cause syntax errors), e.g.:
If it is relevant, I am on macOS Catalina with Python 3.9, and this is the command I have been running to evaluate the provided TransCoder models:
Thank you.