facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.
MIT License

Outputted files have erroneous characters [bug on mac os] #17

Open mxl1n opened 3 years ago

mxl1n commented 3 years ago

Hi, I was trying to reproduce the results in the TransCoder paper, but ran into some issues when computing the computational accuracy. It seems that when ids or generated programs are written to files or to the log, the text contains stray characters.

For example, in ids.java_sa-python_sa.text.txt (created by create_referencefiles in evaluator.py), the lines look like “CHECK@@ WHE@@ THER@@ GI@@ V@@ EN@@ NUMBER@@ EV@@ EN@@ O@@ DD”.

The scripts generated by the model (e.g. in eval_scripts/java_sa-python_sa.test/) that are used to compute the computational accuracy similarly contain erroneous characters and spaces (which cause syntax errors), e.g.:

def f_filled is_@@ ap ( arr , n ) :
    if n == 1 : return True
    arr.sort ( )
    d = arr [ 1 ] - arr [ 0 ]
    for i in range ( 2 , n ) :
        if arr [ i ] - arr [ i - 1 ] != d : return False
    return True

If it is relevant, I am on macOS Catalina with Python 3.9, and this is the command I have been running to evaluate the provided TransCoder models:

python codegen_sources/model/train.py \
--eval_only True \
--reload_model 'TransCoder_model_1.pth,TransCoder_model_2.pth' \
--data_path "test_dataset" \
--exp_name transcoder \
--dump_path 'dump' \
--lgs 'java_sa-python_sa'  \
--bt_steps 'python_sa-java_sa-python_sa,java_sa-python_sa-java_sa'  \
--ae_steps 'python_sa,java_sa'  \
--mt_steps 'java_sa-python_sa,python_sa-java_sa' \
--encoder_only False \
--emb_dim 1024 \
--n_heads 8 \
--n_layers 0 \
--n_layers_encoder 6  \
--n_layers_decoder 6 \
--eval_bleu true \
--eval_computation true \
--has_sentences_ids true

Thank you.

baptisteroziere commented 3 years ago

Hi, I guess you are referring to the @@ characters? These characters were created by fastBPE when breaking the tokens into subtokens. To undo the BPE, you can just call .replace("@@ ", "") on the strings you obtained. The function doing that should be called in evaluator.py if you run train.py.
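
For illustration, here is a minimal sketch of that detokenization step applied to a fragment of the output quoted above (the variable names are mine, for illustration only):

# fastBPE marks a subtoken that should be glued to the following one with "@@ ",
# so stripping the marker restores the original tokens.
line = "def f_filled is_@@ ap ( arr , n ) :"
restored = line.replace("@@ ", "")
print(restored)  # -> "def f_filled is_ap ( arr , n ) :"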

Note: you should use either --reload_model 'TransCoder_model_1.pth,TransCoder_model_1.pth' or --reload_model 'TransCoder_model_2.pth,TransCoder_model_2.pth'. These models correspond to two different checkpoints, each with better validation scores for different language pairs, and the --reload_model parameter takes an "encoder,decoder" pair. So here you are reloading an encoder and a decoder coming from different checkpoints, which is likely to perform worse.

mxl1n commented 3 years ago

Thanks Baptiste! I realized the issue was that the function restore_fastBPE_segmentation in utils.py performs the replace you describe using sed; however, the version of sed that ships with macOS requires an additional argument when using the -i flag. (https://stackoverflow.com/questions/16745988/sed-command-with-i-option-in-place-editing-works-fine-on-ubuntu-but-not-mac)

(Anyhow, I am using a Linux machine now and no longer have the issue.)
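
For anyone hitting the same thing on macOS, a portable pure-Python sketch of that in-place "@@ " removal could look like the snippet below (this is not the repo's actual function, which shells out to sed, and it assumes the file is small enough to read into memory):

from pathlib import Path

def restore_fastbpe_file(path):
    # Remove fastBPE's "@@ " continuation markers from a file in place,
    # avoiding the GNU/BSD sed -i incompatibility described above.
    p = Path(path)
    p.write_text(p.read_text().replace("@@ ", ""))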

akshitdewan commented 3 years ago

@maxl1n I'm trying to run the same script as you, but I'm having some issues. Namely, it looks for the following files but can't find them:

test_dataset/train.java_sa.pth not found
test_dataset/valid.java_sa.pth not found
test_dataset/test.java_sa.pth not found
test_dataset/train.python_sa.pth not found
test_dataset/valid.python_sa.pth not found
test_dataset/test.python_sa.pth not found
test_dataset/train.java_sa-python_sa.java_sa.pth not found
test_dataset/train.java_sa-python_sa.python_sa.pth not found

and gets stuck after this log message:

0 - Number of nodes: 1
0 - Node ID        : 0
0 - Local rank     : 0
0 - Global rank    : 0
0 - World size     : 1
0 - GPUs per node  : 1
0 - Master         : True
0 - Multi-node     : False
0 - Multi-GPU      : False
0 - Hostname       : <host>

Any ideas?