dirkiedai / sk-mt

This is the official code for our paper "Simple and Scalable Nearest Neighbor Machine Translation" (ICLR 2023).

Failure to Reproduce Results on the Multi-Domain Dataset #5

Open mczhou178 opened 10 months ago

mczhou178 commented 10 months ago

Thanks for your nice work! I am trying to reproduce the results on the multi-domain datasets, but the numbers I get are quite different from those reported in the paper. When I follow your guidance and run the whole pipeline starting from the text retrieval step, the final results obtained with my retrieval-processed test_tm do not reach the results I get with the test_tm you provide. The hyperparameters are consistent with your paper. My results are listed below; except for the koran domain, every domain falls well short of the paper:

koran:   19.52    paper: 18.9
it:      41.91    paper: 43.9
medical: 51.24    paper: 55.2
law:     55.50    paper: 61.6

Below are the scripts I use to reproduce the law domain; please check whether there is any problem with them. My environment is pytorch=1.12.0, python=3.8, numpy=1.23.0, elasticsearch=7.0.0, faiss-gpu=1.7.3. The BPE processing and fairseq binarization of my data are also consistent with the log information of the data you provided.

1. Retrieval

PROJECT_PATH=/home/npc/sk-mt-fairseq
domain=law
type=test
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain

# join source and target into tab-separated (src<TAB>tgt) files for indexing and search
for split in train dev test
do
    paste -d '\t' $DATA_PATH/$split.bpe.de $DATA_PATH/$split.bpe.en > $DATA_PATH/$split.txt
done

# BM25 retrieval: index the in-domain training pairs and retrieve the
# top-64 neighbours for every sentence in the $type split
python $PROJECT_PATH/bm25_retrieval.py \
    --build_index --search_index \
    --index_file $DATA_PATH/train.txt \
    --search_file $DATA_PATH/$type.txt \
    --output_file $DATA_PATH/$domain.$type \
    --index_name $domain --topk 64 \
    --task domain_adaptation
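
(If the retrieval step itself is in doubt, e.g. because of the Elasticsearch version, one quick sanity check is to confirm that the index built by bm25_retrieval.py contains one document per training pair. The commands below only use the standard Elasticsearch 7.x REST API on the default port and assume the index is named after --index_name as above; they are a rough check, not part of the official pipeline.)

# list all indices with their document counts (standard ES 7.x cat API)
curl -s "localhost:9200/_cat/indices?v"

# the index built from train.txt should hold roughly one document per line
curl -s "localhost:9200/$domain/_count?pretty"
wc -l $DATA_PATH/train.txt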

2. Process

PROJECT_PATH="/home/npc/sk-mt-fairseq"
domain=law
DATA_PATH="/home/npc/sk-mt-fairseq/process-data/bpe-data/$domain"
tmp_dir="/home/npc/sk-mt-fairseq/process-data/tmp_dir/$domain/$type"
type=test
max_t=64

python $PROJECT_PATH/data_clean.py \
        --input $DATA_PATH/$domain.$type \
        --output $tmp_dir --subset $type \
        --max-t $max_t\
        --task translation

DEST_PATH="/home/npc/sk-mt-fairseq/process-data/data-bin/$domain/test_tm"
DICT_PATH="/home/npc/base-model"
# binarize each TM slot separately, reusing the pretrained model's dictionaries
for i in $(seq 1 $max_t)
do
    if [ $type == 'dev' ]
    then
        fairseq-preprocess --validpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    else
        fairseq-preprocess --testpref $tmp_dir/${type}${i} -s de -t en --destdir $DEST_PATH/$i --srcdict $DICT_PATH/dict.de.txt --tgtdict $DICT_PATH/dict.en.txt --workers 20
    fi
done
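
(A small consistency check for step 2, assuming data_clean.py writes plain-text pairs named ${type}${i}.de / ${type}${i}.en under $tmp_dir, which is the naming the fairseq-preprocess loop above relies on: every TM slot is presumably line-aligned with the test set, so differing line counts would point to a problem.)

# each slot i is expected to have one line per test sentence
ref_lines=$(wc -l < $DATA_PATH/test.bpe.de)
for i in $(seq 1 $max_t)
do
    lines=$(wc -l < $tmp_dir/${type}${i}.de)
    if [ "$lines" -ne "$ref_lines" ]; then
        echo "slot $i: $lines lines (expected $ref_lines)"
    fi
done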

3. Inference with SK-MT

MODEL_PATH=/home/npc/base-model/wmt19.de-en.ffn8192.pt
domain=law
OUTPUT_PATH=/home/npc/sk-mt-fairseq/output
DATA_PATH=/home/npc/sk-mt-fairseq/process-data/data-bin/$domain
#DATA_PATH=/home/npc/sk-mt-fairseq/binarized_data/$domain
 mkdir -p "$OUTPUT_PATH"

CUDA_VISIBLE_DEVICES=0 python3 experimental_generate.py $DATA_PATH \
    --gen-subset test \
    --path $MODEL_PATH --arch transformer_wmt19_de_en_with_datastore \
    --task translation_tm \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang de --target-lang en \
    --scoring sacrebleu \
    --batch-size 16 \
    --tm-counts 16 \
    --fp16 \
    --tokenizer moses --remove-bpe \
    --model-overrides "{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
    'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
    'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100,
     }" \
    | tee "$OUTPUT_PATH"/generate_$domain.txt
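
(To cross-check the score reported by --scoring sacrebleu, the hypotheses can also be pulled out of the generation log and re-scored with the sacrebleu command-line tool. This is a generic fairseq log-parsing recipe rather than anything specific to this repo; with --remove-bpe and --tokenizer moses, both the H- (hypothesis) and T- (reference) lines should already be detokenized.)

GEN="$OUTPUT_PATH"/generate_$domain.txt
# hypotheses are printed as tab-separated "H-<id> <score> <text>" lines and
# references as "T-<id> <text>"; sort by sentence id to restore the input order
grep ^H- "$GEN" | LC_ALL=C sort -V | cut -f3- > "$OUTPUT_PATH"/hyp_$domain.txt
grep ^T- "$GEN" | LC_ALL=C sort -V | cut -f2- > "$OUTPUT_PATH"/ref_$domain.txt
sacrebleu "$OUTPUT_PATH"/ref_$domain.txt -i "$OUTPUT_PATH"/hyp_$domain.txt -m bleu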

Could you please help me figure out what the problem is? Looking forward to your reply.

dirkiedai commented 10 months ago

Hi, there! What are the reproducible results using our provided binarized data?

mczhou178 commented 10 months ago

> Hi, there! What are the reproducible results using our provided binarized data?

Hi, the results in the paper can indeed be reproduced using the binarized data you provide (including test_tm), but not using the data I processed myself. I don't know whether this is due to my environment or to a problem in my preprocessing (tokenization and BPE). Below are the scripts I use, together with the dictionary and BPE codes of the pretrained model:

1. Tokenization and BPE

DOMAIN=$1
DATADIR=/home/npc/sk-mt-fairseq/process-data/multi_domain/$DOMAIN
BPEDATA=/home/npc/sk-mt-fairseq/process-data/bpe-data/$DOMAIN
HOME=/home/npc/sk-mt-fairseq/process-data
if [ -z $HOME ]
then
  echo "HOME var is empty, please set it"
  exit 1
fi
SCRIPTS=$HOME/mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
FASTBPE=$HOME/fastBPE
BPECODES=/home/npc/base-model/ende30k.fastbpe.code
VOCAB=/home/npc/base-model/dict.en.txt

if [ ! -d "$SCRIPTS" ]; then
  echo "Please set SCRIPTS variable correctly to point to Moses scripts."
  exit
fi

mkdir -p ${BPEDATA}

filede=${DATADIR}/train.de
fileen=${DATADIR}/train.en

# normalize punctuation, strip non-printing characters, then tokenize the training data
cat $filede | \
  perl $NORM_PUNC de | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 8 -a -l de  >> ${BPEDATA}/train.tok.de

cat $fileen | \
  perl $NORM_PUNC en | \
  perl $REM_NON_PRINT_CHAR | \
  perl $TOKENIZER -threads 8 -a -l en  >> ${BPEDATA}/train.tok.en

$FASTBPE/fast applybpe ${BPEDATA}/train.bpe.de ${BPEDATA}/train.tok.de $BPECODES $VOCAB
$FASTBPE/fast applybpe ${BPEDATA}/train.bpe.en ${BPEDATA}/train.tok.en $BPECODES $VOCAB

#perl $CLEAN -ratio 1.5 ${BPEDATA}/train.bpe1 de en ${BPEDATA}/train.bpe 1 250

for split in dev test
do
  filede=${DATADIR}/${split}.de
  fileen=${DATADIR}/${split}.en

  cat $filede | \
    perl $TOKENIZER -threads 8 -a -l de  >> ${BPEDATA}/${split}.tok.de

  cat $fileen | \
    perl $TOKENIZER -threads 8 -a -l en  >> ${BPEDATA}/${split}.tok.en

  $FASTBPE/fast applybpe ${BPEDATA}/${split}.bpe.de ${BPEDATA}/${split}.tok.de $BPECODES $VOCAB
  $FASTBPE/fast applybpe ${BPEDATA}/${split}.bpe.en ${BPEDATA}/${split}.tok.en $BPECODES $VOCAB
done
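
(Nothing in this stage is specific to sk-mt, so a quick sanity check is simply to verify that line counts survive tokenization and BPE, and that the output carries the usual fastBPE "@@" continuation markers, which is what --remove-bpe strips again at generation time.)

# line counts must be unchanged for every split and language
for split in train dev test; do
  for lang in de en; do
    echo "$split.$lang: $(wc -l < ${DATADIR}/${split}.${lang}) raw," \
         "$(wc -l < ${BPEDATA}/${split}.bpe.${lang}) after BPE"
  done
done

# eyeball a few lines: non-final subword pieces should end in "@@"
head -n 3 ${BPEDATA}/test.bpe.de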

2. Fairseq binarization

domain=medical
python /home/npc/sk-mt-fairseq/fairseq_cli/preprocess.py \
    --source-lang de --target-lang en \
    --trainpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/train.bpe \
    --validpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/dev.bpe \
    --testpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/test.bpe \
    --destdir /home/npc/sk-mt-fairseq/process-data/data-bin/$domain \
    --srcdict /home/npc/base-model/dict.de.txt \
    --joined-dictionary
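
(Since the released binarized data does reproduce the paper numbers, one way to localize the mismatch is to diff this data-bin against it. The reference path below is hypothetical, i.e. wherever the provided data was unpacked; byte-identical .bin/.idx files are only expected when the fairseq version and dataset implementation also match, but differing dictionaries or file sizes already show that the divergence happens before binarization.)

MY_BIN=/home/npc/sk-mt-fairseq/process-data/data-bin/$domain
REF_BIN=/path/to/provided/data-bin/$domain   # hypothetical location of the released data

# the dictionaries should be identical, since both come from the pretrained model
diff -q $MY_BIN/dict.de.txt $REF_BIN/dict.de.txt
diff -q $MY_BIN/dict.en.txt $REF_BIN/dict.en.txt

# compare the binarized test split (fairseq names it test.de-en.{de,en}.{bin,idx})
md5sum $MY_BIN/test.de-en.* $REF_BIN/test.de-en.*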

I don't know whether this is related to the Elasticsearch version or whether there is a problem in my data processing pipeline. Could you provide your data processing scripts? Thanks.

mczhou178 commented 10 months ago

Hello, could you provide a script for data processing? Thanks

dirkiedai commented 10 months ago

Sorry for the late response. Your data processing is almost the same as ours. We are working on finding out what caused the mismatch.

mczhou178 commented 10 months ago

Thank you for your reply. I have tried many things, but my results are still quite different from those obtained with the data you provide. If you find the cause, I look forward to your reply and an update. Thank you again.

dirkiedai commented 9 months ago

Hello, could you kindly send your processed data with the corresponding retrieval samples to my email? The textual data is preferred. My email address is dirkiedye@gmail.com.

I have noticed that the reproduced results are evaluated on the test set, whereas the scores reported in our paper are evaluated on the development set, which we use for hyper-parameter selection. Please refer to Section A.3, HYPER-PARAMETERS SELECTION.
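
(For reference, a minimal sketch of what evaluating on the development set could look like with the commands from this thread: it reuses the variables and flags from the test-set command above and only switches the generation subset. It also assumes the dev-side retrieval, i.e. steps 1 and 2 re-run with type=dev, has been binarized into a layout analogous to test_tm; that layout is an assumption rather than something stated in this thread.)

# dev was binarized via --validpref, so fairseq exposes it as the "valid" subset
CUDA_VISIBLE_DEVICES=0 python3 experimental_generate.py $DATA_PATH \
    --gen-subset valid \
    --path $MODEL_PATH --arch transformer_wmt19_de_en_with_datastore \
    --task translation_tm \
    --beam 4 --lenpen 0.6 --max-len-a 1.2 --max-len-b 10 --source-lang de --target-lang en \
    --scoring sacrebleu --batch-size 16 --tm-counts 16 --fp16 \
    --tokenizer moses --remove-bpe \
    --model-overrides "{'load_knn_datastore': False, 'use_knn_datastore': True, 'dstore_fp16': True, 'k': 1, 'probe': 32,
    'knn_sim_func': 'do_not_recomp_l2', 'use_gpu_to_search': True, 'move_dstore_to_mem': True, 'no_load_keys': True,
    'knn_temperature_type': 'fix', 'knn_temperature_value': 100, 'knn_lambda_temperature_value': 100}" \
    | tee "$OUTPUT_PATH"/generate_${domain}_dev.txt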

Also, the THUMT-based implementation is the one we officially maintain, and all the scores in our paper were obtained with it. You can test your processed data within the THUMT framework.

mczhou178 commented 9 months ago

Thank you very much. I will send it to you by email tomorrow, since my computer is not with me right now. Thank you again.