Mchun-zhou opened this issue 1 year ago
Hi, there! What are the reproducible results using our provided binarized data?
Hi, the results in the paper can indeed be reproduced with the binarized data you provided (including test_tm), but not with the data I processed myself. I am not sure whether this is due to my environment or to a problem in my preprocessing (tokenization and BPE). Below are the scripts I use, together with the dictionary and BPE codes of the pre-trained model: 1. Tokenization and BPE
DOMAIN=$1
DATADIR=/home/npc/sk-mt-fairseq/process-data/multi_domain/$DOMAIN
BPEDATA=/home/npc/sk-mt-fairseq/process-data/bpe-data/$DOMAIN
HOME=/home/npc/sk-mt-fairseq/process-data
if [ -z $HOME ]
then
echo "HOME var is empty, please set it"
exit 1
fi
SCRIPTS=$HOME/mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
FASTBPE=$HOME/fastBPE
BPECODES=/home/npc/base-model/ende30k.fastbpe.code
VOCAB=/home/npc/base-model/dict.en.txt
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit 1
fi
mkdir -p ${BPEDATA}
filede=${DATADIR}/train.de
fileen=${DATADIR}/train.en
# normalize punctuation, strip non-printing characters and tokenize the training data
cat $filede | \
perl $NORM_PUNC -l de | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l de > ${BPEDATA}/train.tok.de
cat $fileen | \
perl $NORM_PUNC -l en | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l en > ${BPEDATA}/train.tok.en
# apply the pre-trained BPE codes with the pre-trained vocabulary
$FASTBPE/fast applybpe ${BPEDATA}/train.bpe.de ${BPEDATA}/train.tok.de $BPECODES $VOCAB
$FASTBPE/fast applybpe ${BPEDATA}/train.bpe.en ${BPEDATA}/train.tok.en $BPECODES $VOCAB
#perl $CLEAN -ratio 1.5 ${BPEDATA}/train.bpe1 de en ${BPEDATA}/train.bpe 1 250
for split in dev test
do
filede=${DATADIR}/${split}.de
fileen=${DATADIR}/${split}.en
cat $filede | \
perl $TOKENIZER -threads 8 -a -l de > ${BPEDATA}/${split}.tok.de
cat $fileen | \
perl $TOKENIZER -threads 8 -a -l en > ${BPEDATA}/${split}.tok.en
$FASTBPE/fast applybpe ${BPEDATA}/${split}.bpe.de ${BPEDATA}/${split}.tok.de $BPECODES $VOCAB
$FASTBPE/fast applybpe ${BPEDATA}/${split}.bpe.en ${BPEDATA}/${split}.tok.en $BPECODES $VOCAB
done
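I run this script once per domain; assuming it is saved as preprocess.sh (the file name is just an example), the call is e.g.:
bash preprocess.sh medical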
2. Fairseq binarization
domain=medical
python /home/npc/sk-mt-fairseq/fairseq_cli/preprocess.py \
    --source-lang de --target-lang en \
    --trainpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/train.bpe \
    --validpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/dev.bpe \
    --testpref /home/npc/sk-mt-fairseq/process-data/bpe-data/$domain/test.bpe \
    --destdir /home/npc/sk-mt-fairseq/process-data/data-bin/$domain \
    --srcdict /home/npc/base-model/dict.de.txt \
    --joined-dictionary
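After this step, the destination directory for the domain should contain roughly the following files (standard fairseq naming; exact names depend on the language pair and splits):
dict.de.txt  dict.en.txt  preprocess.log
train.de-en.de.bin  train.de-en.de.idx  train.de-en.en.bin  train.de-en.en.idx
valid.de-en.*  test.de-en.*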
I am not sure whether this is related to the elasticsearch version or to a problem in my data processing pipeline. Could you provide your data processing script? Thanks
Hello, could you provide a script for data processing? Thanks
Sorry for the late response. Your data processing is almost the same as ours. We are working on finding what resulted in the mismatch.
Thank you for your reply. I have tried many approaches, but my results are still quite different from those obtained with the data you provided. If you find the reason, I look forward to your reply and an update. Thank you again.
Hello, could you kindly send your processed data with the corresponding retrieval samples to my email? The textual data is preferred. My email address is dirkiedye@gmail.com.
I have noticed that the reproduced results are evaluated on the test set, but the scores reported in our paper are evaluated on the development set, which is used for parameter selection. Please refer to Section A.3, HYPER-PARAMETERS SELECTION.
Also, THUMT is the officially maintained framework, and all the scores in our paper were computed with it. You can evaluate your processed data within the THUMT framework.
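As a rough sanity check outside THUMT, a detokenized dev-set hypothesis can also be scored with sacrebleu (the file names below are placeholders; the scores in the paper follow THUMT's evaluation):
sacrebleu dev.en -i dev.hyp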
Thank you very much. I will send it to you by email tomorrow, because I do not have access to my computer right now. Thank you very much again.
Thanks for your nice work! I am trying to reproduce the results on the multi-domain datasets, but the results I get are quite different from those reported in the paper. When I follow your guidance and reproduce the experiment starting from the text retrieval step, the final results obtained with my retrieval-processed test_tm do not reach those obtained with the test_tm you provided. The hyper-parameters are consistent with your paper. The following are my results; except for the koran domain, the results in the other domains fall well short of those in the paper:
The following are my scripts for reproducing the law domain; please check whether there is any problem. I use pytorch=1.12.0, python=3.8, numpy=1.23.0, elasticsearch=7.0.0, faiss-gpu=1.7.3.
My BPE processing and fairseq binarization are also consistent with the log information of the data you provided.
1. Retrieval (a simplified sketch of this step is given after this list)
2. Process
3. Inference with SK-MT
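Since the screenshots may be hard to read, here is a simplified sketch of what my retrieval step does (the index name, field names, file paths and top-k value are illustrative only, and the real script batches the queries):
from elasticsearch import Elasticsearch  # elasticsearch==7.0.0 client

es = Elasticsearch()  # assumes a local node on localhost:9200
index = "law_train_tm"  # illustrative index name

# index the training pairs, with the source side as the searchable field
with open("train.bpe.de") as fsrc, open("train.bpe.en") as ftgt:
    for i, (s, t) in enumerate(zip(fsrc, ftgt)):
        es.index(index=index, id=i, body={"src": s.strip(), "tgt": t.strip()})
es.indices.refresh(index=index)

# BM25 retrieval of the top-k most similar training pairs for one test sentence
def retrieve(query, k=2):
    res = es.search(index=index, body={"query": {"match": {"src": query}}, "size": k})
    return [(h["_source"]["src"], h["_source"]["tgt"]) for h in res["hits"]["hits"]]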
Could you please help me find out what the problem is? Looking forward to your reply.