Open YangXusheng-yxs opened 2 years ago
I encountered the same problem when trying to train monotonic models for the simultaneous_translation examples. I am using cuda-10.2 and gcc-7.5.0. After git-cloning cub into $CUDA_HOME/targets/x86_64-linux/include/cub, I get a lot of cub errors:
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/cache_modified_output_iterator.cuh(141): warning: parsing restarts here after previous syntax error
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/discard_output_iterator.cuh(74): error: a class or namespace qualified name is required
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/discard_output_iterator.cuh(74): error: qualified name is not allowed
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/discard_output_iterator.cuh(74): error: expected a ";"
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/discard_output_iterator.cuh(79): warning: parsing restarts here after previous syntax error
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_obj_input_iterator.cuh(120): error: a class or namespace qualified name is required
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_obj_input_iterator.cuh(120): error: qualified name is not allowed
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_obj_input_iterator.cuh(120): error: expected a ";"
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_obj_input_iterator.cuh(125): warning: parsing restarts here after previous syntax error
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_ref_input_iterator.cuh(261): error: a class or namespace qualified name is required
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_ref_input_iterator.cuh(261): error: qualified name is not allowed
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_ref_input_iterator.cuh(261): error: expected a ";"
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/tex_ref_input_iterator.cuh(266): warning: parsing restarts here after previous syntax error
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/transform_input_iterator.cuh(126): error: a class or namespace qualified name is required
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/transform_input_iterator.cuh(126): error: qualified name is not allowed
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/transform_input_iterator.cuh(126): error: expected a ";"
/home/liumengge/cuda-10.2/bin/../targets/x86_64-linux/include/cub/iterator/transform_input_iterator.cuh(131): warning: parsing restarts here after previous syntax error
46 errors detected in the compilation of "/tmp/tmpxft_0002a1b3_00000000-6_alignment_train_kernel.cpp1.ii".
Using CUDA 11.3 solves the "cub not found" error, because the cub module is installed together with CUDA.
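Before rebuilding, it can help to check whether the toolkit you are pointing at already ships the CUB headers. The sketch below simply looks for cub/cub.cuh under CUDA_HOME. The helper name find_cub_header is made up for illustration, and the two candidate include directories are assumptions based on the paths quoted in this thread, not an exhaustive list.

```python
import os
from pathlib import Path


def find_cub_header(cuda_home):
    """Return the path to cub/cub.cuh under cuda_home, or None if it is absent.

    Hypothetical helper: the candidate directories are assumptions taken from
    the paths quoted in this thread.
    """
    root = Path(cuda_home)
    candidates = [
        root / "include" / "cub" / "cub.cuh",
        root / "targets" / "x86_64-linux" / "include" / "cub" / "cub.cuh",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Fall back to a common default location if CUDA_HOME is not set.
    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    header = find_cub_header(cuda_home)
    print(header if header else f"cub/cub.cuh not found under {cuda_home}")
```

If the header is reported as missing, the build will most likely hit the same "cub not found" error described above.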
I ran into the same problem when I ran the example in https://github.com/pytorch/fairseq/blob/main/examples/simultaneous_translation/docs/ende-mma.md, and my CUDA version is 11.2.
I found that git reset --hard dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3 solves this problem.
Instead of git reset, git checkout is better:
git checkout dd3bd3c0497ae9a7ae7364404a6b0a4c501780b3
Afterwards, run git checkout main to go back to the main branch.
Some useful tips I tried:
I thought I'd share my workaround for this issue:
Step 1: In utils/monotonic_attention.py, add the following lines:
import torch.utils.cpp_extension
from pathlib import Path
module_path = Path(__file__).parent.parent.parent / "operators"
build_path = module_path / "build"
build_path.mkdir(exist_ok=True)
alignment_train_cpu_binding = torch.utils.cpp_extension.load(
    "alignment_train_cpu_binding",
    sources=[
        module_path / "alignment_train_cpu.cpp",
    ],
    build_directory=build_path.as_posix()
)
alignment_train_cuda_binding = torch.utils.cpp_extension.load(
    "alignment_train_cuda_binding",
    sources=[
        module_path / "alignment_train_cuda.cpp",
        module_path / "alignment_train_kernel.cu"
    ],
    build_directory=build_path.as_posix()
)
Step 2: In the same file, replace lines 45 to 52 from here with:
alpha = p_choose.new_zeros([bsz, tgt_len, src_len])
if p_choose.is_cuda:
    p_choose = p_choose.contiguous()
    alignment_train_cuda_binding.alignment_train_cuda(p_choose, alpha, eps)
    # from alignment_train_cuda_binding import alignment_train_cuda as alignment_train
else:
    alignment_train_cpu_binding.alignment_train_cpu(p_choose, alpha, eps)
    # from alignment_train_cpu_binding import alignment_train_cpu as alignment_train
# alpha = p_choose.new_zeros([bsz, tgt_len, src_len])
# alignment_train(p_choose, alpha, eps)
This uses ninja to compile the extensions the first time the module is imported.
Hope this helps.
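A note for anyone adapting this workaround: torch.utils.cpp_extension.load is documented as taking its sources as strings, so it may be safer to convert the Path objects first, and to build the CUDA binding only when it is actually needed. The following is a minimal sketch under those assumptions. extension_sources and load_binding are hypothetical helper names; the file names are the ones used in the workaround above.

```python
from pathlib import Path


def extension_sources(module_path, use_cuda):
    """Return the source files for the CPU or CUDA binding as plain strings.

    Hypothetical helper; file names are copied from the workaround above.
    """
    module_path = Path(module_path)
    if use_cuda:
        names = ["alignment_train_cuda.cpp", "alignment_train_kernel.cu"]
    else:
        names = ["alignment_train_cpu.cpp"]
    return [(module_path / name).as_posix() for name in names]


def load_binding(module_path, use_cuda):
    """Compile (on first call) and load one binding with PyTorch's JIT loader."""
    # Imported lazily so extension_sources can be used without PyTorch installed.
    import torch.utils.cpp_extension

    build_path = Path(module_path) / "build"
    build_path.mkdir(exist_ok=True)
    name = "alignment_train_cuda_binding" if use_cuda else "alignment_train_cpu_binding"
    return torch.utils.cpp_extension.load(
        name,
        sources=extension_sources(module_path, use_cuda),
        build_directory=build_path.as_posix(),
    )
```

Calling load_binding(module_path, use_cuda=p_choose.is_cuda) at the point of use would avoid compiling the .cu file on a CPU-only machine, at the cost of a slower first call.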
Are you using fairseq to reproduce MMA now? I ran into some problems. Could you please give me some help? Or ... Discuss.
SimulEval is the package for evaluating the model, and it reports AP, AL, and BLEU. Following the ende-mma guide, I trained an MMA model, but I don't know how to evaluate it. For the other guide, enja-waitk, I don't know how to prepare the dataset.
I guess you are following the same guide too, so do you know how to evaluate it?
Hi, yes, I'm trying to reproduce MMA on SimulST. So far waitk and MMA-IL look good, but MMA-H performs poorly.
First, I'd suggest looking at my PR #4247, which includes fixes for some breaking bugs in the current fairseq code.
I assume you are working on SimulMT, so you'd need the text-to-text agent here. Next, if your target language is not Japanese (Ja), for example German (De) with sentencepiece subwords, then you need to update the units_to_segment(self, units, states) method here so that it merges the subwords. Here is my routine for merging:
def units_to_segment(self, unit_queue, states):
    """
    queue: stores bpe tokens.
    server: accept words.
    Therefore, we need merge subwords into word. we find the first
    subword that starts with BOW_PREFIX, then merge with subwords
    prior to this subword, remove them from queue, send to server.
    """
    # Merge sub word to full word.
    tgt_dict = self.dict["tgt"]
    # if segment starts with eos, send EOS
    if tgt_dict.eos() == unit_queue[0]:
        return DEFAULT_EOS
    # if force finish, there will be None's
    segment = []
    if None in unit_queue.value:
        unit_queue.value.remove(None)
    src_len = len(states.units.source)
    if (
        (len(unit_queue) > 0 and tgt_dict.eos() == unit_queue[-1])
        or len(states.units.target) > self.max_len
    ):
        hyp = tgt_dict.string(
            unit_queue,
            "sentencepiece",
        )
        if self.pre_tokenizer is not None:
            hyp = self.pre_tokenizer.decode(hyp)
        return [hyp] + [DEFAULT_EOS]
    for index in unit_queue:
        token = tgt_dict.string([index])
        if token.startswith(BOW_PREFIX):
            if len(segment) == 0:
                segment += [token.replace(BOW_PREFIX, "")]
            else:
                for j in range(len(segment)):
                    unit_queue.pop()
                string_to_return = ["".join(segment)]
                if tgt_dict.eos() == unit_queue[0]:
                    string_to_return += [DEFAULT_EOS]
                return string_to_return
        else:
            segment += [token.replace(BOW_PREFIX, "")]
    return None
You may also need to set self.pre_tokenizer to task.pre_tokenizer. (This would matter if you also use the Moses tokenizer.)
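To make the merging idea easier to follow outside the agent, here is a stripped-down sketch that works on plain token strings. It is not the agent code: merge_subwords is a hypothetical standalone function, and it assumes the usual sentencepiece convention that a word-initial piece starts with the "▁" marker (U+2581), which is what BOW_PREFIX is taken to be here.

```python
BOW_PREFIX = "\u2581"  # assumed sentencepiece word-boundary marker


def merge_subwords(tokens):
    """Merge sentencepiece pieces into whole words.

    A piece that starts with BOW_PREFIX begins a new word; any other piece is
    appended to the word currently being built.
    """
    words = []
    current = []
    for token in tokens:
        if token.startswith(BOW_PREFIX) and current:
            # A new word starts here, so the pieces collected so far are complete.
            words.append("".join(current))
            current = []
        current.append(token.replace(BOW_PREFIX, ""))
    if current:
        words.append("".join(current))
    return words


print(merge_subwords(["\u2581Das", "\u2581Kom", "ite", "e", "\u2581empf", "ahl"]))
```

The streaming version above differs in one important way: it only emits a word once the next word-initial piece has arrived, because until then it cannot know whether the current word is complete.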
After setting up the agent, and installing SimulEval as in the EnJa guide, you can evaluate by:
simuleval \
--source ${SRC_FILE} \
--target ${TGT_FILE} \
--data-bin ${DATA_BIN} \
--sacrebleu-tokenizer 13a \
--eval-latency-unit word \
--src-splitter-path ${SRC_SPM_PATH} \
--agent simul_t2t_enja.py \
--model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
Hello Zhang (@George0828Zhang), I tried evaluating with the detokenized dataset and finally got a result, but it is very bad. To reproduce: first I prepared the data with Moses and sentencepiece on the WMT14 en-de dataset. I followed the enja example and trained my model with wait-k for about 50 epochs.
For evaluation I used the raw (detokenized) data and the enja agent.
This is part of my instance.log from evaluating the model:
"index": 2734 "prediction": "Das Kom ite e empf ahl jedoch die FA A - Z ul ass ungs befug nis , die Pil oten bei der An land ung von Ger \u00e4ten in geringer Sicht zu bestellen . " "prediction_length": 36, "reference": "Das Komitee empfahl der FAA allerdings , es Piloten zu erlauben , bei Instrumentenlandung unter schlechten Sichtverh\u00e4ltnissen das Abschalten der Ger\u00e4te durch die Passagiere anzuordnen .", "source": "However , the committee recommended the FAA allow pilots to order passengers to shut off devices during instrument landings in low visibility .", "source_length": 23, "reference_length": 26, "metric": {"sentence_bleu": 1.5342333164810604, "latency": {"AL": 2.0740742683410645, "AP": 0.6759259104728699, "DAL": 5.527776718139648}}}
The scores are also very poor, and AL is negative, which is strange. I don't know why that happened:
{
  "Quality": {
    "BLEU": 7.895211793562936
  },
  "Latency": {
    "AL": -0.15109748336804218,
    "AP": 0.49805060968859205,
    "DAL": 3.3051336411581365
  }
}
So my question is:
Thank you!
I found the reason! I was using units_to_segment from the enja agent, when I should have been using your version above, which merges the subwords into words before sending them to the server.
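One possible way to see why unmerged subwords can push AL below zero (an illustration only, not something confirmed in this thread): Average Lagging compares the delay of each target unit with an ideal diagonal scaled by the target/source length ratio. If the agent emits many more units than the scorer expects, the later units fall far below that diagonal. The sketch below follows the usual definition AL = (1/τ) Σ (g(t) − (t−1)/γ), with γ = target_len / source_len and τ the first target position emitted after the whole source has been read. The function name and the toy delays are hypothetical, and SimulEval's own implementation may differ in detail.

```python
def average_lagging(delays, src_len, tgt_len=None):
    """Average Lagging for one sentence.

    delays[t] is the number of source units read before target unit t was
    emitted (delays must be non-empty). tgt_len defaults to len(delays).
    """
    if tgt_len is None:
        tgt_len = len(delays)
    gamma = tgt_len / src_len
    total = 0.0
    tau = 0
    for t, g in enumerate(delays):  # t is 0-based, i.e. (t - 1) in the formula
        total += g - t / gamma
        tau = t + 1
        if g >= src_len:  # first unit emitted after the full source was read
            break
    return total / tau


# Toy wait-1 policy on a 4-word source.
print(average_lagging([1, 2, 3, 4], src_len=4))

# Same reading pattern, but every word is emitted as three separate units while
# the scorer still assumes a 4-unit target (hypothetical numbers).
subword_delays = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
print(average_lagging(subword_delays, src_len=4, tgt_len=4))
```

In this toy setup the second value comes out negative, which is at least consistent with the negative AL reported above once subword pieces were being sent to the server as if they were words.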
Hello Zhang, MMA-H seems to go wrong while generating: it won't generate eos () and the result is poor. Have you found a solution?
By the way, how many epochs did you train MMA-hard and MMA-waitk for?
I reverted the expected alignment to an earlier version, and it is working fine now.
I'm training s2t, so it might be different, but training for 20k steps is already good. Note that a larger batch size is very important; the batch size given in the examples is not enough IMO. You should definitely use --update-freq with 4 or 8, or directly increase the batch size. Also, if the gradient is unstable, use a lower lr, e.g. 1e-4.
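As a rough illustration of why gradient accumulation helps here: with fairseq's --max-tokens and --update-freq options, the number of tokens contributing to one optimizer step grows roughly as the product of tokens per GPU batch, number of GPUs, and the accumulation factor. The helper name and the numbers below are hypothetical, not values taken from the examples.

```python
def effective_batch_tokens(max_tokens, num_gpus, update_freq):
    """Approximate upper bound on tokens contributing to one optimizer step."""
    return max_tokens * num_gpus * update_freq


# Hypothetical setup: one GPU with at most 4000 tokens per batch.
for update_freq in (1, 4, 8):
    print(update_freq, effective_batch_tokens(4000, 1, update_freq))
```

So on a single GPU, accumulating over 8 batches gives roughly the same tokens per update as running on 8 GPUs without accumulation, at the cost of fewer updates per hour.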
On my PR, reverting monotonic_attention.py should work fine. I'll add this change to my PR later.
George
On Fri, Mar 18, 2022 at 4:10 PM Eric Liang @.***> wrote:
"I reverted the expected alignment to earlier version (https://github.com/pytorch/fairseq/blob/98d638c70cdbe751153c10fc571c34beac228347/examples/simultaneous_translation/utils/monotonic_attention.py) and it is working fine now."
I think I should confirm this. Should I just revert monotonic_attention.py on your PR (https://github.com/George0828Zhang/fairseq/tree/fix_mma), or git checkout the whole of fairseq to 98d638c (https://github.com/pytorch/fairseq/commit/98d638c70cdbe751153c10fc571c34beac228347)?
Bug
Building 'alignment_train_cuda_binding' extension, CUB building problem -- Simultaneous Machine Translation Example
To Reproduce & Code sample
There is an example of how to use the MMA model (Simultaneous Machine Translation) at: https://github.com/pytorch/fairseq/blob/main/examples/simultaneous_translation/docs/ende-mma.md
First, I followed the main page to install fairseq, and the installation succeeded.
Then I tried to run the example code:
But after commit it no longer works and fails with an error, see error
(I noticed that this is a new module that was updated 30 days ago.)
To build the relevant extension package so that 'from alignment_train_cuda_binding import alignment_train_cuda' works, I set the CUDA_HOME path in ~/.bashrc and ran the code in a terminal,
but building the 'alignment_train_cuda_binding' extension failed, see error
I googled this issue, and someone said that the CUB package is probably missing.
So I git-cloned the latest version of CUB and put it in the path: '/usr/local/cuda-10.2/targets/x86_64-linux/include/cub'
I ran the code in the terminal again,
but a new problem appeared, see error
..............
error: command '/usr/local/cuda-10.2/bin/nvcc' failed with exit status 1