bytedance / neurst

Neural end-to-end Speech Translation Toolkit

how to use ASR ctcbert csv dataset? #35

Closed lidh15 closed 2 years ago

lidh15 commented 3 years ago

What is the bert item supposed to contain? I tried to generate a vector representation of my input utterance, but it is not correct.

dqqcasia commented 3 years ago

@lidh15 Hi, thank you for your interest in our work. Could you share the error message and your input?

One utterance (row) of the .csv file with the ctcbert label for the TED_EnZh dataset is shown below:

data/ted_en_zh/tst2015/95119_0000000-0003561.wav 这 是 我 的 侄@@ 女 , 斯特@@ 拉 。 this is my niece stella data/ted_en_zh/processed_berttok/bert_feature/tst2015/base-tst2015-output.jsonl0000

The data in the file ("base-tst2015-output.jsonl0000") is the representation of the corresponding source-language sentence, extracted by BERT (https://github.com/google-research/bert). Its shape is [transcript_seq_len, bert_hidden_len].
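As a rough way to check that a feature file has the expected shape, here is a minimal sketch. It assumes the file was written by `extract_features.py` from the google-research/bert repository, where each line is a JSON object with a `features` list and each feature carries a `token` and a list of `layers` with `index` and `values`. Whether NeurST expects the `[CLS]`/`[SEP]` rows to be kept, and which layer it reads, are not stated in this thread, so `layer_index` and `drop_special_tokens` are hypothetical knobs, and the helper names are made up for illustration.

```python
import json


def parse_bert_jsonl(lines, layer_index=-1, drop_special_tokens=False):
    """Turn jsonl lines into one [seq_len][hidden_size] nested list per line.

    Assumes the per-line layout written by google-research/bert's
    extract_features.py; this is not confirmed to be exactly what NeurST reads.
    """
    matrices = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows = []
        for feat in record["features"]:
            if drop_special_tokens and feat["token"] in ("[CLS]", "[SEP]"):
                continue
            # Pick the requested layer among those that were extracted.
            layer = next(l for l in feat["layers"] if l["index"] == layer_index)
            rows.append(layer["values"])
        matrices.append(rows)
    return matrices


def matrix_shape(matrix):
    """Return (seq_len, hidden_size) for one nested-list matrix."""
    return (len(matrix), len(matrix[0]) if matrix else 0)


def load_bert_features(jsonl_path, **kwargs):
    """Read a jsonl file from disk and parse it with parse_bert_jsonl."""
    with open(jsonl_path, encoding="utf-8") as f:
        return parse_bert_jsonl(f, **kwargs)
```

Calling `matrix_shape(load_bert_features(path)[0])` on your own file should give a pair of the form (transcript_seq_len, bert_hidden_len); if the first number does not match the tokenized transcript length, the tokenization or the special-token handling is the likely cause.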

If you have any questions, you can email me at dongqianqian@bytedance.com or dongqianqian2016@ia.ac.cn.

lidh15 commented 3 years ago

Hi, I was using PyTorch BERT previously, and now with TF BERT I have the jsonl files. However, I found that 34 of the 835 wavs in dev2010 are damaged. I thought researchers use dev2010 as the dev set and tst2015 as the test set, am I right? Is it a known issue that some of the dev2010 files are damaged?

dqqcasia commented 2 years ago

@lidh15 Hi, I guess you are talking about the TED_EnZh dataset released by Liu et al. (2019). Yes, the originally released version has this problem, and I also used that version. The author later updated it once and repaired the damaged files.
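To list which files in a local copy are affected, a minimal sketch using only the Python standard library is given below. It treats a file as damaged if the `wave` module cannot open it, if it has no frames, or if it holds fewer bytes than its header declares. That definition is an assumption, since the thread does not say how the 34 files are damaged, and `wave` only handles PCM-encoded files. The function names and the example directory are hypothetical.

```python
import wave
from pathlib import Path


def is_readable_wav(path):
    """Return True if the file opens as PCM WAV and its data matches the header."""
    try:
        with wave.open(str(path), "rb") as w:
            n_frames = w.getnframes()
            if n_frames == 0:
                return False
            data = w.readframes(n_frames)
            # A truncated file yields fewer bytes than the header promises.
            expected = n_frames * w.getsampwidth() * w.getnchannels()
            return len(data) == expected
    except (wave.Error, EOFError, OSError):
        return False


def find_damaged_wavs(root):
    """Return sorted paths of .wav files under root that fail the check above."""
    return sorted(
        str(p) for p in Path(root).rglob("*.wav") if not is_readable_wav(p)
    )
```

For example, `find_damaged_wavs("data/ted_en_zh/dev2010")` (the directory name here is a guess based on the tst2015 path shown earlier) would return the paths to exclude or re-download.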

lidh15 commented 2 years ago

I don't know where to find the updated dataset. Could you please share a link, or explain how to fix the files? Before that, though, a really important question: is there a PyTorch or TF2 version? TF1 is far from my daily environment, and it is really difficult to make it work smoothly alongside my TF2 and PyTorch setup. I tried some TF1-to-TF2 conversion scripts, but they basically didn't work.

dqqcasia commented 2 years ago

@lidh15 Hi, for questions about the dataset, you can consult the original author. For the code version, we are working on the tf2 version now. We will let you know when it is done. Good luck!