SpeechColab / Leaderboard

SpeechIO Leaderboard: a large, robust, comprehensive, benchmarking platform for Automatic Speech Recognition.

Modify aliyun_ftasr sdk #36

Closed. Lizerui9926 closed this 2 years ago

Lizerui9926 commented 2 years ago

Change the aliyun_ftasr SDK calls to run in parallel to improve efficiency.

Lizerui9926 commented 2 years ago

Thank you for looking into this PR.

  1. Sorry, I don't quite understand your first question. We encapsulate trans.write in a function so that it can be called in parallel. We changed the code from single-process sequential execution to multi-process parallel execution in order to improve recognition speed.
  2. Our recording file recognition is an offline recognition service, and there is no limit on the number of concurrent requests. In the new commit, we have limited the number of processes to half the number of CPU cores.

We have performed multiple validations of our submission on the UNLOCKED test sets to ensure proper execution.

dophist commented 2 years ago

I just tried to test the already-merged aliyun_ftasr_api_zh, got:

log.SBI:

Invalid line: e5oipIfM49I__20190201_CCTV_1  Error: accessKeyID and ecsUrl are both empty
Invalid line: e5oipIfM49I__20190201_CCTV_104    Error: accessKeyID and ecsUrl are both empty

your submission is not working. Please debug, fix, and test.

For this particular PR (parallel improvement), my questions are:

  1. The trans variable in asr() seems to be undefined: it is neither in global scope nor in the local scope of asr(), and I don't see a lambda capturing it. I might be wrong, but please explain.
  2. You are writing to a single file from a multiprocess program. Even if you flush() right away, there might still be IO contention. Without locks or queues, the results can get messed up. If they don't, that's just luck. Each benchmark costs us ~300 Yuan; we can't rely on luck.
  3. During benchmarking, almost 10 models run in parallel, and Aliyun is one of them. It's problematic for one model to use half of the CPUs. Please make the number of CPUs configurable, with a default value of 5.

dophist commented 2 years ago

And one more thing, does this PR mean that your submission will fork thousands of subprocesses during the benchmarking, given a test set with thousands of utterances?

I would not do this in practice. You may consider splitting the test set into 10 parts or so, running each part in a separate process with its own IO traffic, and merging the results in the end.
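A minimal sketch of the split-and-merge approach suggested above, not the code in this PR. The names recognize, run_part, split, merge_parts, and run_sharded are hypothetical, and recognize is only a stand-in for the real ASR call. Each worker writes to its own part file, so no locks or queues are needed, and the number of processes is a parameter with a default of 5.

```python
import multiprocessing as mp
import os
import tempfile


def recognize(wav_path):
    """Hypothetical stand-in for the real ASR call; returns a transcript."""
    return "text for " + os.path.basename(wav_path)


def run_part(job):
    """Process one shard and write its results to a private part file."""
    part_id, items, out_dir = job
    part_path = os.path.join(out_dir, "part_%03d.txt" % part_id)
    with open(part_path, "w", encoding="utf-8") as f:
        for key, wav_path in items:
            f.write("%s\t%s\n" % (key, recognize(wav_path)))
    return part_path


def split(items, n):
    """Round-robin split into at most n non-empty shards."""
    shards = [items[i::n] for i in range(n)]
    return [s for s in shards if s]


def merge_parts(part_paths, out_path):
    """Concatenate the part files into one result file, in shard order."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in part_paths:
            with open(path, encoding="utf-8") as f:
                out.write(f.read())
    return out_path


def run_sharded(items, out_path, nj=5):
    """Run at most nj worker processes, one shard each, then merge."""
    out_dir = tempfile.mkdtemp()
    jobs = [(i, shard, out_dir) for i, shard in enumerate(split(items, nj))]
    with mp.Pool(processes=nj) as pool:
        part_paths = pool.map(run_part, jobs)
    return merge_parts(part_paths, out_path)
```

With this layout the process count is bounded by nj regardless of how many utterances the test set has. Note that the round-robin split means the merged file is ordered by shard, not by the original utterance order.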

Lizerui9926 commented 2 years ago

The "Error: accessKeyID and ecsUrl are both empty" problem may be caused by the oss configuration file not being generated in the home directory. (The default path of the configuration file is /home/user/.ossutilconfig.) We will first roll back to the code that has been merged, and fix and debug based on it.

Lizerui9926 commented 2 years ago

We modified the pipeline to specify the configuration file path, and added the -c option to pass the configuration file every time the ossutil64 command is used. We verified this on our machine: the oss configuration file was generated in the specified path, and the recognition results were returned successfully.

Lizerui9926 commented 2 years ago

Thanks for your suggestion. We now split the test set into several parts, write each part's asr results to its own txt file, and finally merge them into _rawrec.txt.

Lizerui9926 commented 2 years ago

Now we place the _$ossscp(_${dir}/wavoss.scp) and other temp files in $dir, and write the myossconfig file to $dir as well. We also added the ossutil64 binary and its MD5 file. If the MD5 check fails, the pipeline exits.
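For illustration, a check of this kind could look roughly like the sketch below. It assumes the MD5 file uses the md5sum-style layout (hex digest first, then the file name); the function names md5_of, read_expected, and verify_or_exit are hypothetical and are not taken from the PR.

```python
import hashlib
import sys


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected(md5_path):
    """Read the first token of an md5sum-style file: '<hex>  <filename>'."""
    with open(md5_path, encoding="utf-8") as f:
        return f.read().split()[0].lower()


def verify_or_exit(binary_path, md5_path):
    """Stop the pipeline with a non-zero status if the digests differ."""
    if md5_of(binary_path) != read_expected(md5_path):
        sys.exit("MD5 mismatch for %s" % binary_path)
```

Calling verify_or_exit before the first ossutil64 invocation means a corrupted or replaced binary stops the run before any benchmark cost is incurred.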

dophist commented 2 years ago

Is there a way to avoid generating myossconfig? For example, it would be better to pass the credentials (AK and AK secret) directly through ossutil's arguments. I don't want the result dir to contain any "secret" stuff; its contents should be shareable with the public someday. Thanks

Lizerui9926 commented 2 years ago

The default path of myossconfig is /home/user/.ossutilconfig, but if the path is not specified, the default config file may not be generated, resulting in the "Error: accessKeyID and ecsUrl are both empty" problem. Maybe we can delete myossconfig each time the pipeline finishes. Is this feasible? Thanks.

dophist commented 2 years ago

Does ossutil have options for AK and AK secret?

dophist commented 2 years ago

One more thing: I noticed that the oss://speechiotest/ bucket is used across all tests. Can we use a unique bucket name instead, so that different benchmarks never have storage overlaps? Maybe something like oss://speechiotest_unique_id, where unique_id is generated from the test set id, or is even a uuid-like identifier seeded by system time.
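A minimal sketch of how such a name could be generated. The function unique_bucket is hypothetical. It joins the parts with hyphens instead of underscores, on the assumption that the storage service restricts bucket names to lowercase letters, digits, and hyphens; adjust the separator if that assumption does not hold.

```python
import re
import time
import uuid


def unique_bucket(prefix="speechiotest", test_set_id=None):
    """Build a per-benchmark bucket URL from a test set id or time + uuid."""
    if test_set_id is not None:
        # Normalise the id to lowercase letters, digits, and hyphens.
        suffix = re.sub(r"[^a-z0-9]+", "-", test_set_id.lower()).strip("-")
    else:
        # System time plus a short random part, so two runs started in
        # the same second still get different names.
        suffix = "%d-%s" % (int(time.time()), uuid.uuid4().hex[:8])
    return "oss://%s-%s" % (prefix, suffix)
```

Deriving the name from the test set id makes reruns of the same test set reuse one bucket, while the time-plus-uuid form gives every run its own bucket.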

Lizerui9926 commented 2 years ago

Thanks for your suggestion. We added the system time as unique_id and modified the bucket name accordingly. We also removed the ossutil64 config command and added the -i and -k options for the AK and AK secret. We verified this on our machine and it works.
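As a rough sketch of the credentials change described here, the command line can be assembled from environment variables so that nothing secret is written into the result dir. The -i and -k options are the ones named in this thread; the helper ossutil_cmd, the environment variable names, and the ./ossutil64 path are assumptions made for illustration.

```python
import os


def ossutil_cmd(subcommand, *operands, env=None):
    """Build an ossutil64 argv with the AK and AK secret passed as options."""
    env = os.environ if env is None else env
    # Hypothetical variable names; use whatever the pipeline exports.
    access_key_id = env.get("OSS_ACCESS_KEY_ID")
    access_key_secret = env.get("OSS_ACCESS_KEY_SECRET")
    if not access_key_id or not access_key_secret:
        raise RuntimeError("OSS credentials are not set in the environment")
    return ["./ossutil64", subcommand, *operands,
            "-i", access_key_id, "-k", access_key_secret]
```

The returned list can be handed to subprocess.run without a shell. One trade-off to keep in mind: arguments passed this way are visible in the process list on the machine while the command runs, so the list should not be echoed into logs that end up in the result dir.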

dophist commented 2 years ago

Thanks. Seems good to me now. I'll merge this PR right before this season's benchmark (probably on Aug 20th). Feel free to improve or fix this PR before that.