Closed Lizerui9926 closed 2 years ago
Thank you for looking into this PR.
1. Sorry, I don't quite understand your first question. We wrapped the trans.write call in a function so that it can be called in parallel. We changed our code from single-process sequential execution to multi-process parallel execution in order to improve recognition speed.
2. Our recording-file recognition is an offline recognition service, and it has no limit on the number of concurrent requests. In the new commit we have also limited the number of worker processes to half the number of CPU cores.
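A minimal sketch of the bounded worker pool described above, assuming Python's multiprocessing module and assuming trans is an output file handle opened by the parent process. The names worker_count and recognize_all, and the placeholder body of asr, are hypothetical and are not the code in this PR. Workers return their result and only the parent writes to trans, so asr never needs to reference trans at all.

```python
import os
from multiprocessing import Pool


def worker_count(cpu_cores=None):
    """Half the CPU cores, but never fewer than one worker."""
    cores = cpu_cores if cpu_cores is not None else (os.cpu_count() or 1)
    return max(1, cores // 2)


def asr(item):
    """Hypothetical stand-in for the real recognition call.

    Takes a (key, wav_path) pair and returns (key, text). No SDK call
    is made here; the returned text is only a placeholder.
    """
    key, wav_path = item
    return key, f"<transcript of {wav_path}>"


def recognize_all(items, trans, processes=None):
    """Run asr() over items in a bounded pool; the parent writes results."""
    with Pool(processes=processes or worker_count()) as pool:
        # imap yields results in input order, so the output stays sorted
        # the same way as the input list.
        for key, text in pool.imap(asr, items):
            trans.write(f"{key}\t{text}\n")
```

With this shape the number of live subprocesses is fixed by the pool size, regardless of how many utterances the test set contains.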
We have performed multiple validations of our submission on the UNLOCKED test sets to ensure proper execution.
I just tried to test the already-merged aliyun_ftasr_api_zh, got:
log.SBI:
Invalid line: e5oipIfM49I__20190201_CCTV_1 Error: accessKeyID and ecsUrl are both empty
Invalid line: e5oipIfM49I__20190201_CCTV_104 Error: accessKeyID and ecsUrl are both empty
Your submission is not working. Please debug, fix, and test.
For this particular PR (parallel improvement), my questions are:
The trans variable in asr() seems to be undefined to me: it is neither in the global scope nor in the local scope of asr(), and I don't see a lambda capturing it. I might be wrong, but please explain.
One more thing: does this PR mean that your submission will fork thousands of subprocesses during the benchmark, given a test set with thousands of utterances?
I would not do this in practice. You may consider splitting the test set into 10 parts or so, running each part in a separate process with its own I/O traffic, and merging the results at the end.
The "Error: accessKeyID and ecsUrl are both empty" problem is probably caused by the oss configuration file not being generated in the home directory. (The default path of the configuration file is /home/user/.ossutilconfig.) We will first roll back to the code that has already been merged, then debug and fix based on it.
We modified the pipeline to specify the configuration file path, and added the -c option so that the configuration file is passed explicitly every time the ossutil64 command is invoked. We verified this on our machine: the oss configuration file was generated in the specified path and the expected result was returned.
Thanks for your suggestion. We now split the test set into several parts for testing, write the asr results of each part into its own txt file, and finally merge them into _rawrec.txt.
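The split-and-merge flow described here could be sketched roughly as follows. This is an illustration only: split_into_parts and merge_parts are hypothetical helper names, and the real pipeline may split and merge differently. Contiguous chunks are used so that concatenating the part files reproduces the original utterance order.

```python
def split_into_parts(lines, num_parts):
    """Split lines into at most num_parts contiguous, non-empty chunks."""
    if not lines:
        return []
    size = -(-len(lines) // num_parts)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def merge_parts(part_paths, merged_path):
    """Concatenate the per-part result files, in order, into one file."""
    with open(merged_path, "w", encoding="utf-8") as merged:
        for part_path in part_paths:
            with open(part_path, encoding="utf-8") as part:
                merged.write(part.read())
```

Each chunk would then be handed to its own process, which writes one part file; merge_parts is called once after all processes have finished.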
Now we place the _$ossscp(_${dir}/wavoss.scp) and other temp files in $dir, and write the myossconfig file to $dir as well. We also added the ossutil64 binary and its MD5 file. If the MD5 check fails, the pipeline exits.
Is there a way to avoid generating myossconfig? For example, it would be better if the credentials (AK and AK secret) were passed directly as ossutil arguments. I don't want the result dir to contain any "secret" stuff, because its contents should be shareable with the public someday. Thanks.
The default path of myossconfig is /home/user/.ossutilconfig, but if the path is not specified, the default config file may not be generated, which results in the "Error: accessKeyID and ecsUrl are both empty" problem. Maybe we can delete myossconfig every time the pipeline finishes. Is this feasible? Thanks.
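For illustration, an MD5 check of this kind might look like the sketch below. It assumes the MD5 file follows the usual md5sum layout (hex digest first, then the file name); the function names are hypothetical and the PR's actual check may well be a shell one-liner instead.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so a large binary is never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_expected_md5(md5_text):
    """Take the first whitespace-separated token, as in md5sum output."""
    return md5_text.split()[0].lower()


def verify_or_exit(binary_path, md5_path):
    """Stop the pipeline if the binary does not match its recorded MD5."""
    with open(md5_path, encoding="utf-8") as f:
        expected = parse_expected_md5(f.read())
    actual = md5_of_file(binary_path)
    if actual != expected:
        raise SystemExit(
            f"MD5 mismatch for {binary_path}: expected {expected}, got {actual}"
        )
```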
Does ossutil have options for AK and AK secret?
One more thing: I noticed that the oss://speechiotest/ bucket is used across all tests. Can we improve this by using a unique bucket name, so that different benchmarks never have storage overlaps? Maybe something like oss://speechiotest_unique_id, where the unique_id is generated from the test set id, or even a uuid-like identifier seeded by system time.
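One possible way to build such a name, as a sketch: combine a UTC timestamp with a short random suffix so that two runs started in the same second still differ. The helper name is hypothetical. A hyphen is used as the separator on the assumption that OSS bucket names accept only lowercase letters, digits and hyphens; that rule should be checked against the OSS documentation before relying on it.

```python
import time
import uuid


def unique_bucket_name(prefix="speechiotest", now=None):
    """Return e.g. 'speechiotest-20230801120000-1a2b3c4d'.

    now is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    suffix = uuid.uuid4().hex[:8]  # random part, avoids same-second clashes
    return f"{prefix}-{stamp}-{suffix}"
```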
Thanks for your suggestion. We now use the system time as the unique_id and have modified the bucket name accordingly. We also removed the ossutil64 config command and added the -i and -k options for the AK and AK secret. We verified this on our machine and it works.
Thanks. Seems good to me now. I'll merge this PR right before this season's benchmark (probably around Aug 20th); feel free to improve or fix this PR before that.
Call the aliyun_ftasr SDK in parallel instead of sequentially to improve efficiency.
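A rough sketch of what passing the credentials on the command line could look like from a Python wrapper. Only the -i and -k options come from the comment above; the cp subcommand, the environment variable names OSS_AK_ID and OSS_AK_SECRET, and both helper names are assumptions made for illustration, and endpoint handling is left out entirely. The function only builds the argument list and does not run anything.

```python
import os


def credentials_from_env(env=os.environ):
    """Read AK and AK secret from the environment (hypothetical names)."""
    return env["OSS_AK_ID"], env["OSS_AK_SECRET"]


def ossutil_cp_args(src, dest, key_id, key_secret, binary="./ossutil64"):
    """Build an ossutil64 upload command with inline credentials.

    Nothing is written to a config file, so no secret ends up in the
    result dir.
    """
    return [binary, "cp", src, dest, "-i", key_id, "-k", key_secret]
```

The list can be handed to subprocess.run as-is. One caveat worth keeping in mind: command-line arguments are generally visible to other local users through the process list, so this keeps secrets out of the result dir but not off the machine.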