fgnt / sms_wsj

SMS-WSJ: Spatialized Multi-Speaker Wall Street Journal database for multi-channel source separation and recognition
MIT License

Question about cache/wsj_8k_zeromean/11-13.1/wsj0/doc/indices/train/tr_s_wv1.ndx #7

Closed by sekiguchi92 4 years ago

sekiguchi92 commented 4 years ago

Dear fgnt,

I'm trying to create the SMS-WSJ dataset, but I have run into some problems.

While running python -m sms_wsj.database.wsj.create_json with json_path=$(JSON_DIR)/wsj_8k_zeromean.json database_dir=$(WSJ_8K_ZEROMEAN_DIR) as_wav=True from the Makefile, I got the following error:

File "sms_wsj/sms_wsj/database/wsj/create_json.py", line 146, in process_example_paths
    'kaldi_transcription': transcript['kaldi'][example_id]
KeyError: '401c0202'

I think the cause is cache/wsj_8k_zeromean/11-13.1/wsj0/doc/indices/train/tr_s_wv1.ndx. In my case, 401c0202.wv1 exists only in WSJ0_root/11-3.1/wsj0/si_tr_s/401/, but tr_s_wv1.ndx has two lines for 401c0202, as follows:

Thus, when the program tried to access 11-2.1/wsj0/si_tr_s/401/401c0202.wv1, it stopped.

What is the cause of this problem? The code? WSJ0?

WARNING:Create wsj json:No observers have been added to this run

How can I solve this problem?

The numbers of files are different from what you expected.
expected -> 'pl': 3, 'ndx': 106, 'ptx': 3547, 'dot': 3585, 'txt': 256
I found -> 'pl': 3, 'ndx': 106, 'ptx': 3073, 'dot': 3095, 'txt': 208
Does this cause a big problem?

I am sorry for the inconvenience, and I look forward to your reply.

jensheit commented 4 years ago

Thank you for your interest in SMS-WSJ. Let me answer your questions step by step:

First Problem: The ndx file does not seem to be the problem. We have the same two lines in our file. Did you run the setup for the Kaldi wsj example? The problem might occur if you did not specify a working Kaldi wsj data directory. You should have the directory $KALDI_ROOT/egs/wsj/s5/data/local/data. If you have questions about how to set up Kaldi, please refer to the Kaldi repository. If you have questions specifically about the required steps in the Kaldi wsj run script, please open a new issue.
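
A quick way to check the directory mentioned above is sketched below. This is only a minimal sketch, not part of sms_wsj or Kaldi: the helper names kaldi_wsj_data_dir and check_kaldi_wsj_data are hypothetical, and the check only looks at whether the directory exists and is non-empty, not at which files the Kaldi wsj recipe is supposed to have written there.

```python
import os
from pathlib import Path


def kaldi_wsj_data_dir(kaldi_root):
    """Build the path $KALDI_ROOT/egs/wsj/s5/data/local/data (hypothetical helper)."""
    return Path(kaldi_root) / "egs" / "wsj" / "s5" / "data" / "local" / "data"


def check_kaldi_wsj_data(kaldi_root):
    """Return a list of problems; an empty list means the directory exists and is non-empty."""
    problems = []
    data_dir = kaldi_wsj_data_dir(kaldi_root)
    if not data_dir.is_dir():
        problems.append(f"missing directory: {data_dir}")
    elif not any(data_dir.iterdir()):
        # Assumption: an empty directory means the Kaldi wsj data preparation has not run.
        problems.append(f"directory is empty: {data_dir}")
    return problems


if __name__ == "__main__":
    root = os.environ.get("KALDI_ROOT")
    if root is None:
        print("KALDI_ROOT is not set")
    else:
        for message in check_kaldi_wsj_data(root) or ["no problems found"]:
            print(message)
```

If this reports a missing or empty directory, the KeyError on transcript['kaldi'] is plausibly explained by the Kaldi transcriptions not being available, rather than by the ndx file.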

Second Problem: This is just a warning and can be ignored. We deliberately chose not to use an observer here. However, we will discuss whether we can avoid the warning in a future update.

Third Problem: If you are missing some essential data, the script should raise an error. Therefore, I would assume the missing data are not a problem going forward.
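
To see which file types fall short, the counts can be compared per extension. The following is a minimal sketch using the numbers quoted in the question. The helpers count_by_extension and count_differences are hypothetical and are not the functions sms_wsj uses internally. Matching extensions case-insensitively is an assumption about how the files are named on disk.

```python
from pathlib import Path

# Counts quoted in the question above.
EXPECTED = {'pl': 3, 'ndx': 106, 'ptx': 3547, 'dot': 3585, 'txt': 256}
FOUND = {'pl': 3, 'ndx': 106, 'ptx': 3073, 'dot': 3095, 'txt': 208}


def count_by_extension(root, extensions):
    """Count files below root for each extension (assumption: case-insensitive match)."""
    counts = {ext: 0 for ext in extensions}
    for path in Path(root).rglob('*'):
        ext = path.suffix.lstrip('.').lower()
        if ext in counts and path.is_file():
            counts[ext] += 1
    return counts


def count_differences(expected, found):
    """Return found minus expected for every extension whose counts differ."""
    return {
        ext: found.get(ext, 0) - n
        for ext, n in expected.items()
        if found.get(ext, 0) != n
    }


if __name__ == "__main__":
    print(count_differences(EXPECTED, FOUND))
    # prints {'ptx': -474, 'dot': -490, 'txt': -48}
```

With a local copy, count_by_extension(database_dir, EXPECTED) would produce the "found" side of the comparison, where database_dir stands for whatever directory was passed to the script.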

sekiguchi92 commented 4 years ago

Thank you for your advice. I managed to create the dataset.