Open srewai opened 3 years ago
Is there a specific version of the Common Voice dataset that should be downloaded? I am using version en_2181h_2020-12-11.
Having the same issue - I left generate_dataset.sh running overnight and it didn't finish. Using cv-corpus-6.1-2020-12-11, commit f8f5ac1147c033e6724a63ea470f7bd61ba77764
Looks like it tries to read 28303 clips for the negative dataset for my desired input of hey_computer. Combined with #56, it's probably what's taking forever.
Workaround: Comment out the line print_stats('Dataset', ctx, train_ds, dev_ds, test_ds, compute_length=True) in create_raw_dataset.py
It loads all the files just to print some stats to you, during which there's no feedback. The file data is theoretically cached in an LRU cache, but it's unlikely much will be reused.
The writing files step is still very slow, but at least it will give you a progress bar instead of just sitting there.
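To illustrate the point above, here is a rough sketch of how a stats helper could skip the file reads unless lengths are explicitly requested, and report progress when it does read them. This is not the project's actual print_stats; the names summarize, Stats, load_duration and report_every are hypothetical and only there for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stats:
    num_examples: int
    total_seconds: Optional[float]  # None when lengths were not computed


def summarize(
    paths: Iterable[str],
    load_duration: Callable[[str], float],  # hypothetical: returns clip length in seconds
    compute_length: bool = False,
    report_every: int = 1000,
) -> Stats:
    """Count examples and, only if asked, sum their durations."""
    paths = list(paths)
    if not compute_length:
        # Cheap path: no audio file is opened at all.
        return Stats(len(paths), None)

    total = 0.0
    for i, path in enumerate(paths, start=1):
        total += load_duration(path)  # the expensive part: one file read per clip
        if i % report_every == 0 or i == len(paths):
            print(f"measured {i}/{len(paths)} clips")
    return Stats(len(paths), total)
```

With compute_length left at False the call returns immediately with just the counts, which is roughly what commenting out the line achieves. With it set to True you at least get periodic feedback instead of a silent wait.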
So it works for you?
@srewai I'm making several code changes to hopefully resolve some of the inefficiencies in the training dataset creator, but I'm running into other issues atm. I'll publish a fork when I am done.
I'll take a look at this issue soon. IIRC the print_stats function loads a lot of the files if compute_length is True.
Hi, thanks for the great work. When I try to create the positive dataset for the keyword 'fire' following the readme, it works fine, but when I try to create the negative dataset it hangs forever. Any idea where the problem might be?