not able to create negative dataset

castorini / howl

Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice.

Mozilla Public License 2.0


not able to create negative dataset #86

Open srewai opened 3 years ago

srewai commented 3 years ago

Hi, thanks for the great work. When I try to create the positive dataset for the keyword 'fire' following the README, it works fine, but when I try to create the negative dataset it hangs forever. Any idea where the problem might be?

srewai commented 3 years ago

Is there a specific version of the Common Voice dataset that should be downloaded? I am using version en_2181h_2020-12-11.

ColonelThirtyTwo commented 3 years ago

Having the same issue - I left generate_dataset.sh running overnight and it didn't finish. Using cv-corpus-6.1-2020-12-11, commit f8f5ac1147c033e6724a63ea470f7bd61ba77764

ColonelThirtyTwo commented 3 years ago

Looks like it tries to read 28303 clips for the negative dataset for my desired input of hey_computer. Combined with #56, it's probably what's taking forever.

ColonelThirtyTwo commented 3 years ago

Workaround: Comment out the line print_stats('Dataset', ctx, train_ds, dev_ds, test_ds, compute_length=True) in create_raw_dataset.py

It loads all the files just to print some stats to you, during which there's no feedback. The file data is theoretically cached in an LRU cache, but it's unlikely much will be reused.
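To illustrate why the cache is unlikely to help, here is a minimal sketch using Python's functools.lru_cache. The load_clip function, the clip names, and the cache size are hypothetical stand-ins, not howl's actual loader. The only point is that a single pass over distinct files never hits the cache.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def load_clip(path: str) -> bytes:
    # Hypothetical stand-in for an expensive audio decode.
    return path.encode("utf-8")


# One pass over many distinct clips, as a stats computation would do.
paths = [f"clip_{i}.mp3" for i in range(1000)]
for p in paths:
    load_clip(p)

info = load_clip.cache_info()
print(f"hits={info.hits} misses={info.misses} cached={info.currsize}")
```

Every call is a miss because no path repeats, so the cache only adds memory pressure during this step.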

The writing files step is still very slow, but at least it will give you a progress bar instead of just sitting there.

srewai commented 3 years ago

So it works for you?

ColonelThirtyTwo commented 3 years ago

@srewai I'm making several code changes to hopefully resolve some of the inefficiencies in the training dataset creator, but I'm running into other issues atm. I'll publish a fork when I am done.

daemon commented 3 years ago

I'll take a look at this issue soon. IIRC the print_stats function loads a lot of the files if compute_length is True.
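As a rough sketch of how a stats helper could avoid the silent hang, the function below only touches the files when a duration loader is passed in, and reports progress while it does. The names summarize, load_duration, and report_every are assumptions made for illustration; this is not howl's print_stats.

```python
from typing import Callable, Dict, Iterable, Optional


def summarize(
    paths: Iterable[str],
    load_duration: Optional[Callable[[str], float]] = None,
    report_every: int = 1000,
) -> Dict[str, float]:
    """Hypothetical helper: count clips, optionally summing their durations."""
    path_list = list(paths)
    stats: Dict[str, float] = {"num_examples": len(path_list)}
    if load_duration is None:
        # Cheap path: no file is opened.
        return stats
    total = 0.0
    for i, path in enumerate(path_list, 1):
        total += load_duration(path)
        # Periodic feedback so a long run does not look like a hang.
        if i % report_every == 0 or i == len(path_list):
            print(f"measured {i}/{len(path_list)} clips")
    stats["total_seconds"] = total
    return stats
```

Calling summarize(paths) returns only the count, while summarize(paths, load_duration=some_loader) does the expensive pass with visible progress.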