Problem with MakeLMDB - always uses first pre-processed data set

markmuir87 commented 7 years ago

After completing training on the AN4 data set, I can't seem to get past the pre-processing step for the LibriSpeech data. No matter what arguments I provide or hard-code in MakeLMDB.lua, it keeps generating the LMDBs for the AN4 test and training sets (identical md5sums).

I've yet to test this out (and my Torch programming skills are rudimentary to say the least) but I think the problem might be in the second line of the 'createLMDB' function (in MakeLMDB.lua):

local sortIdsPath = 'sort_ids_'.. id .. '.t7' -- in case of crash, sorted ids are saved

I think what's happening is it's finding "sort_ids_test.t7" and "sort_ids_train.t7", left over from the AN4 run, in the root folder. This causes it to skip the if branch and fall through to the else branch:

vecs = torch.load(sortIdsPath)

The result is that it reloads the AN4 sorted IDs but still uses the lmdbPath argument supplied by the user, so it regenerates the same LMDB in a different folder. Hilariously, if the user doesn't notice, they'll end up getting some really good WER and CER scores and believe that all their training is paying off, when they're just massively over-fitting :)

Two easy ways to deal with this:

1. Just add a line in the documentation reminding people to delete the 'sort_ids_*.t7' files between pre-processing runs. Might be a bit of a hack...
2. Add a line of code at the bottom of the createLMDB function (I think that's the right place?) to delete these files, since they're no longer needed after successful pre-processing (I think?)
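
A minimal sketch of the second option, for illustration only. It is not the actual createLMDB code: the function names, the computeIds and writeLMDB callbacks, and the use of plain Lua file I/O in place of torch.save / torch.load are all assumptions. It shows the cache being removed only after a successful run, so a crashed run can still resume from the saved ids.

```lua
-- Hypothetical sketch; names below are NOT taken from MakeLMDB.lua.
-- Plain Lua file I/O stands in for torch.save / torch.load.

-- Returns true if a file can be opened for reading at the given path.
local function fileExists(path)
    local f = io.open(path, 'r')
    if f then
        f:close()
        return true
    end
    return false
end

-- Returns the cached sorted ids if the cache file exists; otherwise
-- computes them with computeIds (hypothetical callback) and caches them.
local function loadOrComputeIds(sortIdsPath, computeIds)
    if fileExists(sortIdsPath) then
        local f = assert(io.open(sortIdsPath, 'r'))
        local cached = f:read('*a')
        f:close()
        return cached
    end
    local ids = computeIds()
    local f = assert(io.open(sortIdsPath, 'w'))
    f:write(ids)
    f:close()
    return ids
end

-- Runs one pre-processing pass for the given id ('train', 'test', ...).
-- writeLMDB is a hypothetical callback standing in for the LMDB writing step.
local function preprocess(id, computeIds, writeLMDB)
    local sortIdsPath = 'sort_ids_' .. id .. '.t7'
    local ids = loadOrComputeIds(sortIdsPath, computeIds)
    writeLMDB(ids)          -- if this raises an error, the cache file is kept
    os.remove(sortIdsPath)  -- success: drop the cache so the next data set starts fresh
    return ids
end
```

With this shape, calling preprocess('train', ...) for AN4 and then again for LibriSpeech computes fresh ids on the second call, because the first call removed its cache file on success. If writeLMDB fails partway through, the cache file stays on disk and the next run picks it up.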

markmuir87 commented 7 years ago

Just confirming that the above is correct. After deleting the sort_ids files (and the new lmdb directory) it's now pre-processing the new data set.

SeanNaren commented 7 years ago

Sorry I was late to this, and I've actually run into the same issue. I like the idea of deleting them after usage; I'll get to this as soon as I can! Thanks so much @markmuir87 and awesome display picture :D

markmuir87 commented 7 years ago

No problemo @SeanNaren, glad to help. Thanks for sharing such an awesome ML model which, for my money, is the 'openface' of speech recognition :)

SeanNaren commented 7 years ago

Removed those pesky sort ids for now in this!

SeanNaren / deepspeech.torch

Problem with MakeLMDB - always uses first pre-processed data set #78