Open bparbhu opened 7 months ago
Hey guys,
Happy to be part of this competition! One thing we noticed in our training runs is that there is no lexicon file in the original repo. There are many references to it, as lexicon.txt or similarly named files sitting in a temporary directory. Could you provide a link to a module that generates it, or the actual files themselves? There is also a words.txt file we were looking at that might be the source for creating the lexicon files, but we're not sure, since it isn't included in the repository either. Any feedback or help is much appreciated, and thanks again for everything here.
-Oliver Shetler and Brian Parbhu
Hi Oliver & Brian! The language model files you can download on Dryad (https://datadryad.org/stash/dataset/doi:10.5061/dryad.x69p8czpq) should be all that's needed to run the 3-gram or 5-gram decoder (assuming you've also compiled the language model decoder: https://github.com/fwillett/speechBCI/tree/main/LanguageModelDecoder). Once you've done those steps, you should be able to run the example inference notebook (https://github.com/fwillett/speechBCI/blob/main/AnalysisExamples/rnn_step3_baselineRNNInference.ipynb). Are you running into errors with those steps, or do you just want to examine the lexicon.txt file? The lexicon.txt file was not included.
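If it's useful, here's a quick sanity check to run before the notebook. This is a minimal sketch; the file names inside the unpacked model directory are assumptions based on a typical Kaldi-style decoder graph, so adjust them to match your download:

```python
from pathlib import Path

# Path to the unpacked 3-gram or 5-gram model downloaded from Dryad.
LM_DIR = Path("languageModel")
# Assumed file names for a Kaldi-style decoder graph; adjust to your archive.
EXPECTED = ["TLG.fst", "words.txt"]

missing = [name for name in EXPECTED if not (LM_DIR / name).exists()]
if missing:
    raise FileNotFoundError(f"missing from {LM_DIR}: {missing}")
print("Language model files found; the inference notebook should run.")
```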
Hi Frank,
We're trying to examine the lexicon file. We want to confirm whether its tokens come only from the train/test sentences in the competition data. Our concern is that, given the relatively "small" size of the lexicon, our tokenizer could fail to include important tokens from the holdout set. If that happens, it might prevent the CTC beam search from finding correct answers on the holdout set. We just want to make sure our tokenizer has the same linguistic inputs as yours when constructing the lexicon.
If you could clarify what linguistic data was used for lexicon construction, that would be very helpful as well.
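To make the concern concrete, this is roughly the check we'd like to run. It's only a sketch: the file paths and the lexicon line format (word first, phonemes after, one entry per line) are assumptions on our part:

```python
# Rough out-of-vocabulary check: which holdout-set words are missing from
# the decoder's lexicon?

def load_lexicon_words(path):
    with open(path) as f:
        return {line.split()[0].lower() for line in f if line.strip()}

def oov(sentences, lexicon_words):
    words = [w.lower().strip(".,?!") for s in sentences for w in s.split()]
    missing = sorted({w for w in words if w and w not in lexicon_words})
    return len(missing), len(set(words)), missing

lexicon_words = load_lexicon_words("lexicon.txt")
holdout = ["replace these strings", "with the holdout sentences"]
n_missing, n_types, missing = oov(holdout, lexicon_words)
print(f"{n_missing}/{n_types} word types missing; e.g. {missing[:20]}")
```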
Cheers, Oliver and Brian
Hi again,
For information about the language model, see section 8 of the supplementary materials of the paper (https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-023-06377-x/MediaObjects/41586_2023_6377_MOESM1_ESM.pdf).
The language model contains all 125k words in the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The language model was not built from the small amount of text in the train/test set - it was built from OpenWebText2 (https://openwebtext2.readthedocs.io/en/latest/). Feel free to use large datasets from outside the competition for the language model part - we never intended for the language model to be constrained to only the text in the competition data, which I'm sure would miss some words in the holdout set and would just be a poorly performing model in general.
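For reference, regenerating a word-to-phoneme lexicon from the CMU dictionary takes only a few lines. This is a sketch assuming the standard cmudict-0.7b file format (comment lines start with ";;;", alternate pronunciations are marked like WORD(2)); it is not necessarily the exact format the decoder consumes:

```python
import re

def cmudict_to_lexicon(cmudict_path, lexicon_path):
    # cmudict-0.7b is Latin-1 encoded and uses ";;;" for comment lines.
    with open(cmudict_path, encoding="latin-1") as src, \
            open(lexicon_path, "w") as dst:
        for line in src:
            if not line.strip() or line.startswith(";;;"):
                continue
            word, phones = line.split(maxsplit=1)
            word = re.sub(r"\(\d+\)$", "", word)  # drop variant markers like (2)
            dst.write(f"{word.lower()} {phones.strip()}\n")

cmudict_to_lexicon("cmudict-0.7b", "lexicon.txt")
```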
Attaching the lexicon.txt file here.
Best, Frank