IBM / Train-Custom-Speech-Model

Create a custom Watson Speech to Text model using specialized domain data
https://developer.ibm.com/patterns/customize-and-continuously-train-your-own-watson-speech-service/
Apache License 2.0

inconsistency between preparing data and using it in UI #44

Closed rhagarty closed 5 years ago

rhagarty commented 5 years ago

When preparing the corpus data, we tell the user to issue the following command:

sed -f fixup.sed Documents/*.txt > corpus-1.input

But when trying to upload the corpus file in the UI, it only allows txt files.

rhagarty commented 5 years ago

I renamed the file to corpus-1.txt and tried to upload it via the UI. I see the following error in the logs:


```
PayloadTooLargeError: request entity too large
[0]     at readStream (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/raw-body/index.js:155:17)
[0]     at getRawBody (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/raw-body/index.js:108:12)
[0]     at read (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/body-parser/lib/read.js:77:3)
[0]     at jsonParser (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/body-parser/lib/types/json.js:135:5)
[0]     at Layer.handle [as handle_request] (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express/lib/router/layer.js:95:5)
[0]     at trim_prefix (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express/lib/router/index.js:317:13)
[0]     at /Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express/lib/router/index.js:284:7
[0]     at Function.process_params (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express/lib/router/index.js:335:12)
[0]     at next (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express/lib/router/index.js:275:10)
[0]     at Immediate.<anonymous> (/Users/rhagarty/journeys/Train-Custom-Speech-Model/node_modules/express-session/index.js:489:7)
```

tonanhngo commented 5 years ago

Hi Rich, good that you caught the inconsistency in the file naming between the GUI and the command line. I did not use the .txt extension for the output because the source files already use this extension and I was using the "*.txt" expression. Renaming the file should make it consistent with what the GUI wants. The error apparently comes from a size limit in the GUI code.
@yhwang @pvaneck Is this correct? Can we bump up or remove the limit?

yhwang commented 5 years ago

The default limit in body-parser is 100kb, which is small. I checked the Watson Speech to Text API and can't find a file size limit for the addCorpus API. We definitely need to increase the file size limit on our end. The question is what the proper size is.

@rhagarty can you share the file size of your corpus-1.txt?

tonanhngo commented 5 years ago

@yhwang The text file is 669 KB. Note that the audio files are 60-85 MB and the upload seems to work. I guess there is a different size limit for those?

yhwang commented 5 years ago

@tonanhngo There are two handlers in our code, body-parser and multer. The audio file is handled by multer, so the two kinds of upload have different limits.

I think an audio file should usually be bigger than a corpus file. A 1 or 2 MB text file should be pretty big already.

tonanhngo commented 5 years ago

We can split up the text file into multiple corpora and upload them individually. We just need to document the size limit so the user knows how to split the file. The limit for the audio file is 100 MB (Watson API limit). I guess 2 MB is reasonable for the text corpus.
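
For illustration, splitting a large corpus on line boundaries could look roughly like the sketch below. The function name `splitCorpus` and its parameters are hypothetical and not part of this repo; it uses only Node built-ins and measures size in UTF-8 bytes. A single line longer than the limit is kept whole, so it would still exceed the limit.

```javascript
// Hypothetical helper (not from this repo): split corpus text into chunks of
// at most maxBytes UTF-8 bytes each, breaking only on line boundaries.
function splitCorpus(text, maxBytes) {
  const lines = text.split('\n');
  // A trailing newline produces an empty last element; drop it.
  if (lines.length > 0 && lines[lines.length - 1] === '') {
    lines.pop();
  }

  const chunks = [];
  let current = [];
  let currentBytes = 0;

  for (const line of lines) {
    const lineBytes = Buffer.byteLength(line, 'utf8') + 1; // +1 for the newline
    // Start a new chunk when adding this line would exceed the limit.
    if (current.length > 0 && currentBytes + lineBytes > maxBytes) {
      chunks.push(current.join('\n') + '\n');
      current = [];
      currentBytes = 0;
    }
    current.push(line);
    currentBytes += lineBytes;
  }
  if (current.length > 0) {
    chunks.push(current.join('\n') + '\n');
  }
  return chunks;
}
```

Each returned chunk could then be written to its own .txt file, for example with `fs.writeFileSync`, and uploaded as a separate corpus. Note that the upload request may be somewhat larger than the file itself if the text is wrapped in JSON, so leaving some headroom under 2 MB seems prudent.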

yhwang commented 5 years ago

Okay, let's use 2 MB for the corpus. Let me also check whether we set a limit for the audio file.
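
As a rough sketch of what that change might look like: the helper `applyUploadLimits` and the constant names below are hypothetical, and the app's actual setup code may be organized differently. It relies on the `limit` option of body-parser's `json()` and the `limits.fileSize` option of multer. The modules are passed in as arguments so the snippet stands alone.

```javascript
// Hypothetical constants reflecting the sizes discussed in this thread.
const CORPUS_BODY_LIMIT = '2mb';                // JSON body limit for corpus uploads
const AUDIO_FILE_LIMIT_BYTES = 100 * 1024 * 1024; // 100 MB audio limit

// Hypothetical helper: raise the body-parser limit from its 100kb default and
// create a multer instance that caps audio uploads.
function applyUploadLimits(app, bodyParser, multer) {
  // body-parser handles the JSON request that carries the corpus text.
  app.use(bodyParser.json({ limit: CORPUS_BODY_LIMIT }));
  // multer handles the multipart audio upload; fileSize is given in bytes.
  return multer({ limits: { fileSize: AUDIO_FILE_LIMIT_BYTES } });
}

// Usage in the server setup, assuming express, body-parser and multer are installed:
//   const app = require('express')();
//   const upload = applyUploadLimits(app, require('body-parser'), require('multer'));
```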