yemaney closed this issue 2 years ago
Hi! It looks like the custom vocabulary being passed to Transcribe contains incorrectly formatted words. I checked the Lambda function that creates the vocabulary, and for the example feed it takes in the following terms:
["-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer", "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace", "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark", "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low", "Intelligent-Automation"]
It seems that a term of the form "-word" (here "-Hawn") is not accepted when creating a custom vocabulary, and that is what causes the issue. The Lambda function that does the preprocessing should be reviewed: podcast-transcribe-index-createTranscribeVocabular***
Hope this helps!
With help from @Dani Mitchells, I found out the problem was that Transcribe doesn't accept words starting with - when creating a custom vocabulary.
Solved it by adding an extra check for this case in the podcast-transcribe-index-createTranscribeVocabular function.
mapping[item] = origItem
# Transcribe rejects vocabulary terms that start with '-', so strip it
if item.startswith("-"):
    item = item[1:]
vocabularyTerms.append(item)
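For anyone who wants to try the check outside the Lambda, here is a minimal standalone sketch of the same idea. The function name sanitize_vocabulary_terms and the shape of the returned mapping are my own assumptions, not the actual code in the deployed function. It also skips terms that become empty and drops duplicates that appear after stripping (the example feed has both "-Hawn" and "Hawn"), which is a guess at what the vocabulary needs rather than something confirmed here.

```python
def sanitize_vocabulary_terms(terms):
    """Hypothetical helper: strip leading hyphens from vocabulary terms.

    Returns (clean_terms, mapping), where mapping goes from each cleaned
    term back to the first original term that produced it.
    """
    clean_terms = []
    mapping = {}
    for orig_item in terms:
        # Remove every leading '-' so the term no longer starts with one.
        item = orig_item.lstrip("-")
        if not item:
            # The term was only hyphens; nothing usable is left.
            continue
        if item in mapping:
            # Stripping produced a duplicate (assumption: keep one copy).
            continue
        mapping[item] = orig_item
        clean_terms.append(item)
    return clean_terms, mapping


if __name__ == "__main__":
    sample = ["-Hawn", "Cloud", "A-W-S", "Hawn", "Amazon-Connect"]
    cleaned, original_for = sanitize_vocabulary_terms(sample)
    print(cleaned)
    print(original_for)
```

Hyphens inside a term (like "A-W-S") are left alone; only leading ones are removed.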
Using the one-click deployment resulted in an error. Any clues as to why? I've checked the Lambda function's CloudWatch logs, and the worst I get is a warning that the urllib library in use is out of date, but no actual errors.