Proper sanitization (or error handling) for vocabulary generation

aws-samples / amazon-transcribe-comprehend-podcast

A demo application that transcribes and indexes podcast episodes so that listeners can explore and discover episodes of interest and podcast owners can run analytics on the content over time. This solution leverages Amazon Transcribe, Amazon Comprehend, Amazon ElasticSearch, AWS Step Functions, and AWS Lambda.

https://aws.amazon.com/blogs/machine-learning/discovering-and-indexing-podcast-episodes-using-amazon-transcribe-and-amazon-comprehend/

MIT No Attribution


Proper sanitization (or error handling) for vocabulary generation #11

Open jcutrell opened 3 years ago

jcutrell commented 3 years ago

I'm getting this error during vocabulary generation on my feed:

The vocabulary that you’re trying to create contains invalid characters or incorrectly formatted terms. See the developer guide for more information.

Unfortunately the error doesn't provide any further information on the offending characters, or else I'd go and adjust them in the copy itself.

Perhaps it makes sense to sanitize or escape from this error state since the vocabulary can safely discard a minority of values?

DanyStinson commented 2 years ago

Hi! It looks like incorrectly formatted words are being passed to Transcribe when the custom vocabulary is created. I checked the Lambda function in charge of creating the vocabulary, and for the example feed it receives the following terms:

["-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer", "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace", "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark", "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low", "Intelligent-Automation"]

It seems that a term of the form "-word" (in this case "-Hawn") is not accepted when creating a custom vocabulary, and that is what causes the issue. The Lambda function that does the preprocessing should be reviewed: podcast-transcribe-index-createTranscribeVocabular***

Hope this helps!

yemaney commented 2 years ago

With @DanyStinson's help, I found out that the problem occurs because Transcribe doesn't accept words starting with - when creating a custom vocabulary.

Solved it by adding an extra check for this case in the podcast-transcribe-index-createTranscribeVocabular function.

mapping[item] = origItem
# Transcribe rejects vocabulary terms that start with '-', so drop the leading hyphen
if item.startswith("-"):
    item = item[1:]
vocabularyTerms.append(item)

DanyStinson commented 2 years ago

Glad this helped! I have notified the service team about this issue so that it can be included in the documentation.

dchaplinsky commented 2 years ago

I had a similar issue, but it was caused by an ' (apostrophe) at the beginning or at the end of the line. A hell of a lot of debugging :(
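
The two cases reported in this thread (a leading hyphen, and an apostrophe at the start or end of a term) could be handled in one place before the terms are sent to Transcribe. The following is only a minimal sketch: `sanitize_vocabulary_terms` is a hypothetical helper, not a function from the repository, and it makes no claim to cover every rule Transcribe applies to custom vocabularies. Stripping trailing hyphens and dropping duplicates are extra assumptions beyond what was reported here.

```python
def sanitize_vocabulary_terms(terms):
    """Strip leading/trailing '-' and "'" from each term.

    Hypothetical helper (not part of the repository). It only covers the
    cases reported in this thread, plus trailing hyphens as an assumption.
    Terms that end up empty, or that duplicate an earlier term after
    stripping, are discarded.
    """
    cleaned = []
    seen = set()
    for term in terms:
        # str.strip with a character set removes any run of those
        # characters from both ends, e.g. "-Hawn" -> "Hawn".
        stripped = term.strip("-'")
        if stripped and stripped not in seen:
            seen.add(stripped)
            cleaned.append(stripped)
    return cleaned


if __name__ == "__main__":
    example = ["-Hawn", "Cloud", "A-W-S", "Hawn", "'tis'", "-"]
    print(sanitize_vocabulary_terms(example))
```

Hyphens inside a term (as in "A-W-S") are left alone, since only the ends of each term are touched. If the Lambda keeps a mapping from vocabulary terms back to the original text, that mapping would need to be keyed on the cleaned term as well.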