aws-samples / amazon-transcribe-comprehend-podcast

A demo application that transcribes and indexes podcast episodes so that listeners can explore and discover episodes of interest and podcast owners can run analytics on the content over time. This solution leverages Amazon Transcribe, Amazon Comprehend, Amazon Elasticsearch, AWS Step Functions, and AWS Lambda.
https://aws.amazon.com/blogs/machine-learning/discovering-and-indexing-podcast-episodes-using-amazon-transcribe-and-amazon-comprehend/
MIT No Attribution

Fail state executed in step: Processing Error #12

Closed yemaney closed 2 years ago

yemaney commented 2 years ago

Using the one-click deployment resulted in an error. Are there any clues as to why? I've checked the Lambda function's CloudWatch logs, and the worst I get is a warning that the urllib library in use is out of date, but no actual errors.

DanyStinson commented 2 years ago

Hi! It looks like the step that creates the custom vocabulary is passing incorrectly formatted words to Transcribe. I checked the Lambda function in charge of creating the vocabulary, and for the example feed it takes in the following terms:

["-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer", "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace", "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark", "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low", "Intelligent-Automation"]

It seems that a term of the form "-word" is not accepted when creating a custom vocabulary (in this case, "-Hawn" is causing the issue), so the Lambda function in charge of the preprocessing should be reviewed: podcast-transcribe-index-createTranscribeVocabular***
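A quick way to confirm which terms are affected is to filter the list for a leading hyphen. This is only a minimal sketch: the helper name `find_leading_hyphen_terms` is made up for illustration and is not part of the deployed Lambda function, and the only rule it checks is the leading "-" case discussed here.

```python
# Terms copied from the example feed listed above.
terms = [
    "-Hawn", "Cloud", "A-W-S-A-I-Services", "Amazon", "Code-Whisperer",
    "Pillir", "A-W-S", "S-A-P", "U-K-T-V", "Media-two-Cloud", "Marketplace",
    "Mainframe-Modernization", "E-M-R-Serverless", "E-M-R", "Apache", "Spark",
    "Hive", "Hawn", "Amazon-Connect", "Local-Measure", "low",
    "Intelligent-Automation",
]


def find_leading_hyphen_terms(candidates):
    """Return the terms that start with a hyphen (hypothetical helper)."""
    return [term for term in candidates if term.startswith("-")]


print(find_leading_hyphen_terms(terms))  # prints ['-Hawn']
```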

Hope this helps!

yemaney commented 2 years ago

With @DanyStinson's help, I found that the problem was caused by Transcribe not accepting words that start with "-" when creating a custom vocabulary.

I solved it by adding an extra check for this case in the podcast-transcribe-index-createTranscribeVocabular function:

mapping[item] = origItem
# Transcribe rejects custom vocabulary terms that start with "-",
# so drop a leading hyphen before adding the term.
if item.startswith("-"):
    item = item[1:]
vocabularyTerms.append(item)
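For anyone who wants to try the idea outside the Lambda function, here is a self-contained sketch of the same check. It rests on a few assumptions that go beyond the snippet above: the function name `clean_vocabulary_terms` is hypothetical, the mapping is keyed by the cleaned term rather than the unstripped one, all leading hyphens are removed rather than just the first, and duplicates produced by the cleanup (for example "-Hawn" and "Hawn") are skipped. Whether Transcribe requires de-duplication is not confirmed in this thread.

```python
def clean_vocabulary_terms(items):
    """Strip leading hyphens, drop empty results, and skip duplicates.

    Returns (vocabulary_terms, mapping), where mapping goes from each
    cleaned term back to the first original item that produced it.
    Hypothetical helper, not the code deployed by the sample.
    """
    vocabulary_terms = []
    mapping = {}
    for orig_item in items:
        # Remove every leading "-" (assumption: the posted fix removes one).
        item = orig_item.lstrip("-")
        # Skip terms that became empty or that were already added.
        if not item or item in mapping:
            continue
        mapping[item] = orig_item
        vocabulary_terms.append(item)
    return vocabulary_terms, mapping


cleaned, mapping = clean_vocabulary_terms(["-Hawn", "Cloud", "Hawn", "A-W-S"])
print(cleaned)  # prints ['Hawn', 'Cloud', 'A-W-S']
print(mapping)  # prints {'Hawn': '-Hawn', 'Cloud': 'Cloud', 'A-W-S': 'A-W-S'}
```

Keying the mapping by the cleaned term means a lookup with the term that was actually sent to Transcribe still finds the original text; if the rest of the function expects the unstripped key, keep the original ordering instead.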