banjtheman commented 3 years ago

What is the Task

We want to be able to transcribe audio files from openmhz

Why do we want to do this

In order to capture radio data

How can I get started?

TODO

Definition of Done

Transcribed audio data is stored in the database

banjtheman commented 3 years ago

Current Code

Have created a script (https://github.com/CharlotteJackson/DC_Crash_Bot/blob/audio_transcirbe/scripts/transcribe_audio.py) that does the following...

Identify calls from openmhz
- This is the API: https://api.openmhz.com/dcfd/calls/
Get URI for audio
Download audio file
Convert audio file to .wav
Use https://pypi.org/project/SpeechRecognition/ to transcribe audio
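A minimal sketch of the steps above, for anyone picking this up. It is not the linked script itself. The response shape (`{"calls": [{"url": ...}]}`), the helper names, and the use of the `ffmpeg` CLI for the .wav conversion are all assumptions made for illustration; only the API URL and the SpeechRecognition package come from this thread.

```python
import json
import subprocess
import urllib.request
from pathlib import Path

CALLS_API = "https://api.openmhz.com/dcfd/calls/"


def fetch_calls(api_url=CALLS_API):
    """Fetch the latest calls and return the parsed JSON payload."""
    with urllib.request.urlopen(api_url, timeout=30) as resp:
        return json.load(resp)


def audio_urls(payload):
    """Pull the audio URIs out of the API response.

    ASSUMPTION: the payload looks like {"calls": [{"url": "..."}, ...]}.
    Adjust the keys to whatever the real response uses.
    """
    return [call["url"] for call in payload.get("calls", []) if call.get("url")]


def download(url, dest_dir="audio"):
    """Download one audio file and return its local path."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return dest


def to_wav(src):
    """Convert an audio file to .wav using the ffmpeg CLI (must be installed)."""
    wav = Path(src).with_suffix(".wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(src), str(wav)], check=True)
    return wav


def transcribe(wav_path):
    """Transcribe a .wav file with the SpeechRecognition package."""
    import speech_recognition as sr  # pip install SpeechRecognition

    recognizer = sr.Recognizer()
    with sr.AudioFile(str(wav_path)) as source:
        audio = recognizer.record(source)  # read the whole file
    return recognizer.recognize_google(audio)
```

Chained together, this would look roughly like `for url in audio_urls(fetch_calls()): print(transcribe(to_wav(download(url))))`.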

Open Questions:

Not sure where talkgroup is mapped in openmhz
Need to get an estimate of how many calls we will transcribe; cloud services offer 60 free minutes a month
- Perhaps we can use AWS and Google to get 120 minutes of free transcription
Is it possible to geotag calls?
Lots of calls are short, is there value in transcribing these calls?
How often do we want to check for calls?
How long are calls stored in openmhz?

CharlotteJackson commented 3 years ago

Not sure where talkgroup is mapped in openmhz
- In the "talk group" key of the API response. We're interested in talk group 101 (dispatch) and 728/729 (EMS 5 and 6).
Need to get an estimate of how many calls we will transcribe, 60 free minutes a month on cloud services
- 500 car crash calls a month, give or take; say each dispatch call is 30 seconds.
Perhaps we can use AWS and Google to get 120 minutes of free transcription
Is it possible to geotag calls?
- Hopefully we can map this data to the Pulsepoint API, which has the geotag, using call time and unit numbers.
Lots of calls are short, is there value in transcribing these calls?
- Probably not. Dispatch is going to be most important.
How often do we want to check for calls?
- Scrape, say, once an hour?
How long are calls stored in openmhz?
- For the past 30 days.
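Plugging the rough numbers from this thread into a quick calculation shows how the estimate compares with the free minutes. The call count, call length, and free-tier figures are the ones quoted in this thread and have not been checked against current provider pricing.

```python
# Figures quoted in this thread (rough estimates, not verified).
CALLS_PER_MONTH = 500        # car crash calls a month, give or take
SECONDS_PER_CALL = 30        # assumed length of a dispatch call
FREE_MINUTES_ONE_CLOUD = 60  # free minutes a month on one cloud service
FREE_MINUTES_TWO_CLOUDS = 120  # combining AWS and Google

minutes_needed = CALLS_PER_MONTH * SECONDS_PER_CALL / 60

print(f"Minutes needed per month: {minutes_needed:.0f}")
print(f"Over one free tier by:  {minutes_needed - FREE_MINUTES_ONE_CLOUD:.0f} min")
print(f"Over two free tiers by: {minutes_needed - FREE_MINUTES_TWO_CLOUDS:.0f} min")
```

At 500 calls of 30 seconds each, that is 250 minutes a month, so even the combined 120 free minutes would cover a little under half of the estimated volume.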

banjtheman commented 3 years ago

Made update, can run the following workflow

Query for all calls at current timestamp for talkgroups 101, 728, or 729
Download matching calls
Upload the calls to an s3 bucket
Run AWS transcribe on calls
Save JSON output of call data
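The upload and transcribe steps of this workflow could look roughly like the sketch below, using boto3. It is an illustration, not the script in the repo: the function names, the bucket argument, and the choice to derive the job name from the file name are assumptions. `MediaFormat` is left out on the assumption that Transcribe can infer it from the file.

```python
import json
import re
import time
import urllib.request
from pathlib import Path


def job_name_for(audio_path):
    """Build a Transcribe job name from the file name.

    Job names may only contain letters, digits, '.', '_' and '-'
    (max 200 characters), so anything else is replaced with '_'.
    """
    stem = Path(audio_path).stem
    return re.sub(r"[^0-9a-zA-Z._-]", "_", stem)[:200]


def transcribe_call(audio_path, bucket, poll_seconds=5):
    """Upload one call to S3, run AWS Transcribe on it, return the text."""
    import boto3  # pip install boto3; AWS credentials must be configured

    s3 = boto3.client("s3")
    transcribe = boto3.client("transcribe")

    # Upload the downloaded call to the bucket.
    key = Path(audio_path).name
    s3.upload_file(str(audio_path), bucket, key)

    # Start the job. Job names must be unique per account, so re-running
    # on the same file would need a different name.
    job_name = job_name_for(audio_path)
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        LanguageCode="en-US",
    )

    # Poll until the job finishes.
    while True:
        job = transcribe.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            break
        time.sleep(poll_seconds)

    if status == "FAILED":
        raise RuntimeError(job.get("FailureReason", "transcription failed"))

    # Fetch the JSON result and pull out the transcript text.
    with urllib.request.urlopen(job["Transcript"]["TranscriptFileUri"]) as resp:
        result = json.load(resp)
    return result["results"]["transcripts"][0]["transcript"]
```

Polling keeps the sketch simple; for an hourly cron job with many calls, starting all jobs first and collecting results afterwards would probably be faster.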

The next steps will be

Create a table for audio data in the database
Set up the script on the dc_crash_bot server
Create a cron job for the script to run
Convert JSON output to match the database schema
See if we can match audio data with Pulsepoint data

banjtheman commented 3 years ago

Here is an example output

  {
    "id": "609c91e7c565b14d6ccb05f3",
    "source": 101,
    "audio_url": "https://s3.us-east-2.wasabisys.com/openmhz/media/dcfd-101-1620873678.m4a",
    "timestamp": "2021-05-13T02:41:18.000Z",
    "call_length": 19,
    "transcribed_audio": "Medical Local 26 respond to L. S. Person down 14th Rhode Island Avenue Northeast offered on channel 0 11. Medical. Local 26 respond to L. S. A. Person down 14 to Rhode Island Avenue Northeast station will be in a black escalade 7 11 parking lot operate on channel 0 11. At 22 41."
  },
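For the "convert JSON output to match the database schema" step, a record shaped like the example above could be stored along these lines. This is only a sketch: the table name `audio_calls` is hypothetical, the columns simply mirror the JSON keys, and SQLite stands in for whatever database the project actually uses.

```python
import sqlite3

# Hypothetical table; columns mirror the keys of the example JSON record.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audio_calls (
    id TEXT PRIMARY KEY,
    source INTEGER,
    audio_url TEXT,
    timestamp TEXT,
    call_length INTEGER,
    transcribed_audio TEXT
)
"""

COLUMNS = ("id", "source", "audio_url", "timestamp", "call_length", "transcribed_audio")


def to_row(record):
    """Turn one JSON record into a tuple in column order."""
    return tuple(record[col] for col in COLUMNS)


def save_records(conn, records):
    """Create the table if needed and insert records, skipping known ids."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO audio_calls VALUES (?, ?, ?, ?, ?, ?)",
        [to_row(r) for r in records],
    )
    conn.commit()
```

Using the call id as the primary key with `INSERT OR IGNORE` means an overlapping scrape would not create duplicate rows.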

CharlotteJackson commented 3 years ago

whooooo hooo we got it running! :)

CharlotteJackson / DC_Crash_Bot

Transcribe audio from openmhz #81
