CharlotteJackson / DC_Crash_Bot


Transcribe audio from openmhz #81

Closed by banjtheman 3 years ago

banjtheman commented 3 years ago

What is the Task

We want to be able to transcribe audio files from openmhz

Why do we want to do this

In order to capture radio data

How can I get started?

TODO

Definition of Done

Transcribed audio data is stored in the database

banjtheman commented 3 years ago

Current Code

Have created a script (https://github.com/CharlotteJackson/DC_Crash_Bot/blob/audio_transcirbe/scripts/transcribe_audio.py) that does the following...

Open Questions:

  1. Not sure where talkgroup is mapped in openmhz
  2. Need an estimate of how many calls we will transcribe; cloud services offer 60 free minutes a month
    • Perhaps we can use AWS and Google to get 120 minutes of free transcription
  3. Is it possible to geotag calls?
  4. Lots of calls are short; is there value in transcribing these calls?
  5. How often do we want to check for calls?
  6. How long are calls stored in openmhz?

CharlotteJackson commented 3 years ago

  1. Not sure where talkgroup is mapped in openmhz
    • In the "talk group" key of the API response. We're interested in talk group 101 (dispatch) and 728/729 (EMS 5 and 6).
  2. Need to get an estimate of how many calls we will transcribe, 60 free minutes a month on cloud services
    • About 500 car crash calls a month, give or take; say each dispatch call is 30 seconds.
    • Perhaps we can use AWS and Google to get 120 minutes of free transcription
  3. Is it possible to geotag calls?
    • Hopefully we can map this data to the Pulsepoint API, which has the geotag, using call time and unit numbers.
  4. Lots of calls are short, is there value in transcribing these calls?
    • Probably not. Dispatch is going to be most important.
  5. How often do we want to check for calls?
    • Scrape, say, once an hour?
  6. How long are calls stored in openmhz?
    • For the past 30 days.
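
As a rough check on the numbers in this reply, the sketch below filters calls to the talk groups of interest and compares the estimated monthly audio against the free transcription allowance. The field name `talkgroup` is an assumption (the real key in the API response should be checked), and the call volume and call length are the estimates quoted above, not measurements.

```python
# Talk groups named in the reply: 101 (dispatch), 728/729 (EMS 5 and 6).
WANTED_TALKGROUPS = {101, 728, 729}

# Rough figures quoted in the reply; none of these are measured values.
CALLS_PER_MONTH = 500
SECONDS_PER_CALL = 30
FREE_MINUTES_PER_SERVICE = 60


def is_wanted(call: dict, key: str = "talkgroup") -> bool:
    """Return True if the call belongs to a talk group of interest.

    `key` is a hypothetical field name; check the real API response.
    """
    return call.get(key) in WANTED_TALKGROUPS


def monthly_minutes(calls: int, seconds_per_call: int) -> float:
    """Total minutes of audio per month for the given call volume."""
    return calls * seconds_per_call / 60


needed = monthly_minutes(CALLS_PER_MONTH, SECONDS_PER_CALL)
for services in (1, 2):
    free = services * FREE_MINUTES_PER_SERVICE
    print(f"{services} free tier(s): {free} min free, {needed - free:.0f} min over")
```

Under these estimates the dispatch calls alone come to about 250 minutes a month, so even two 60-minute free tiers would not cover them.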

banjtheman commented 3 years ago

Made update, can run the following workflow

The next steps will be

  1. Create a table for audio data in the database
  2. Set up the script on the dc_crash_bot server
  3. Create a cron job for the script to run
  4. Convert JSON output to match the database schema
  5. See if we can match audio data with Pulsepoint data
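
Steps 1 and 4 could look roughly like the sketch below. It is only an illustration: the table name `audio_calls` and its columns are hypothetical (they copy the keys of the script's JSON output), and `sqlite3` is used purely so the example is self-contained, not because the project database is SQLite.

```python
import sqlite3

# Hypothetical schema; column names copy the keys of the script's JSON output.
COLUMNS = ("id", "source", "audio_url", "timestamp", "call_length", "transcribed_audio")

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS audio_calls (
    id TEXT PRIMARY KEY,
    source INTEGER,
    audio_url TEXT,
    timestamp TEXT,
    call_length INTEGER,
    transcribed_audio TEXT
)
"""


def record_to_row(record: dict) -> tuple:
    """Flatten one JSON record into a tuple ordered like COLUMNS."""
    return tuple(record.get(col) for col in COLUMNS)


def store_records(conn: sqlite3.Connection, records: list) -> int:
    """Insert records, skipping ids already stored; return the table's row count."""
    conn.execute(CREATE_TABLE)
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT OR IGNORE INTO audio_calls ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        [record_to_row(r) for r in records],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM audio_calls").fetchone()[0]


conn = sqlite3.connect(":memory:")
sample = {"id": "609c91e7c565b14d6ccb05f3", "source": 101, "call_length": 19}
print(store_records(conn, [sample]))
print(store_records(conn, [sample]))  # same id again: no duplicate row
```

Keying on the call `id` means a cron job that re-fetches overlapping calls would not store the same call twice.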

banjtheman commented 3 years ago

Here is an example output

  {
    "id": "609c91e7c565b14d6ccb05f3",
    "source": 101,
    "audio_url": "https://s3.us-east-2.wasabisys.com/openmhz/media/dcfd-101-1620873678.m4a",
    "timestamp": "2021-05-13T02:41:18.000Z",
    "call_length": 19,
    "transcribed_audio": "Medical Local 26 respond to L. S. Person down 14th Rhode Island Avenue Northeast offered on channel 0 11. Medical. Local 26 respond to L. S. A. Person down 14 to Rhode Island Avenue Northeast station will be in a black escalade 7 11 parking lot operate on channel 0 11. At 22 41."
  },
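
Since the plan is to match calls to Pulsepoint by call time, it may help to convert the UTC `timestamp` to local time first. The sketch below does that for the record above. It assumes a fixed UTC-4 offset (Eastern Daylight Time), which only holds for dates in daylight saving time; a real implementation would likely use a proper time zone database. The `within_window` helper and its 120-second tolerance are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-4 offset (EDT). Valid for this May 2021 record only
# because daylight saving time was in effect; not correct year-round.
EDT = timezone(timedelta(hours=-4), "EDT")


def parse_call_time(timestamp: str) -> datetime:
    """Parse a timestamp like '2021-05-13T02:41:18.000Z' as an aware UTC datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def within_window(call_time: datetime, other_time: datetime, seconds: int = 120) -> bool:
    """Hypothetical matching rule: True if the two times are within `seconds`."""
    return abs((call_time - other_time).total_seconds()) <= seconds


call_time = parse_call_time("2021-05-13T02:41:18.000Z")
print(call_time.astimezone(EDT).strftime("%Y-%m-%d %H:%M"))  # 2021-05-12 22:41
```

The local time of 22:41 agrees with the "At 22 41" the dispatcher says at the end of the transcription.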

CharlotteJackson commented 3 years ago

whooooo hooo we got it running! :)