CoEDL / elpis

🙊 software for creating speech recognition models.
https://elpis.readthedocs.io/en/latest/
Apache License 2.0

Support non-identical file names between .wav and .eaf, and recognise media offsets #215

Open mattchrlw opened 3 years ago

mattchrlw commented 3 years ago

Resolves #191, #193.

This implementation doesn't let the user choose between matching the audio by the file name of the corresponding .eaf file and taking it from RELATIVE_MEDIA_URL. It defaults to the former behaviour and falls back to the latter.
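The lookup order described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `find_audio_for_eaf` and the `audio_dir` argument are hypothetical, and it assumes the standard ELAN layout where MEDIA_DESCRIPTOR sits under HEADER.

```python
import xml.etree.ElementTree as ET
from pathlib import Path, PurePosixPath
from typing import Optional


def find_audio_for_eaf(eaf_path: Path, audio_dir: Path) -> Optional[Path]:
    """Return the .wav paired with an .eaf file, or None if none is found.

    Hypothetical helper: tries the same file name first, then falls back
    to the file name recorded in RELATIVE_MEDIA_URL.
    """
    # Default behaviour: a .wav with the same stem as the .eaf file.
    same_name = audio_dir / (eaf_path.stem + ".wav")
    if same_name.is_file():
        return same_name

    # Fallback: read RELATIVE_MEDIA_URL from the .eaf header.
    descriptor = ET.parse(eaf_path).getroot().find("HEADER/MEDIA_DESCRIPTOR")
    if descriptor is None:
        return None
    relative_url = descriptor.get("RELATIVE_MEDIA_URL", "")
    if not relative_url:
        return None

    # Keep only the file name; assumes forward-slash separators.
    candidate = audio_dir / PurePosixPath(relative_url).name
    return candidate if candidate.is_file() else None
```

Only the first MEDIA_DESCRIPTOR is consulted here; an .eaf that links several media files would need a loop over `findall` instead.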

This implementation also ignores MEDIA_URL, because it is difficult to convert an absolute URL (e.g. "file:///Users/bbb/Desktop/abui/abui-audio-1.wav") into a format that the rest of the application can handle easily. In other words, it assumes that RELATIVE_MEDIA_URL is well formed.
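For reference, if MEDIA_URL support were added later, one possible way to reduce such a URL to a bare file name uses only the standard library. This is a sketch under that assumption, not something this PR does; `media_url_to_file_name` is a hypothetical name.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def media_url_to_file_name(media_url: str) -> str:
    """Return the file-name component of a file:// style MEDIA_URL."""
    # urlparse separates the scheme from the path; unquote decodes %20 etc.
    path = unquote(urlparse(media_url).path)
    return PurePosixPath(path).name
```

Whether the resulting name actually exists among the uploaded files would still need to be checked separately.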

This also fixes any "IndexError: list index out of range" errors raised at line = wer_lines[0] that may have been happening before, although please double-check that they are actually fixed.

Offsets are read from the .eaf file and converted directly with int().
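As a rough illustration of reading an offset, the sketch below assumes the offset is stored in the TIME_ORIGIN attribute of MEDIA_DESCRIPTOR (in milliseconds), which is where ELAN normally records it; the attribute the PR actually reads is not stated here. Unlike a bare int() call, this hypothetical `read_media_offset_ms` falls back to 0 when the attribute is missing or malformed.

```python
import xml.etree.ElementTree as ET


def read_media_offset_ms(eaf_xml: str) -> int:
    """Return the media offset in milliseconds, or 0 if none is usable."""
    descriptor = ET.fromstring(eaf_xml).find("HEADER/MEDIA_DESCRIPTOR")
    if descriptor is None:
        return 0
    raw = descriptor.get("TIME_ORIGIN")  # assumed attribute name
    if raw is None:
        return 0
    try:
        return int(raw)
    except ValueError:
        # Malformed value: treat as no offset rather than crashing.
        return 0
```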

mattchrlw commented 3 years ago

I think a big part of the original tickets was implementing a UI feature that would highlight, in particular, when audio or .eaf files are uploaded without the corresponding .eaf or audio file. The easiest way I can envisage to accomplish this is to align the audio files horizontally in the UI with their transcriptions, which would make it obvious when a pair is missing either component (you could also highlight rows with a missing file, or something to that effect). The unfortunate downside is that you'll have to replicate the verification on the front end (this might help: https://www.npmjs.com/package/elan-parser ).

Ah okay, this might be a bit more work to do on the uploading side of things, as it won't just be a file drop anymore. But I can look into it 👌
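The row-alignment idea discussed above amounts to grouping uploads into (audio, transcription) pairs and flagging rows with a gap. A minimal sketch of that grouping, written in Python for consistency with the back end rather than as front-end code, and covering only the same-file-name case (the RELATIVE_MEDIA_URL fallback would additionally require parsing each .eaf); `pair_uploads` is a hypothetical name:

```python
from pathlib import PurePath
from typing import Dict, Iterable, List, Optional, Tuple

Row = Tuple[str, Optional[str], Optional[str]]


def pair_uploads(file_names: Iterable[str]) -> List[Row]:
    """Group uploads by stem into (stem, wav_name, eaf_name) rows."""
    rows: Dict[str, Dict[str, str]] = {}
    for name in file_names:
        path = PurePath(name)
        suffix = path.suffix.lower()
        # Ignore anything that is neither audio nor transcription.
        if suffix in (".wav", ".eaf"):
            rows.setdefault(path.stem, {})[suffix] = name
    return [
        (stem, found.get(".wav"), found.get(".eaf"))
        for stem, found in sorted(rows.items())
    ]
```

Any row containing None is missing one half of the pair and is a candidate for highlighting in the UI.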