jamespstrachan closed 4 years ago
Quite possible. Is it worth doing this after the new standard transcription format is settled? I imagine something like:
(1@00:00) Jack and Jill went up the hill
(2@00:05) To fetch a pail of water
(3@00:09) ...
Where the time codes are embedded in the paragraph markers. @codykingham might that work as part of your standard?
Yes, after the standard transcription sounds best. The time codes could be embedded in the paragraph numbers, but the researchers who create the texts should ensure that the paragraphs are several lines long, so that the labour of aligning each paragraph is reduced.
Technically these would be line indicators, not paragraphs. I think this depends on how the time code annotations will be made. There are some options out there for automatic word-to-word alignment of texts (e.g. here). These tools could align transcriptions at the word level. In that case, I would suggest we store these as features of words in Text-Fabric. Then for any given word, you can easily retrieve its timestamp.
If, however, the time codes will be made by the researcher at the time of upload, then it indeed makes sense to put this in the upload template somehow. @jamespstrachan , I do like your proposed format.
Great if we can use an existing standard, though the granularity offered by the automatic tools would be labour-intensive for a human to replicate. This might limit how easily contributors could submit well-time-annotated text unless we can easily/automatically run submitted audio+transcript through the auto-alignment software. @codykingham do you have any sense of how easy this is to do, its fault tolerance, and how well it fits into the 'production line' you're considering for processing Nena audio->word-transcript?->.nena format->time-annotated-format->text-fabric?
I need to experiment. @hvlaardingerbroek did a test with a similar tool (not sure if it's the same as the linked one), and it produced fairly accurate results on NENA audio without any tweaking. But corrections would still be needed to the output. I'm leaning more and more towards the timestamp indicator in the line number. That is simpler and less dependent on the coding side.
After some more thought, I propose we go with @jamespstrachan 's proposal. In the plain text format, then, line numbers can optionally be composed of two parts, a number and a timestamp:
(1@0:02) Some line here
(2@0:05) some other line.
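As a rough illustration of the proposed format, here is a minimal Python sketch of how such a line marker could be parsed. It is not the actual implementation: the names `MARKER`, `Line` and `parse_line` are hypothetical, and it assumes the timestamp is written as minutes:seconds and that the `@timestamp` part is optional.

```python
import re
from typing import NamedTuple, Optional

# Assumed marker shape: "(<number>)" or "(<number>@<minutes>:<seconds>)",
# followed by the text of the line. This pattern is an illustration only.
MARKER = re.compile(r"^\((\d+)(?:@(\d+):([0-5]\d))?\)\s*(.*)$")


class Line(NamedTuple):
    number: int             # the line number before the "@"
    seconds: Optional[int]  # offset into the audio, None if no timestamp
    text: str               # the transcribed line itself


def parse_line(raw: str) -> Line:
    """Split one transcript line into its number, time offset and text."""
    match = MARKER.match(raw.strip())
    if match is None:
        raise ValueError(f"not a numbered line: {raw!r}")
    number, minutes, secs, text = match.groups()
    # Convert minutes:seconds to a single offset in seconds when present.
    seconds = None if minutes is None else int(minutes) * 60 + int(secs)
    return Line(int(number), seconds, text)


if __name__ == "__main__":
    for raw in ["(1@0:02) Some line here", "(2) A line without a time code"]:
        print(parse_line(raw))
```

Under these assumptions, a line with no timestamp still parses (with `seconds` set to `None`), while a marker whose timestamp does not match the minutes:seconds shape raises a `ValueError` rather than being silently accepted.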
I've had a shot at implementing this - it's a little rough but it works. It's live now on the staging instance. I have tried to fill in some time codes for this one:
https://nena-staging.ames.cam.ac.uk/audio/30/
See what you think, try adding some time codes yourself. Don't worry, this is on a separate database so anything you mess up here will not affect production. (conversely, doing lots of useful data entry here will be time wasted!)
Thanks, James. This looks great.
@jamespstrachan Great work! This is a very nice demo. I find it quite easy to follow along when it is set up this way. This is the simplest option, I think, with the fewest dependencies. Better than automated alignments.
Now in production
from @GeoffreyKhan :