clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

transcript time sync annotation #5

Open keighrim opened 2 years ago

keighrim commented 2 years ago
  1. We will use cadet.
  2. Annotation will be done at the sentence or pseudo-sentence (fixed number of tokens or time) level.
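
For illustration only, a minimal sketch of what "pseudo-sentence by fixed number of tokens" could look like. The function name `pseudo_sentences` and the default chunk size are hypothetical, not something cadet or this project defines; whitespace tokenization is also an assumption.

```python
def pseudo_sentences(tokens, size=10):
    """Split a token list into consecutive chunks of at most `size` tokens.

    `size` is a hypothetical default; the thread does not fix a number.
    The last chunk may be shorter than `size`.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Usage: naive whitespace tokenization (an assumption), chunks of 3 tokens.
chunks = pseudo_sentences("we will annotate at the pseudo sentence level".split(), size=3)
```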
keighrim commented 11 months ago

I was looking at Gabe's annotation data to evaluate the gentle forced aligner, and found that the text used for the cadet annotation is very different from what we were considering the "gold" transcript text. We may need to completely redo the cadet annotation using the original data as the underlying text, unless someone who actually worked on the annotation process (such as @caseyedavis12 or @gmalexander29) can point us to documentation of what transformations were applied to the text, so that we can reproduce the same data.

A quick and dirty workaround would be to treat the text from the cadet annotation as "gold" and re-run the forced aligner, but then we would end up with two "golds" under one GUID, and our evaluation framework/code is not currently designed to deal with such a situation.

Here's one example of the many discrepancies between the original text, the cadet text, and the text from gentle.

jarumihooi commented 8 months ago

The original name of this issue was: timescript time sync annotation guideline. The expected task was to add a readme and a guideline to this project. That has been done: the readme now contains the guideline. Therefore it's partially fixed by #64.
However, the problem raised in Keigh's comment, that the gold data from annotation and the evaluation gold do not match, is still unresolved. Therefore, this issue will be renamed and remain open. It is linked in the readme.

keighrim commented 7 months ago

One way we can "resolve" this once and for all is to write code that aligns the tokens in the two transcripts and transfers the time stamp annotations onto the original gold transcript files. But I expect writing that piece of code will not be trivial, so I'm not sure it's a worthwhile effort.
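
A rough sketch of the token-alignment idea, not a proposed implementation. It assumes the annotated (cadet) text has already been flattened to `(token, start, end)` triples, with each token inheriting the time stamps of its annotated segment, and that the gold transcript is tokenized the same way. The function name `transfer_timestamps` and the normalization rule are hypothetical. It uses `difflib.SequenceMatcher` from the Python standard library, which may be too slow or too coarse for full-length transcripts.

```python
from difflib import SequenceMatcher


def _norm(token):
    # Hypothetical normalization: lowercase and strip surrounding punctuation,
    # so "Hello," in one transcript can match "hello" in the other.
    return token.lower().strip(".,!?;:\"'")


def transfer_timestamps(annotated, gold_tokens):
    """Copy (start, end) from annotated tokens onto matching gold tokens.

    annotated: list of (token, start, end) triples (assumed input shape).
    gold_tokens: list of token strings from the original gold transcript.
    Returns a list of (gold_token, start, end); start/end are None for
    gold tokens that have no aligned counterpart in the annotated text.
    """
    src = [_norm(tok) for tok, _, _ in annotated]
    dst = [_norm(tok) for tok in gold_tokens]
    result = [(tok, None, None) for tok in gold_tokens]

    # autojunk=False disables the popularity heuristic, which would otherwise
    # ignore very frequent tokens in long sequences.
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = annotated[block.a + k]
            result[block.b + k] = (gold_tokens[block.b + k], start, end)
    return result


# Usage with made-up data: "big" exists only in the gold text.
annotated = [("Hello", 0.0, 0.5), ("world", 0.5, 1.0), ("again", 1.0, 1.5)]
gold = ["Hello,", "big", "world", "again."]
aligned = transfer_timestamps(annotated, gold)
```

Gold tokens left with `None` would still need handling, for example by interpolating from neighboring tokens, and deciding how to do that is probably where most of the non-trivial work lies.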