clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

add `process.py` for SR project #75

Closed keighrim closed 3 months ago

keighrim commented 10 months ago

using the new gold column naming convention.

keighrim commented 9 months ago

Related to https://github.com/clamsproject/app-swt-detection/issues/41, I had a brief discussion with @marcverhagen, and we need to decide what the format of the gold files for SR annotations should be. Concretely, the first thing to decide is whether the gold is time (interval)-based, image-based, or both.

In case we want to keep both representations in the gold format: we've been using csv files with start and end columns in past SR-like projects (slates, chyrons), and I can't think of an easy way to keep that csv format (so that other eval.py files can be reused) while also storing image-level annotations in it as additional columns. Also, this repo is designed to allow only one format for golds, so if we can't find a single format that holds both levels of representation and have to generate two formats, we might need to reconsider that decision as well.

keighrim commented 4 months ago

Given the way we restructured the SWT app to keep image annotations (TimePoint annotations), I think we can keep only an image-based "gold" set for the SR project.

So the output format can be a csv for each cpb-.... ID,

# cpb-xxx-yyyyy
timepoint,label 
t1,B
t2,SH
...

For all the "seen" timepoints in the raw data.
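A minimal sketch of that per-GUID output, assuming the rows are already available as (timepoint, label) pairs. The function name `format_gold_csv` is hypothetical, and treating the `# cpb-...` line as a comment inside the file (rather than just the file's name) is an assumption based on the example above.

```python
import csv
import io


def format_gold_csv(guid, rows):
    """Render (timepoint, label) pairs in the proposed per-GUID csv layout.

    Hypothetical helper: writing the GUID as a leading comment line is an
    assumption; it could just as well be the file name only.
    """
    buf = io.StringIO()
    buf.write(f"# {guid}\n")
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timepoint", "label"])
    for timepoint, label in rows:
        writer.writerow([timepoint, label])
    return buf.getvalue()


if __name__ == "__main__":
    print(format_gold_csv("cpb-xxx-yyyyy", [("t1", "B"), ("t2", "SH")]))
```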

keighrim commented 4 months ago

On second look, since the "raw" portion of the annotation data is already organized by GUID, we probably don't need to introduce a new format for gold, and can instead just copy the raw files into the gold dir.

keighrim commented 4 months ago

Looking at the files a third time, it looks like we can actually benefit from altering the columns a bit. Specifically, given this "raw" format

filename                                     seen  type label  subtype label  modifier  transcript  note
cpb-aacip-0acac5e9db7_01824989_00000000.jpg  true  B                          false
...
cpb-aacip-0acac5e9db7_01824989_00082015.jpg  true  S           H              false
...
  1. rename the first column to timepoint (or timestamp), and keep only the last part of the jpg file name (I believe the number is in milliseconds, so it might need to be re-formatted based on our time unit convention for gold data: https://github.com/clamsproject/aapb-annotations/blob/main/repository_level_conventions.md)
  2. keep the "total" duration (the second piece of the jpg file name) as a separate column
  3. remove rows that are not "seen"
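The three steps above can be sketched roughly as follows. This is only an illustration, not the actual `process.py`: the function name `raw_to_gold_rows`, the `total` column name, and the comma delimiter are assumptions. The numbers are kept as plain integers (presumably milliseconds), so any re-formatting to the repository's time unit convention is left out.

```python
import csv
import io


def raw_to_gold_rows(raw_text, delimiter=","):
    """Convert raw SR annotation rows into gold-style rows (hypothetical sketch).

    Assumes file names look like <guid>_<total>_<timepoint>.jpg and that
    the "seen" column holds the strings "true" / "false".
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    gold = []
    for row in reader:
        # Step 3: drop rows the annotator never saw.
        if row["seen"].strip().lower() != "true":
            continue
        stem = row["filename"].rsplit(".", 1)[0]
        # The GUID itself may contain "-", so split on "_" from the right.
        _guid, total, timepoint = stem.rsplit("_", 2)
        # Steps 1 and 2: timepoint replaces the file name, total gets its own column.
        out = {"timepoint": int(timepoint), "total": int(total)}
        for key, value in row.items():
            if key != "filename":
                out[key] = value
        gold.append(out)
    return gold
```

Since the raw data is already organized by GUID, the GUID parsed from the file name is discarded here; the output is expected to live in a per-GUID file.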