LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Annotators agreement metrics #201

Closed lucasgautheron closed 3 years ago

lucasgautheron commented 3 years ago

Is your feature request related to a problem? Please describe.

Ideally, the package should provide functions to compute annotator agreement metrics.

Describe the solution you'd like

Computing such a metric usually involves a number of steps (the order can change):

  1. calculating the intersection of annotation sets (we already have a function that does that)
  2. retrieving the segments corresponding to the intersection (we already have a function that does that, except that it does not clip the segments to the desired bounds; another existing function handles the clipping)
  3. aligning the segments (using time_seek)
  4. grouping by unit (recording, session, everything, etc.), which means session_offset should be used at some point for further alignment
  5. selecting and filtering the data (e.g. select speaker_type and remove segments whose speaker is none of MAL, FEM, CHI, OCH)
  6. transforming annotations into a format that the metric function accepts (pyannote Annotation, nltk AnnotationTask, etc.)
  7. computing the metric and returning results at the desired level (recording, session, everything...)
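
To make steps 2 and 5 to 7 more concrete, here is a minimal sketch in plain Python. It is not the package's API: the function names (`clip`, `to_frames`, `cohen_kappa`), the `(onset, offset, speaker)` tuple format and the `SIL` label are all made up for illustration, and timestamps are assumed to be in milliseconds. Instead of going through pyannote or nltk, it cuts the shared window into fixed-size frames and computes Cohen's kappa on the frame labels directly.

```python
from collections import Counter

# Speaker types kept for the comparison (step 5); anything else is ignored.
SPEAKERS = {"MAL", "FEM", "CHI", "OCH"}


def clip(segments, start, end):
    """Clip (onset, offset, speaker) segments to [start, end) (step 2)."""
    clipped = []
    for onset, offset, speaker in segments:
        onset, offset = max(onset, start), min(offset, end)
        if onset < offset:  # drop segments entirely outside the window
            clipped.append((onset, offset, speaker))
    return clipped


def to_frames(segments, start, end, step):
    """Turn segments into one label per frame of `step` ms (steps 5 and 6).

    A frame gets the first known speaker active at its start, else "SIL".
    Overlapping speech is therefore reduced to a single label (assumption).
    """
    labels = []
    for t in range(start, end, step):
        label = "SIL"
        for onset, offset, speaker in segments:
            if onset <= t < offset and speaker in SPEAKERS:
                label = speaker
                break
        labels.append(label)
    return labels


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences (step 7)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: two annotators over a shared window of 0 to 6000 ms.
annotator_1 = [(0, 2000, "CHI"), (2000, 4000, "FEM"), (5000, 9000, "XXX")]
annotator_2 = [(0, 1000, "CHI"), (1000, 4000, "FEM"), (4000, 6000, "MAL")]

frames_1 = to_frames(clip(annotator_1, 0, 6000), 0, 6000, 1000)
frames_2 = to_frames(clip(annotator_2, 0, 6000), 0, 6000, 1000)
kappa = cohen_kappa(frames_1, frames_2)
print(round(kappa, 3))  # 0.357
```

Grouping by recording or session (step 4) would amount to running the same computation once per group; the frame size is a free parameter that trades time resolution against computation.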

We could progressively implement more steps into the package.

Users should be able to replace the package's built-in behavior with custom functions at any step. But maybe we should also provide built-in CLI implementations for the most common use cases.
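
One possible way to let users swap in their own function at any step is sketched below. Everything here is hypothetical (`run_pipeline`, `DEFAULT_STEPS`, the step names, and the toy `duration_ratio` metric, which is only a placeholder and not a real agreement measure); segments are assumed to be dicts with `segment_onset`, `segment_offset` and `speaker_type` keys.

```python
SPEAKERS = {"MAL", "FEM", "CHI", "OCH"}


def keep_known_speakers(segments):
    """Default filtering step: keep only segments from known speaker types."""
    return [s for s in segments if s["speaker_type"] in SPEAKERS]


def duration_ratio(segments_a, segments_b):
    """Placeholder metric: shorter total duration divided by the longer one."""
    def total(segments):
        return sum(s["segment_offset"] - s["segment_onset"] for s in segments)

    longest = max(total(segments_a), total(segments_b))
    if longest == 0:
        return 1.0
    return min(total(segments_a), total(segments_b)) / longest


# Built-in implementation of each step; any of them can be overridden.
DEFAULT_STEPS = {"filter": keep_known_speakers, "metric": duration_ratio}


def run_pipeline(segments_a, segments_b, **overrides):
    """Run the steps in order, using custom functions where provided."""
    unknown = set(overrides) - set(DEFAULT_STEPS)
    if unknown:
        raise ValueError(f"unknown step(s): {sorted(unknown)}")
    steps = {**DEFAULT_STEPS, **overrides}
    return steps["metric"](steps["filter"](segments_a), steps["filter"](segments_b))


a = [
    {"segment_onset": 0, "segment_offset": 1000, "speaker_type": "CHI"},
    {"segment_onset": 1000, "segment_offset": 3000, "speaker_type": "NA"},
]
b = [{"segment_onset": 0, "segment_offset": 2000, "speaker_type": "CHI"}]

default_result = run_pipeline(a, b)
# Custom filtering step that keeps every segment, whatever the speaker type.
custom_result = run_pipeline(a, b, filter=lambda segments: segments)
```

With this layout, adding a new step only requires a new entry in the defaults and one more call in `run_pipeline`, and a CLI could simply call `run_pipeline` with the defaults.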

So the implementation should make it easy to progressively add features.