clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

handling dually annotated data #71

[Open] keighrim opened this issue 7 months ago

keighrim commented 7 months ago

Because

(I'm using the term dual annotation to indicate manual annotation done redundantly by more than one annotator.)


So far, all the annotation projects we've worked on had single annotation. Based on that fact, we designed the workflow for processing annotation data (raw >> gold, organization under batches and dates, etc.) without any consideration of redundant annotation.

However, in the latest annotation effort - RFB - we started dual annotation, at least for a subset of the whole dataset. I think it's now time to discuss how we want to host dual annotations and the adjudicated single-set "raw" data in this public repo. Concretely,

  1. We need fixed terms to indicate
    1. raw manual annotation (currently called raw, hereinafter "raw")
    2. adjudicated "gold" annotation (currently no such thing, hereinafter "gold")
    3. machine-ready "public" annotation (currently called gold, hereinafter "release")
  2. Do we want to host both "raw" and "gold", or "gold" only?
  3. How do we publish the adjudication process, if any? I can imagine all-manual adjudication and code-assisted adjudication. In the latter case, should we consider special handling of the adjudication code, just like process.py?
  4. Where should the IAA calculation results be reported? In the README, or in a separate file/directory?

And maybe more questions.

Starting this issue to discuss details. Any input is welcome!

Done when

We set a guideline or template for handling

  1. dual "raw" annotation files
  2. IAA reports
  3. documentation of adjudication process


jarumihooi commented 7 months ago

I'm trying to understand the plan and the idea behind the data organization. Please correct me if I've assumed anything incorrectly.

  1. What is the difference between golds and release? My assumptions are these: We can have multiple raws from different annotators.
    • Raws are stored in directories named by completion date and issue number (datedone-issue-number). We could add an annotator identifier if we like, e.g. annotatorA; I think we definitely should, just to better track these.

If I understand this correctly, golds are meant to be used for evaluation/training/etc., i.e., to be machine-consumable. If we use a set for something, it is permalinked to a commit that contains the golds set used for that machine usage. Even if we were to use a raw or otherwise non-adjudicated set for evaluation (which we are doing with the RFB modeling), the linkage to the data exists when that usage is looked up. As we produce better, more adjudicated golds data, it would make sense to update the main-branch golds to reflect that.

Therefore, shouldn't the golds always be the best, most adjudicated/combined, publicly releasable version of the data that we have? (In other words: I don't think there's a need for a new release category of data.)

  2. Therefore, it seems like we should still have only raws and golds, and the main-branch golds should be the most up-to-date (ergo, the release).

If the intent of this repo is to show later users how to create their own data, and to provide accountability for how this project was done, it makes sense to keep the sets of raws; updated versions of the golds can be saved vertically via version control.

  3. We should see how automated or manual the adjudication process is. Currently, it is a step that happens before process.py -> golds, or at the same time. We should publish what was done in it; I think the process.py section of the README is a reasonable place to put it.

  4. Results should likely go in a separate file, linked in the README, possibly with a tl;dr of the summary results (see the sketch below).
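For instance, generating such a separate IAA file could be as small as the following sketch. It assumes two annotators' labels are aligned row-by-row in CSV files with a "label" column; the file names and column names are purely illustrative, not this repo's actual formats.

```python
# A minimal sketch of writing an IAA report to a separate file.
# All file/column names here are hypothetical.
import csv

from sklearn.metrics import cohen_kappa_score


def read_labels(csv_path):
    """Read the 'label' column from one annotator's CSV."""
    with open(csv_path, newline="") as f:
        return [row["label"] for row in csv.DictReader(f)]


def write_iaa_report(path_a, path_b, out_path="iaa-report.md"):
    """Compute Cohen's kappa between two annotators and write a short report."""
    a, b = read_labels(path_a), read_labels(path_b)
    kappa = cohen_kappa_score(a, b)
    with open(out_path, "w") as out:
        out.write("# IAA report\n\n")
        out.write(f"- items compared: {len(a)}\n")
        out.write(f"- Cohen's kappa: {kappa:.3f}\n")


if __name__ == "__main__":
    write_iaa_report("annotatorA.csv", "annotatorB.csv")
```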

keighrim commented 7 months ago

> What is the difference between golds and release? My assumptions are these: We can have multiple raws from different annotators.

I share the same understanding. The difference between golds and release essentially lies in the fact that, in most cases, adjudication of dually/triply annotated data (namely, dealing with 2 or 3 "raw" sets) will be a very manual process (someone has to look at the "diff" between those sets and decide which one is best). When the number of annotators is high enough (say, over 5 or so), we can mechanically count "votes" and take the majority.
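As an illustration of that mechanical route, here is a minimal sketch of majority-vote adjudication over N aligned "raw" label sequences. It assumes the annotations are already aligned item-by-item; all names are hypothetical, not from the actual RFB pipeline.

```python
# A minimal sketch of "count votes, take the majority" adjudication.
from collections import Counter


def adjudicate_by_vote(label_sets):
    """Merge N aligned raw label sequences into one adjudicated sequence.

    Returns (adjudicated, tie_indices); tied items still need a human eye.
    """
    adjudicated, ties = [], []
    for i, votes in enumerate(zip(*label_sets)):
        (top, count), *rest = Counter(votes).most_common()
        if rest and rest[0][1] == count:  # no strict majority -> flag as tie
            ties.append(i)
        adjudicated.append(top)
    return adjudicated, ties


# e.g. with 5 annotators over 3 items:
gold, needs_review = adjudicate_by_vote([
    ["a", "b", "a"], ["a", "b", "b"], ["a", "a", "b"],
    ["b", "b", "b"], ["a", "b", "a"],
])
# gold == ["a", "b", "b"]; needs_review == [] (strict majority everywhere)
```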

On the other hand, creating the "release" set from the "gold" set is usually just re-formatting of internal structure (not contents), and is thus done via code (process.py).
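For contrast, that gold -> release step could look like the following sketch: pure re-shaping of structure, with no content decisions. The column names and file layout are assumptions for illustration, not the actual formats process.py handles.

```python
# A minimal sketch of a process.py-style reformat; all field names are hypothetical.
import csv


def reformat(gold_csv, release_csv):
    """Re-shape an internal gold CSV into a machine-ready release CSV."""
    with open(gold_csv, newline="") as src, open(release_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["guid", "start", "end", "label"])  # normalized header
        for row in csv.DictReader(src):
            # contents pass through unchanged; only field names/order differ
            writer.writerow([row["GUID"], row["start-ms"], row["end-ms"], row["label"]])
```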

> We should see how automated or manual the adjudication process is. Currently, it is a step that happens before process.py -> golds, or at the same time.

So this is tricky, and the same realization led me to my question #3. In other words, can we cleanly distinguish "manual adjudication" work from "automatic reformat" work? Adjudication can be fully manual (decided by a third human eye), fully automatic (just counting votes), or mixed. process.py is automatic. When the adjudication involves manual work, there will be an intermediate set of data that's passed to process.py, and that is what I called the new gold. If there's no such intermediate manual step, either because 1. there are no duplicate annotations, or 2. adjudication is fully automated, then I guess we can happily keep one process.py, with raw and release=gold.

jarumihooi commented 7 months ago

I think we are approaching a similar structure, with slightly different wording.

It seems like there is now another step of processing between raws and golds:

  1. We start with multiple raws annotated by different annotators uploaded to this repository.
  2. [New step] We then (possibly manually) adjudicate the multiple raws -> 1 adjud-raw set.
  3. Then the automatic process.py reformats adjud-raws -> golds.
  4. Finally, we end up with the most up-to-date golds, same as before.

Version control can keep older versions of adjud-raws and golds. For current ease, we should save the adjud-raws in the repo as well, so that all the data can be stored here; a possible layout is sketched below.
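To make that concrete, a purely hypothetical directory layout for one annotation batch might look like this; none of these names come from the repo's actual conventions:

```
<batch-name>/
├── raw/                 # step 1: one subdirectory per annotator
│   ├── annotatorA/
│   └── annotatorB/
├── adjud-raw/           # step 2: the single adjudicated set (possibly manual)
├── process.py           # step 3: automatic reformat, adjud-raw -> golds
└── golds/               # step 4: the most up-to-date golds
```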

The confusion seems to lie in why we would call the adjudicated raws the new golds, and then call the new golds the release. If we ever do adjudication on other projects, we'd have to change all the other specifications and wording to fit. Right now, we use golds as input to machines/models. At the end of this new procedure, the final output is still exactly that: one adjudicated dataset ready to be fed into machines/models. Thus, it seems like we have no need for the term release for now; is that correct?

I would guess that we should be able to distinguish them. Regardless of whether the adjudication is manual, automatic, or a mix of both, we should document how it is done and by what procedure in the project READMEs. It may be a good idea to place code related to adjudication (if any is needed) in its own new folder within the project subdirectory, to avoid confusing it with code related to process.py.

keighrim commented 7 months ago

> The confusion seems to lie in why we would call the adjudicated raws the new golds, and then call the new golds the release. ... Thus, it seems like we have no need for the term release for now; is that correct?

I wasn't implying that we need to change the terminology we use in directory names. I just wanted distinct names for the different stages of annotation data to use in this discussion.

> We start with multiple raws annotated by different annotators uploaded to this repository.

I don't expect individual annotators to be git-fluent enough to know how to upload files to a GitHub repo using a specific branch, so the upload will probably be done by the project manager. Given that, it's worth going back to one of my original questions:

> Do we want to host both "raw" and "gold", or "gold" only?

In my opinion, if there's an adjudication step involved, I don't think we need to keep the raw raws (before adjudication) in this repo; we can keep them just internally (in whatever location the annotators uploaded them to during their work), or in a separate repo where the annotation and adjudication code lives (probably one of the aapb-annenv- repos). In that way,

  1. we can keep all the relevant code/manuals used up to the point where the raw set is created in a single place; this can include
    1. annotation tool/environment
    2. adjudication tool/environment/code
    3. IAA calculation code
  2. we keep the aapb-annotations repo's primary purpose as "releasing" annotation data in a machine-consumable format.
  3. we keep the simpler two-stage release model (raw and gold) with a single release script (process.py).

What do others think?

keighrim commented 6 months ago

from https://github.com/clamsproject/aapb-annotations/issues/35#issuecomment-1868132327 (@jarumihooi)

> A question about where to place the IAA code: what conceptually drives the separation of the different annenv tools vs. this as the dataset repository?

So I initially suggested a 3-tier data release via this repo (raw > gold > release), but it seems that we now all agree to keep the current 2-tier model (raw > release). With that, I don't think there is a neat place to fit raws-before-adjudication (the multi-set data) in this repo, and that's why I'm proposing to use the annenv repos to hold raws before adjudication (together with the related adjudication components, including IAA calculation), and to use this repo for the adjudicated, single-set raw dataset.

keighrim commented 5 months ago

Any other thoughts, or suggestions?