Open keighrim opened 7 months ago
Trying to understand what might be the plan and the idea behind the data organization. Please clarify if I have assumed incorrectly.
`datedone-issue-number`. We can add an annotator identifier if we like, `annotatorA`.

I think we definitely should, just to better track these. If I understand this correctly, golds are meant to be machine-consumable and used for evaluation/training/etc. If we use a set for something, it is permalinked to a commit that contains the gold set used for that machine usage. Even if we use a raw or non-adjudicated set for eval (which we are doing with the RFB modeling), the linkage to that data use exists when the usage is looked up. As we update to better, more adjudicated gold data, it would make sense to then update the main-branch golds to reflect that.
Therefore, shouldn't the golds always be basically the best, publicly-releasable, most adjudicated/combined version of the data that we have? (In other words: I don't think there's a need for a new category of release data.)
If the intent of this repo is to show later users how to create their own data, and to provide accountability for how this project was done, it makes sense to keep the sets of raws; updated versions of the golds can be preserved through version control history.
We should see how automated/manual the adjudication process is. Currently, it is a step that happens before `process.py` -> golds, or at the same time. We should publish what happens during it. I think the `process.py` section of the readme is a reasonable place to put it.
Results should likely be a separate file, linked in the readme, possibly with a tl;dr of the summary results.
> What is the difference between golds and release? My assumptions are these: We can have multiple raws from different annotators.
I share the same understanding. The difference between golds and release essentially lies in the fact that in most cases, adjudication of dually/triply annotated data (namely, dealing with 2 or 3 "raw" sets) will be a very manual process (someone has to see the "diff" between those sets and decide which one is best). When the number of annotators is high enough (say over 5 or so), then we can mechanically count "votes" and take the majority.
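The mechanical vote-counting could look something like the sketch below. The data shape (one `{item_id: label}` dict per annotator) and the function name are assumptions for illustration, not the repo's actual format:

```python
from collections import Counter

def adjudicate_by_vote(raw_sets):
    """Merge several annotators' label sets by majority vote.

    raw_sets: one {item_id: label} dict per annotator (assumed shape).
    Items without a strict majority are set to None so a human
    adjudicator can resolve them manually.
    """
    adjudicated = {}
    all_ids = set().union(*(s.keys() for s in raw_sets))
    for item_id in sorted(all_ids):
        votes = Counter(s[item_id] for s in raw_sets if item_id in s)
        top_label, top_count = votes.most_common(1)[0]
        # a strict majority wins; ties and pluralities go to manual review
        adjudicated[item_id] = top_label if top_count > sum(votes.values()) / 2 else None
    return adjudicated
```

With three annotators, an item labeled `x, x, y` resolves to `x`, while `x, y, z` stays `None` for manual adjudication.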
On the other side, creating a "release" set from a "gold" set is usually just re-formatting of internal structure (not contents), and is thus done via code (`process.py`).
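Such a structure-only conversion can be very small. Below is a sketch of what a `process.py`-style step might look like; the file layout and column names (`guid`, `label`) are made up for illustration:

```python
import csv
import json

def gold_to_release(gold_tsv, release_json):
    """Re-shape a gold TSV into a machine-consumable JSON file.

    Only the structure changes; the annotation contents pass through
    untouched. Column names here are illustrative, not the repo's schema.
    """
    with open(gold_tsv, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    with open(release_json, "w") as f:
        json.dump({row["guid"]: row["label"] for row in rows}, f, indent=2)
```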
> We should see how automated/manual the adjudication process is. Currently, it is a step that happens before `process.py` -> golds, or at the same time.
So this is tricky, and the same realization led me to my question #3. In other words, can we deliberately distinguish "manual adjudication" work from "automatic reformat" work? Adjudication can be fully manual (decision by a third human eye), fully automatic (just vote counting), or mixed. `process.py` is automatic. When the adjudication involves manual work, there will be an intermediate set of data that's passed to `process.py`, and I called that a new `gold`. If there's no such intermediate manual step, either because 1. there are no duplicate annotations, or 2. adjudication is fully automated, then I guess we can happily have one `process.py`, one `raw`, and `release` = `gold`.
I think we are approaching a similar structure, with slightly different wording.
It seems like there is now another step of processing between `raws` and `golds`:

1. `raws` annotated by different annotators are uploaded to this repository.
2. `raws` -> `adjud-raws`.
3. `adjud-raws` -> `golds`.
4. `golds`, same as before.

Version control can keep older versions of `adjud-raws` and `golds`. For current ease, we should save the `adjud-raws` in the repo as well, so that all data can be stored here.
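Under that scheme, a project subdirectory might be laid out roughly as follows. All names below are illustrative placeholders, not an agreed convention:

```
some-project/
├── raws/          # one set per annotator, as uploaded
│   ├── <batch>.annotatorA.csv
│   └── <batch>.annotatorB.csv
├── adjud-raws/    # single set after manual/automatic adjudication
│   └── <batch>.csv
├── golds/         # machine-consumable output of process.py, as before
│   └── <batch>.csv
└── process.py
```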
The confusion seems to lie in why we would call the adjudicated `raws` the new `golds`, and then call the new `golds` the `release`. If we ever do adjudication on other projects, we'd have to change all the other specifications and wording to fit. Right now, we seem to use `golds` as input to machines/models. At the end of this new procedure, the final output is still exactly that: one adjudicated dataset ready to be fed into machines/models. Thus, it seems like we have no need for the term `release` for now, is that correct?
I would guess that we should be able to distinguish them. Regardless of whether the adjudication is manual, automatic, or a mix of both, we should document how it is done and by what procedure in the project readmes.
It may be a good idea to place code related to adjudication (if any is needed) in its own new folder within the project subdirectory, to avoid confusing it with code related to `process.py`.
> The confusion seems to lie in why call the adjudicated raws as the new golds, and then call the new golds as release? ... Thus, it seems like we have no need for the term release for now, is that correct?
I wasn't implying we need to change the terminology we use in directory names. I just wanted to have distinctive names for different stages of annotation data that I can use in this discussion.
> We start with multiple raws annotated by different annotators uploaded to this repository.
I don't expect individual annotators to be git-fluent enough to know how to upload files to a GitHub repo using a specific branch, so the upload will probably be done by the project manager. Given that, it's worth going back to one of my original questions:
> Do we want to host both "raw" and "gold", or "gold" only?
In my opinion, if there's an adjudication step involved, I don't think we need to keep the pre-adjudication raws in this repo. We could keep them just internally (based on the upload method the annotators used during their work), or in a separate repo where the annotation and adjudication code live (probably one of the aapb-annenv- repos). In that way,

1. `raw` is created in a single place,
2. this can keep the `aapb-annotations` repo's primary purpose as "releasing" annotation data in a machine-consumable format,
3. the two tiers (`raw` and `gold`) are linked by a single release code (`process.py`).

What do others think?
from https://github.com/clamsproject/aapb-annotations/issues/35#issuecomment-1868132327 (@jarumihooi)
> question about where to place IAA code. What conceptually drives the separation of the different annenv tools vs. this as the dataset repository?
So I initially suggested a 3-tier data release via this repo (`raw` > `gold` > `release`), but it seems that we now all agree to keep the current 2-tier model (`raw` > `release`). With that, I don't think there is neat room to fit raws-before-adjudication (multi-set) in this repo, and that's why I'm proposing to use `annenv` repos to hold raws before adjudication (with related adjudication components, including IAA calculation), and to use this repo for the adjudicated, single-set raw dataset.
Any other thoughts or suggestions?
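For the IAA calculation mentioned above, pairwise Cohen's kappa is one common measure for two annotators labeling the same items. A minimal sketch (not the project's actual code; it assumes categorical labels aligned by position):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # chance agreement, from each annotator's label distribution
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

A kappa of 1.0 means perfect agreement; 0 means agreement no better than chance.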
### Because

(I'm using the term *dual annotation* to indicate manual annotation redundantly done by any number of annotators more than one.)

So far, all the annotation projects we've worked on had single annotation. Based on that fact, we designed the workflow for processing annotation data (raw >> gold, organization under batches and dates, etc.) without consideration of dual annotation. However, in the latest annotation effort - RFB - we started dual annotation, at least for a subset of the whole dataset. And I think it's now time to discuss how we want to host dual annotations and the adjudicated single-set "raw" data in this public repo. Concretely:
- how to host the multiple pre-adjudication annotation sets (multi-set raw, hereinafter "raw")
- how to host the adjudicated single set (hereinafter "gold")
- how to host the machine-consumable set (processed gold, hereinafter "release")
- where does adjudication happen relative to `process.py`?
- where to document the adjudication: the `README`, or a separate file/directory?

And maybe more questions.
Starting this issue to discuss details. Any input is welcome!
### Done when

We set a guideline or template for handling

### Additional context

_No response_