Closed keighrim closed 9 months ago
In terms of annotator demographics, here's an example from https://aclanthology.org/2023.law-1.18/
batch names must be more explicit and meaningful. For example, I use aapb-collaboration-21 as a batch name to indicate the batch is created from https://github.com/clamsproject/aapb-collaboration/issues/21, but it can be better with, for example, aapb-collaboration-issue21 to mark the linked issue.
Now that many batch files have the top comment with a url "for more details", I don't think we need to change the batch names.
$ head -n1 batches/*
==> batches/aapb-annenv-role-filler-binder-11.txt <==
# for more details, see https://github.com/clamsproject/aapb-annenv-role-filler-binder/issues/11
==> batches/aapb-annotation-44.txt <==
# see https://github.com/clamsproject/aapb-annotations/issues/44 for selection process
==> batches/aapb-collaboration-21.txt <==
cpb-aacip-507-154dn40c26
==> batches/aapb-collaboration-7.txt <==
# see https://github.com/clamsproject/aapb-collaboration/issues/7 for more info
==> batches/batch2.txt <==
cpb-aacip-507-154dn40c26
@jarumihooi can you make sure all batch files have that top line while you're working on the readme of the project that's relevant to that batch?
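A minimal sketch of how that check could be automated, assuming the batch files live in a `batches` directory as `.txt` files and that the "top line" is any first line starting with `#`. The function name `batches_missing_header` is hypothetical and not part of the repository.

```python
from pathlib import Path


def batches_missing_header(batches_dir):
    """Return names of batch files whose first line is not a '#' comment."""
    missing = []
    for batch_file in sorted(Path(batches_dir).glob("*.txt")):
        with batch_file.open(encoding="utf-8") as f:
            first_line = f.readline()
        # A batch file that starts directly with a GUID has no explanatory comment.
        if not first_line.lstrip().startswith("#"):
            missing.append(batch_file.name)
    return missing
```

Calling `batches_missing_header("batches")` from the repository root would list the files that still need a comment line. It only looks at the first line, like the `head -n1` output above.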
Hi Keigh, can you clarify what the workflow here is?
While (working on each project), how do I check which batches have been used for that project? Do we do it manually? (Is it tracked somewhere?)
If I open the batches currently, this is what I see: 5 batches
Of them, aapb-collaboration-21.txt (in NE and NEL) and batch2.txt (chyrons) do not have this information. Secondly, where do I get this information?
For 21 it's here: https://github.com/clamsproject/aapb-collaboration/issues/21 What about batch2? This does not seem like it: https://github.com/clamsproject/aapb-collaboration/issues/2 This comment from you also seems to say we don't know how batch2 came to be: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043
Link to that issue (...-21 one) or the comment (batch2) should be enough. So, how do you know which projects used which batches? https://github.com/clamsproject/aapb-annotations/#raw-annotation-files It is not automatically "tracked", but you can search through the project directories (e.g., with find).
What am I searching/finding for?
Right now, I've looked manually thru each raw data section. As far as I'm aware these and the new ones for rfb and swt are the only batches. Is that correct?
Is there a batch prepared for the clustering swt-subproject?
What am I searching/finding for?
From the readme that you participated in writing;
The batchName part of the directory name must match only one of .txt files in the batches.
Is that correct? Is there a batch prepared for the clustering swt-subproject?
If you don't see one on the public repo (in the main branch or in a working branch/PR), you can consider it non-existent.
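The README rule quoted above suggests one way to answer "which projects used which batches" without manual browsing. The sketch below is only an illustration under stated assumptions: each project is a top-level directory, raw annotation directories are named `YYMMDD-batchName`, and batch files are `batches/<batchName>.txt`. The function name `projects_by_batch` is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed raw-annotation directory name: six digits, a hyphen, then the batch name.
RAW_DIR_PATTERN = re.compile(r"^\d{6}-(?P<batch>.+)$")


def projects_by_batch(repo_root):
    """Map each known batch name to the sorted list of projects that use it."""
    root = Path(repo_root)
    known = {p.stem for p in (root / "batches").glob("*.txt")}
    usage = defaultdict(set)
    for project_dir in root.iterdir():
        if not project_dir.is_dir() or project_dir.name == "batches":
            continue
        for sub in project_dir.iterdir():
            match = RAW_DIR_PATTERN.match(sub.name)
            # Only count directories whose batch part matches an existing batch file.
            if sub.is_dir() and match and match.group("batch") in known:
                usage[match.group("batch")].add(project_dir.name)
    return {batch: sorted(projects) for batch, projects in usage.items()}
```

A batch that never appears in the result has a file under `batches` but no matching raw directory in any project.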
Updated batches to have comments linking to the github issue that explains their creation. Checking off the related box.
"should golds dir contain subdirs named after the batch names?" -> no, standardized by Keigh in https://github.com/clamsproject/aapb-annotations/commit/d1fb17fbf1722826fb951565a15aa24a916224ab. checked.
" for slate and chyrons, there are some overlapping annotation properties (start/end time of the time frames) but the column names are different" -> this is handled by fieldname conventions set up in https://github.com/clamsproject/aapb-annotations/issues/56. Marking done for this issue to avoid duplicate.
There are two remaining questions on the general structure of this repo. Are these answerable, and should this issue be closed? @keighrim
There is an ongoing question about where to place IAA code. What conceptually drives the separation of different annenv tools vs this as the dataset repository? (If this question is unrelated to the general structure of this repo/this issue, then at the present, I am unaware of any other further discussions on this topic).
should "gold" files be allowed to have comments inside?
If the format of the gold files allows syntax for commenting, why not? I think the answer is yes.
should process.py have a unified CLI argument structure?
Since we now know that YYMMDD- golds subdirectory must not have internal structure and process.py should collect all raw files and put them in golds without subdirs, I now believe that process.py should not have any CLI, and all current process.py scripts have been updated accordingly via #77. For IAA/adjudication related discussion, there's #71...?
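As a rough illustration of what a CLI-free process.py could look like (this is not the code from #77), the sketch below walks every raw directory of a project and writes one output file per raw file directly into `golds`. The `YYMMDD-` directory pattern, the `convert` placeholder, and the function names are assumptions made for the example. Real conversion logic is project-specific.

```python
import re
import shutil
from pathlib import Path

# Assumed naming of raw annotation directories: YYMMDD-batchName.
RAW_DIR_PATTERN = re.compile(r"^\d{6}-.+$")


def convert(raw_file, gold_file):
    """Placeholder for project-specific conversion; this sketch only copies."""
    shutil.copyfile(raw_file, gold_file)


def process(project_dir="."):
    """Collect all raw files and write them flat into <project_dir>/golds."""
    project = Path(project_dir)
    golds = project / "golds"
    golds.mkdir(exist_ok=True)
    written = []
    for raw_dir in sorted(project.iterdir()):
        if not (raw_dir.is_dir() and RAW_DIR_PATTERN.match(raw_dir.name)):
            continue
        for raw_file in sorted(raw_dir.rglob("*")):
            if raw_file.is_file():
                # Flatten: keep only the file name, drop any directory structure.
                gold_file = golds / raw_file.name
                convert(raw_file, gold_file)
                written.append(gold_file.name)
    return written
```

With no arguments to parse, the script would simply end with a call to `process()`. Note that two raw files with the same name in different batches would overwrite each other in this sketch.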
I guess all the originally raised questions are now fixed or answered. Closing the issue as completed.
(This is a continuation from #2)
After #2 was closed, we worked on reformatting the annotation and gold files and putting documentation in its place, based on the conclusion from that previous discussion. However, while doing so we found some places that need additional clarification and/or modification.
Here are some items @jarumihooi and I came up with today through a long collective review of the current status of the repository.
batch names must be more explicit and meaningful. For example, I use aapb-collaboration-21 as a batch name to indicate the batch is created from https://github.com/clamsproject/aapb-collaboration/issues/21, but it can be better with, for example, aapb-collaboration-issue21 to mark the linked issue.
guidelines.md and README.md in individual annotation project subdirs can be merged. Mostly because sometimes there isn't a clear-cut line between those two contents, and both need constant updating while the associated annotation project is ongoing.
should golds dir contain subdirs named after the batch names?
should process.py have a unified CLI argument structure?
@jarumihooi feel free to edit this if I missed anything from our discussion today.