changes to make w.r.t general structure of this repo

keighrim commented 1 year ago

(This is a continuation from #2)

After #2 was closed, we worked on reformatting the annotation and gold files and putting documentation in place, based on the conclusion of that previous discussion. However, while doing so we found some places that need additional clarification and/or modification.

Here are some items @jarumihooi and I came up with today during a long collective review of the current status of the repository.

- [x] all time-based gold data should stick to a single time unit to represent time points (or windows of time). We can continue discussion on which unit to use in https://github.com/clamsproject/mmif/issues/192
- [x] batch names must be more explicit and meaningful. For example, I use aapb-collaboration-21 as a batch name to indicate the batch is created from https://github.com/clamsproject/aapb-collaboration/issues/21, but it can be better with, for example, aapb-collaboration-issue21 to mark the linked issue.
- [x] guidelines.md and README.md in individual annotation project subdirs can be merged, mostly because there sometimes isn't a clear-cut distinction between the contents of the two, and both need constant updating while the associated annotation project is ongoing.
- [x] some discrepancies between the main README and actual file organization
- [x] should golds dir contain subdirs named after the batch names?
  - answer: no
- [x] should process.py have a unified CLI argument structure?
  - answer: yes (unified as having no CLI at all)
- [x] should "gold" files be allowed to have comments inside?
  - answer: yes, depending on the syntax
- [x] for slate and chyrons, there are some overlapping annotation properties (start/end time of the time frames) but the column names are different
  - answer: fixed via #77

@jarumihooi feel free to edit this if I missed anything from our discussion today.

keighrim commented 1 year ago

In terms of annotator demographics, here's an example from https://aclanthology.org/2023.law-1.18/

keighrim commented 1 year ago

batch names must be more explicit and meaningful. For example, I use aapb-collaboration-21 as a batch name to indicate the batch is created from https://github.com/clamsproject/aapb-collaboration/issues/21, but it can be better with, for example, aapb-collaboration-issue21 to mark the linked issue.

Now that many batch files have a top comment with a URL "for more details", I don't think we need to change the batch names.

$ head -n1 batches/*
==> batches/aapb-annenv-role-filler-binder-11.txt <==
# for more details, see https://github.com/clamsproject/aapb-annenv-role-filler-binder/issues/11 

==> batches/aapb-annotation-44.txt <==
# see https://github.com/clamsproject/aapb-annotations/issues/44 for selection process

==> batches/aapb-collaboration-21.txt <==
cpb-aacip-507-154dn40c26

==> batches/aapb-collaboration-7.txt <==
# see https://github.com/clamsproject/aapb-collaboration/issues/7 for more info

==> batches/batch2.txt <==
cpb-aacip-507-154dn40c26

@jarumihooi can you make sure all batch files have that top line while you're working on the readme of the project that's relevant to that batch?
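A check like this could be scripted. The following is only a minimal sketch, assuming that a valid "top line" is a first line that starts with `#` and contains a URL; the function name `batches_missing_top_comment` is hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Assumption: a valid top comment starts with "#" and contains an http(s) URL.
TOP_COMMENT = re.compile(r"^#.*https?://\S+")


def batches_missing_top_comment(batches_dir):
    """Return sorted names of .txt batch files whose first line is not a '#' comment with a URL."""
    missing = []
    for path in sorted(Path(batches_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            first_line = f.readline()
        if not TOP_COMMENT.match(first_line):
            missing.append(path.name)
    return missing
```

Called as `batches_missing_top_comment("batches")` from the repository root, it would list the files that still start directly with a GUID instead of a comment.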

jarumihooi commented 1 year ago

Hi Keigh, can you clarify what the workflow here is?

While (working on each project), how do I check which batches have been used for that project? Do we do it manually? (Is it tracked somewhere?)

If I open the batches currently, this is what I see: 5 batches

Of them, aapb-collaboration-21.txt (in NE and NEL) and batch2.txt (chyrons) do not have this information. Secondly, where do I get this information?

For 21, it's here: https://github.com/clamsproject/aapb-collaboration/issues/21 What about batch2? This does not seem like it: https://github.com/clamsproject/aapb-collaboration/issues/2 This comment from you also seems to say we don't know how batch2 came to be: https://github.com/clamsproject/aapb-annotations/issues/24#issuecomment-1638870043

keighrim commented 1 year ago

A link to that issue (the ...-21 one) or the comment (batch2) should be enough. So, how do you know which projects used which batches? https://github.com/clamsproject/aapb-annotations/#raw-annotation-files It's not automatically "tracked", but you can search through the project directories (e.g., with find).

jarumihooi commented 1 year ago

What am I searching/finding for? Right now, I've looked manually through each raw data section. As far as I'm aware, these and the new ones for rfb and swt are the only batches. Is that correct?

Is there a batch prepared for the clustering swt-subproject?

keighrim commented 1 year ago

What am I searching/finding for?

From the readme that you participated in writing:

The batchName part of the directory name must match only one of .txt files in the batches.
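That naming rule is enough to derive the batch-to-project mapping by searching directory names. Here is a rough sketch, assuming raw subdirectories are named `YYMMDD-batchName` and sit directly under each project directory, next to a top-level `batches` directory; the function name `projects_by_batch` is made up for illustration.

```python
import re
from pathlib import Path

# Assumption: raw subdirectories look like "YYMMDD-batchName".
RAW_DIR = re.compile(r"^\d{6}-(?P<batch>.+)$")


def projects_by_batch(repo_root):
    """Map each batch name (from batches/*.txt) to the sorted project dirs that use it."""
    root = Path(repo_root)
    usage = {p.stem: set() for p in (root / "batches").glob("*.txt")}
    # Look one level below each top-level directory for raw subdirectories.
    for candidate in root.glob("*/*"):
        if not candidate.is_dir():
            continue
        match = RAW_DIR.match(candidate.name)
        if match and match.group("batch") in usage:
            usage[match.group("batch")].add(candidate.parent.name)
    return {batch: sorted(projects) for batch, projects in usage.items()}
```

Batches that map to an empty list are not used by any project directory found this way.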

Is that correct? Is there a batch prepared for the clustering swt-subproject?

If you don't see one on the public repo (in main branch or in a working branch/PR), you can consider it as non-existing.

jarumihooi commented 12 months ago

Updated batches to have comments linking to the github issue that explains their creation. Checking off the related box.

jarumihooi commented 10 months ago

"should golds dir contain subdirs named after the batch names?" -> no, standardized by Keigh in https://github.com/clamsproject/aapb-annotations/commit/d1fb17fbf1722826fb951565a15aa24a916224ab. checked.

"for slate and chyrons, there are some overlapping annotation properties (start/end time of the time frames) but the column names are different" -> this is handled by fieldname conventions set up in https://github.com/clamsproject/aapb-annotations/issues/56. Marking done for this issue to avoid duplication.

jarumihooi commented 9 months ago

There are two remaining questions on the general structure of this repo. Are these answerable, and should this issue be closed? @keighrim

There is an ongoing question about where to place IAA code. What conceptually drives the separation between the different annenv tools and this dataset repository? (If this question is unrelated to the general structure of this repo/this issue, then at present I am unaware of any further discussions on this topic.)

keighrim commented 9 months ago

should "gold" files be allowed to have comments inside?

If the format of the gold files allows syntax for commenting, why not? I think the answer is yes.
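For illustration only, here is how a consumer might read a gold file that carries comments. This sketch assumes a CSV gold file in which lines starting with `#` are comments; neither the `#` convention nor the column names in the usage note are confirmed for any particular project.

```python
import csv
import io


def read_gold_rows(text, comment_prefix="#"):
    """Parse CSV text into dict rows, skipping lines that start with the comment prefix."""
    lines = (
        line
        for line in io.StringIO(text)
        if not line.lstrip().startswith(comment_prefix)
    )
    return list(csv.DictReader(lines))
```

With the hypothetical input `"# note\nid,start,end\nx,0,10\n"`, the comment line is dropped before the header is parsed. Note that this line-based filter would also drop a quoted multi-line field whose continuation line begins with `#`.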

should process.py have a unified CLI argument structure?

Since we now know that

- all raw subdirectories are prefixed with YYMMDD-
- golds subdirectory must not have internal structure
- process.py should collect all raw files and put them in golds without subdirs

I now believe that process.py should not have any CLI, and all current process.py scripts have been updated accordingly via #77.
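To make the three constraints concrete, here is a minimal sketch of what a CLI-less script could look like. It is not the code from #77: the function name `collect_golds` is hypothetical, and a plain file copy stands in for whatever project-specific conversion a real process.py performs.

```python
import re
import shutil
from pathlib import Path

# Assumption: raw subdirectories are recognized by their "YYMMDD-" prefix.
RAW_PREFIX = re.compile(r"^\d{6}-")


def collect_golds(project_dir):
    """Copy every file under YYMMDD-* raw subdirs into a flat golds directory."""
    project = Path(project_dir)
    golds = project / "golds"
    golds.mkdir(exist_ok=True)
    written = []
    for raw_dir in sorted(project.iterdir()):
        if not (raw_dir.is_dir() and RAW_PREFIX.match(raw_dir.name)):
            continue
        for raw_file in sorted(raw_dir.rglob("*")):
            if raw_file.is_file():
                # Flat output: only the file name is kept, never the batch subdir.
                # Placeholder for the real raw-to-gold conversion.
                shutil.copyfile(raw_file, golds / raw_file.name)
                written.append(raw_file.name)
    return written
```

A script with no CLI could simply end with `collect_golds(Path(__file__).parent)`, since everything it needs follows from the directory layout. In this sketch, files with the same name in two raw subdirectories would overwrite each other, with the later date prefix winning.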

For IAA/adjudication related discussion, there's #71...?

I guess all the originally raised questions are now fixed or answered. Closing the issue as completed.

clamsproject / aapb-annotations

changes to make w.r.t general structure of this repo #35