Closed keighrim closed 11 months ago
@keighrim I don't see an annotations/221201-aapb-collaboration-7 in newshour-namedentity-wikipedialink, unless you meant -21 instead of -7. Also, the golds/aapb-collaboration-21 directory and the 221201-aapb-collaboration-21 directory both contain an identical annotations.tab. Is that by mistake?
Yup, I meant -21, my bad.
This issue is about them being identical, whereas we want the golds to have GUID.someformat files for each media in the annotated batch.
So the process.py needs to split the existing annotations.tab into chunks, one for each unique transcript.ann? Is my understanding correct?
Yes, you understood correctly, except that the output can't really be in the ann format: that is a Brat-specific span-tagging format and won't work with grounding annotation.
Come up with a data format that can gracefully represent the grounding annotation (json, csv, etc.), and document what the format is (fields, value types, etc.) in the newshour-ne-wiki/README.md file.
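For illustration only, here is a minimal sketch of what a JSON-based format could look like, one JSON array per GUID. Every field name below (guid, surface, entity_type, link) is a hypothetical placeholder, not a schema anyone has agreed on; the real fields would have to come from the columns of annotations.tab and be documented in the README.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class GroundingRecord:
    # All field names are hypothetical placeholders, not the project's schema.
    guid: str         # identifier of the media the annotation belongs to
    surface: str      # entity text as it appears in the transcript
    entity_type: str  # named-entity category label
    link: str         # URI of the knowledge-base entry the entity is grounded to


def to_json(records):
    """Serialize the records of one GUID as a JSON array."""
    return json.dumps([asdict(r) for r in records], indent=2, ensure_ascii=False)


def from_json(text):
    """Parse a JSON array produced by to_json back into records."""
    return [GroundingRecord(**item) for item in json.loads(text)]
```

JSON keeps each value typed and named, which makes the README documentation straightforward; a csv layout with a header row would carry the same information.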
Okay, thank you. As far as labeling the columns, I wasn't sure what the integers in the second-to-last column of annotations.tab corresponded to. Do you know?
re-closed via #29.
Because
The 2022 December NEL project's process.py, from https://github.com/clamsproject/aapb-annotations/commit/ab8eb705f6950054d8cea1cdbc439f88bb957611, does nothing more than simple file copying. Since that commit we have changed how we structure this repository, so the script needs to be completely redone. Specifically, as stated in the repository README file, we want one file per media in the gold data. Namely, process.py needs to read all the tabular files from the YYMMDD-batchname directories (currently there's only one, namely annotations/221201-aapb-collaboration-7) and generate one file per GUID.
The data format of the future gold files is open for discussion.
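The read-and-split step described above could be sketched roughly as follows. This is not the actual process.py. It assumes that the tabular files are tab-delimited with no header row, that the GUID sits in the first column (GUID_COLUMN), and that each batch directory sits directly under the annotations directory. The .tsv output extension is only a stand-in until the gold format is decided.

```python
from collections import defaultdict
from pathlib import Path

GUID_COLUMN = 0  # assumption: index of the tab-separated column holding the GUID


def split_by_guid(lines, guid_column=GUID_COLUMN):
    """Group raw tab-separated lines by the value in the GUID column."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        groups[line.split("\t")[guid_column]].append(line)
    return dict(groups)


def process(annotations_dir, golds_dir):
    """Read every *.tab file in the batch directories and write one file per GUID."""
    golds = Path(golds_dir)
    golds.mkdir(parents=True, exist_ok=True)
    lines = []
    # One level of batch directories is assumed under annotations_dir.
    for tab_file in sorted(Path(annotations_dir).glob("*/*.tab")):
        lines.extend(tab_file.read_text(encoding="utf-8").splitlines())
    written = []
    for guid, guid_lines in split_by_guid(lines).items():
        out_file = golds / f"{guid}.tsv"  # placeholder extension
        out_file.write_text("\n".join(guid_lines) + "\n", encoding="utf-8")
        written.append(out_file)
    return written
```

Grouping in memory before writing means a GUID that appears in more than one batch ends up in a single gold file rather than being overwritten by the last batch read.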
Done when
golds directory, replacing the current one.
Additional context
No response