clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

reformat NEL gold files #19

Closed keighrim closed 11 months ago

keighrim commented 1 year ago

Because

The process.py file of the December 2022 NEL project, from https://github.com/clamsproject/aapb-annotations/commit/ab8eb705f6950054d8cea1cdbc439f88bb957611, does nothing more than simple file copying. Since that commit, however, we have changed how we structure this repository, so the script needs to be completely redone. Specifically, as stated in the repository README file, we want one file per media item in the gold data. That is, process.py needs to read all the tabular files in the YYMMDD-batchname directories (currently there is only one, annotations/221201-aapb-collaboration-7) and generate one file per GUID.

The data format of the future gold files is still open for discussion.
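
A minimal sketch of the split described above, not the actual process.py. It assumes each row of a tab-separated file carries the GUID in a known column and that there is no header row; the real column layout of the tabular files is not given in this thread. The function name `split_by_guid`, the `guid_col` parameter, and the `.tsv` output extension are hypothetical placeholders, since the gold format is still undecided.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_guid(tab_path, out_dir, guid_col=0):
    """Write one tab-separated file per GUID found in column `guid_col`.

    Assumption (not confirmed by the thread): every non-empty row holds
    the GUID in column `guid_col`, and the input has no header row.
    Returns the sorted list of GUIDs that were written.
    """
    rows_by_guid = defaultdict(list)
    with open(tab_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:  # skip blank lines
                continue
            rows_by_guid[row[guid_col]].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for guid, rows in rows_by_guid.items():
        # One output file per GUID, rows kept in their original order.
        with open(out_dir / f"{guid}.tsv", "w", newline="", encoding="utf-8") as f:
            csv.writer(f, delimiter="\t").writerows(rows)
    return sorted(rows_by_guid)
```

Calling `split_by_guid("annotations.tab", "golds")` would then leave one `GUID.tsv` file per media item in `golds/`, under the assumptions stated above.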

Done when

Additional context

No response

wricketts commented 1 year ago

@keighrim I don't see an annotations/221201-aapb-collaboration-7 in newshour-namedentity-wikipedialink, unless you meant -21 instead of -7. Also, the golds/aapb-collaboration-21 directory and the 221201-aapb-collaboration-21 directory both contain an identical annotations.tab. Is that by mistake?

keighrim commented 1 year ago

Yup, I meant -21, my bad.


This issue is about them being identical, whereas we want the golds to have GUID.someformat files for each media in the annotated batch.

wricketts commented 1 year ago

So the process.py needs to split the existing annotations.tab into chunks, one for each unique transcript.ann? Is my understanding correct?

keighrim commented 1 year ago

Yes, you understood correctly, except that the output can't really be in ann format: that is a Brat-specific span-tagging format and won't work for grounding annotation.

Come up with a data format that can gracefully represent the grounding annotation (json, csv, etc.), and document what the format is (fields, value types, etc.) in the newshour-ne-wiki/README.md file.
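
As a purely illustrative sketch of what such a format could look like, one option is a JSON document per GUID holding a list of grounding records. Every field name below (`begin`, `end`, `text`, `link`) is a hypothetical placeholder and not the schema the project adopted; the thread leaves the actual fields to be decided and documented in the README.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Grounding:
    # All field names are hypothetical placeholders, not the project's schema.
    begin: int  # offset where the mention starts
    end: int    # offset where the mention ends (exclusive)
    text: str   # surface form of the mention
    link: str   # identifier or URL of the entity the mention is grounded to


def to_json(guid, records):
    """Serialize all grounding records of one media item (one GUID) to JSON."""
    payload = {"guid": guid, "annotations": [asdict(r) for r in records]}
    return json.dumps(payload, indent=2)
```

Unlike a span-only format, each record here carries the link target alongside the span, which is the part a grounding annotation needs.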

wricketts commented 1 year ago

Okay, thank you. As far as labeling the columns, I wasn't sure what the integers in the second-to-last column of annotations.tab corresponded to. Do you know?

keighrim commented 11 months ago

re-closed via #29.