clamsproject / aapb-annotations

Repository to store manual annotation dataset developed for CLAMS-AAPB collaboration

Using the CoNLL format #8

Closed marcverhagen closed 1 year ago

marcverhagen commented 1 year ago

There is a potential copyright problem with this. If we have NER annotations over transcripts, then the CoNLL files will contain every token of the transcript, not just the tokens in named entities, so the full text can be read back out of them. The GBH legal department will probably frown on this.
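To illustrate the concern, here is a minimal sketch of a CoNLL-style export with one token and one BIO tag per line. The sentence, the entity spans, the label names, and the `to_conll` helper are all hypothetical and are not taken from the repository or from any AAPB transcript; tokenization is plain whitespace splitting for simplicity.

```python
# Hypothetical example data; not taken from the AAPB transcripts.
text = "Jim Lehrer reported from Washington"
# (start, end, label) character offsets; label names are assumptions.
entities = [(0, 10, "PERSON"), (25, 35, "LOCATION")]


def to_conll(text, entities):
    """Return one 'token<TAB>tag' line per whitespace token, using BIO tags."""
    lines = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            if start >= ent_start and end <= ent_end:
                prefix = "B" if start == ent_start else "I"
                tag = f"{prefix}-{label}"
                break
        lines.append(f"{token}\t{tag}")
    return lines


conll_lines = to_conll(text, entities)

# Every token ends up in the file, so the transcript can be rebuilt
# from the first column alone.
recovered = " ".join(line.split("\t")[0] for line in conll_lines)
```

In this sketch `recovered` is identical to the input text, which is the situation described above: publishing the token column amounts to publishing the transcript.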

We can either use some other format, like MMIF, or make this a private directory. I am experimenting with the former for NER, since it is much more compact than CoNLL.

Any other solutions?

keighrim commented 1 year ago

For NE annotation, I think we decided in last week's meeting to use the .ann format, as it is pretty portable and informative enough to handle all character-based offsets and NE labels. It also keeps the source text out of the annotation files, so we don't have to hide the data in a private repo.
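As a rough sketch of what that looks like, assuming the .ann files follow the brat standoff convention for text-bound annotations (`ID<TAB>LABEL START END<TAB>TEXT`): the sentence, the spans, the label names, and the `to_ann` helper below are hypothetical, and the actual files in this repository may differ in detail.

```python
# Hypothetical example data; not taken from the AAPB transcripts.
text = "Jim Lehrer reported from Washington"
# (start, end, label) character offsets; label names are assumptions.
entities = [(0, 10, "PERSON"), (25, 35, "LOCATION")]


def to_ann(text, entities):
    """Return one standoff line per entity: ID, label with offsets, covered text."""
    lines = []
    for i, (start, end, label) in enumerate(entities, start=1):
        lines.append(f"T{i}\t{label} {start} {end}\t{text[start:end]}")
    return lines


ann_lines = to_ann(text, entities)
for line in ann_lines:
    print(line)
```

Only the entity mentions and their offsets appear in the output; the words between entities ("reported", "from") are never written, so the transcript cannot be reassembled from the annotation file alone.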

marcverhagen commented 1 year ago

Yes, that is how I remember it too: just use the .ann format.

keighrim commented 1 year ago

Okay... can you take care of replacing the gold files and process.py?

marcverhagen commented 1 year ago

Yes