marcverhagen closed this 1 year ago
For NE annotation, I think we concluded in last week's meeting to use the .ann format, since it is pretty portable and informative enough to handle character-based offsets and NE labels. It also hides the source text, so we don't have to hide the data in a private repo.
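For reference, a minimal sketch of reading such a file, assuming .ann here means brat-style standoff lines of the form `ID<TAB>LABEL START END<TAB>TEXT`. The `Entity` type, the `parse_ann` helper, and the sample lines are hypothetical illustrations, not the code in process.py, and discontinuous spans are not handled.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ident: str   # annotation id, e.g. "T1"
    label: str   # NE label, e.g. "person"
    start: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    text: str    # the covered entity string only, not the full transcript


def parse_ann(content: str) -> List[Entity]:
    """Parse text-bound annotation lines (ids starting with 'T').

    Assumes one contiguous span per line; other line types are skipped.
    """
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue
        ident, type_and_span, text = line.split("\t", 2)
        label, start, end = type_and_span.split(" ")
        entities.append(Entity(ident, label, int(start), int(end), text))
    return entities


# Hypothetical sample for the made-up sentence "Jane Doe reports from Boston."
sample = "T1\tperson 0 8\tJane Doe\nT2\tlocation 22 28\tBoston\n"
entities = parse_ann(sample)
```

Only the entity strings and their offsets are stored, so the rest of the transcript cannot be recovered from the file alone.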
Yes, that is how I remember it too, just use the .ann format.
Okay, can you take care of replacing the gold files and process.py?
Yes
There is a potential copyright problem with this. If we have NER annotations over transcripts, then the CoNLL files will contain all tokens, not just the tokens in named entities. The GBH legal department will probably frown on this.
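To make the concern concrete, here is a rough sketch of how a token-per-line, CoNLL-style rendering exposes every token of the text. The `to_conll` helper, the simple regex tokenizer, and the sample sentence are assumptions for illustration only, not our actual pipeline.

```python
import re
from typing import List, Tuple


def to_conll(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Render text as 'token<TAB>BIO-tag' lines.

    spans holds (start, end, label) character offsets. Tokenization is a
    naive split into word runs and single punctuation marks.
    """
    lines = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        lines.append(f"{match.group()}\t{tag}")
    return lines


# Hypothetical sentence and entity spans
text = "Jane Doe reports from Boston."
spans = [(0, 8, "person"), (22, 28, "location")]
conll_lines = to_conll(text, spans)
```

Every token gets a line, including the ones tagged `O`, so the transcript can be read back from the first column. A standoff file with only offsets and labels does not have this property.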
We either use some other format like MMIF or make this a private directory. I am experimenting with the former for NER since it is much more compact than CoNLL.
Any other solutions?