HumanSignal / label-studio-converter

Tools for converting Label Studio annotations into common dataset formats
https://labelstud.io/

Documents delimiter for CoNLL 2003 NER #10

Open abduhbm opened 4 years ago

abduhbm commented 4 years ago

Annotated data exported in CoNLL 2003 NER format cannot be imported into spaCy. According to its documentation for converting CoNLL 2003 NER data to JSON, spaCy expects documents to be separated by a -DOCSTART- -X- O O line and sentences to be separated by blank lines. https://spacy.io/api/cli#convert
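To make the expected layout concrete, here is a minimal sketch of a writer that emits the document delimiter and blank-line sentence breaks. It is not the converter's actual code: the function name to_conll2003, the nested-list input shape, and the "-X- _" placeholder columns for POS and chunk tags are all assumptions made for illustration.

```python
# Hypothetical sketch, not label-studio-converter's implementation.
DOCSTART = "-DOCSTART- -X- O O"  # document delimiter line described above


def to_conll2003(documents):
    """Serialize documents to CoNLL 2003 style text.

    `documents` is a list of documents; each document is a list of
    sentences; each sentence is a list of (token, ner_tag) pairs.
    """
    lines = []
    for sentences in documents:
        # Every document starts with the delimiter and a blank line.
        lines.append(DOCSTART)
        lines.append("")
        for sentence in sentences:
            for token, tag in sentence:
                # POS and chunk columns are placeholders (assumed here).
                lines.append(f"{token} -X- _ {tag}")
            # A blank line closes each sentence.
            lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    docs = [
        [[("John", "B-PER"), ("runs", "O")]],
        [[("Paris", "B-LOC")]],
    ]
    print(to_conll2003(docs))
```

With two input documents, the output contains two delimiter lines, each followed by a blank line, and each sentence ends with a blank line.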

Should this be handled in the converter? If yes, I can push a PR to fix it.

normalnyi commented 4 years ago

Yes, it should be implemented here, but I suggest adding it under a different format name rather than changing the existing one: https://github.com/heartexlabs/label-studio-converter/blob/master/label_studio_converter/converter.py#L34