NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License
918 stars 73 forks

Transforming corpus to Spacy docbin format #9

Closed wpnbos closed 3 years ago

wpnbos commented 3 years ago

Hi,

I am currently conducting research on weak supervision for NER in Dutch, and would like to use the model from your 2020 paper as a baseline. Since I'll be working with CoNLL-2002 rather than CoNLL-2003 for its Dutch subset, I was wondering whether you have a method or any tips for converting the CoNLL IOB files to spaCy's DocBin format, as you seem to have done this yourself already.

Thanks in advance!

plison commented 3 years ago

Yes, the easiest way is to use spaCy's conversion tool (see https://spacy.io/api/cli#convert). You can use it to convert files in CoNLL format directly into DocBin.
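For anyone who wants to check the input before running the converter, here is a minimal sketch of what the conversion has to do: read token-per-line IOB data and group the tags into entity spans. It is not the code used for skweak or spaCy's own converter. It assumes whitespace-separated columns with the token first and the NER tag last, and the sample sentences, `read_conll_iob`, and `iob_to_spans` are made up for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB tag) pairs


def read_conll_iob(lines: Iterable[str]) -> List[Sentence]:
    """Split token-per-line CoNLL data into sentences.

    Assumes the token is the first column and the NER tag the last one;
    blank lines and -DOCSTART- lines are treated as sentence boundaries.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Turn IOB tags into (start, end, label) token spans, end exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if start is not None and not continues:
            # The open entity ends before this token.
            spans.append((start, i, label))
            start, label = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            # "I-" without an open entity also starts one (IOB1 style).
            start, label = i, name
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Made-up sample in a "token POS tag" layout, for illustration only.
SAMPLE = """-DOCSTART- -DOCSTART- O

Jan N B-PER
woont V O
in Prep O
Den N B-LOC
Haag N I-LOC
. Punc O

Hij Pron O
werkt V O
. Punc O
"""

sentences = read_conll_iob(SAMPLE.splitlines())
first_tokens = [token for token, _ in sentences[0]]
first_spans = iob_to_spans([tag for _, tag in sentences[0]])
```

With spaCy v3, the CLI route would presumably look something like `python -m spacy convert train.conll ./corpus --converter ner --lang nl`, where `train.conll` and `./corpus` are placeholder paths; check the linked documentation for the options available in your installed version.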

wpnbos commented 3 years ago

Tried it and it seems to have worked so far! Thanks heaps!