allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

any guidance on ontonotes prep? #421

Open mprorock opened 2 years ago

mprorock commented 2 years ago

Since OntoNotes requires licensing directly from the source, are there any pointers or scripts for converting the corpus format into the expected train / dev / test splits, so that ud_ontonotes.tar.gz can be properly replicated locally?

dakinggg commented 2 years ago

Unfortunately, I don't have the exact details to give you; @DeNeutoy might remember more. I believe it was the same splits and processing that spaCy uses, which appear to be referenced in a couple of places (https://github.com/explosion/spaCy/issues/5276, https://github.com/explosion/spaCy/issues/3587#issuecomment-483191672).
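For anyone attempting this locally, here is a minimal sketch of just the split-assignment step. It is not the script used to build ud_ontonotes.tar.gz. It assumes the corpus has already been converted to one file per document, and that you have obtained the document ID list for each split from whichever split definition you follow. The function name `assign_splits`, the `converted/` directory, the `.conllu` extension, and the `doc_*` IDs are all hypothetical placeholders.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


def assign_splits(
    doc_paths: Iterable[str], split_ids: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Group document paths by split.

    Assumption: a document's ID is its file name without the extension.
    Documents whose ID is in no split are left out of the result.
    """
    grouped: Dict[str, List[str]] = {name: [] for name in split_ids}
    for path in sorted(doc_paths):
        doc_id = Path(path).stem
        for name, ids in split_ids.items():
            if doc_id in ids:
                grouped[name].append(path)
                break  # each document goes to at most one split
    return grouped


# Hypothetical ID lists and file names, for illustration only.
split_ids = {
    "train": {"doc_a", "doc_b"},
    "dev": {"doc_c"},
    "test": {"doc_d"},
}
paths = [
    "converted/doc_b.conllu",
    "converted/doc_a.conllu",
    "converted/doc_c.conllu",
    "converted/doc_d.conllu",
    "converted/doc_e.conllu",  # not listed in any split, so it is skipped
]
splits = assign_splits(paths, split_ids)
```

Each list in `splits` could then be concatenated into a single train, dev, or test file. Whether the result matches the original archive depends entirely on using the same ID lists and the same conversion, which this thread does not pin down.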