smartschat closed this issue 7 years ago
spaCy speaks text, not Treebankese :).
You need to unescape your -RRB-, -LRB-, '', `` etc. These aren't words, so spaCy isn't trained on them, and the model has no idea what they might mean.
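A minimal sketch of that unescaping step, assuming the pre-computed tokens are held in a plain Python list. The mapping covers the usual Penn Treebank bracket and quote escapes; the names `PTB_UNESCAPE` and `unescape_ptb` are made up for illustration and are not part of spaCy.

```python
# Common Penn Treebank escape sequences and the raw characters they stand for.
PTB_UNESCAPE = {
    "-LRB-": "(",
    "-RRB-": ")",
    "-LSB-": "[",
    "-RSB-": "]",
    "-LCB-": "{",
    "-RCB-": "}",
    "``": '"',
    "''": '"',
}


def unescape_ptb(tokens):
    """Return a new token list with Treebank escapes replaced by raw characters."""
    # Tokens that are not escapes (including clitics like "'s") pass through unchanged.
    return [PTB_UNESCAPE.get(tok, tok) for tok in tokens]


tokens = ["Agency", "-LRB-", "DEA", "-RRB-", "said", "``", "a", "couple", "''", "."]
print(unescape_ptb(tokens))
```

The token count is unchanged, so any alignment with the original treebank tokens is preserved.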
Thank you for the fast reply! However, even after replacing the bracket and quotation tokens, the parsing/sentence splitting produces similar output.
In contrast, when running
doc = nlp("Dr Conrad Murray, who police say is not a suspect, was at Jackson's mansion and tried to revive him before he died. Assistant Special Agent Michael Flanigan of the Las Vegas branch of the Drug Enforcement Agency -LRB- DEA -RRB- said the operation was likely to take ``a couple of hours''.")
the sentence splitting is correct.
Oh right. You didn't run the tagger.
nlp.tagger(doc)
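For reference, a rough sketch of the full sequence with pre-computed tokens, assuming the spaCy 1.x API (`nlp.tokenizer.tokens_from_list`, `nlp.tagger`, `nlp.parser`). The helper name `parse_pretokenized` and the token list in the usage comment are hypothetical; this is not the code from the original report.

```python
def parse_pretokenized(nlp, words):
    """Build a Doc from an existing token list, then tag and parse it.

    Assumes a spaCy 1.x pipeline object exposing `tokenizer.tokens_from_list`,
    `tagger` and `parser`.
    """
    doc = nlp.tokenizer.tokens_from_list(words)  # keep the given tokenization
    nlp.tagger(doc)  # run the tagger first, as suggested in this thread
    nlp.parser(doc)  # the parser then sets dependencies and sentence boundaries
    return doc


# Usage (requires spaCy 1.x and an installed English model):
#   import spacy
#   nlp = spacy.load("en")
#   doc = parse_pretokenized(nlp, ["He", "died", ".", "Police", "left", "."])
#   for sent in doc.sents:
#       print(sent.text)
```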
Btw, are you interested in getting cort ported to spaCy? I would definitely support that :)
There's really not so much to do...I even have a beam parser in another branch:
Parser: https://github.com/explosion/spaCy/blob/july16/spacy/syntax/beam_parser.pyx
Structured Averaged Perceptron and structured MLP: https://github.com/explosion/spaCy/blob/july16/spacy/syntax/_neural.pyx
The parser uses an imitation learning strategy with a dynamic oracle, so the path through the transition-space is left latent. I've had good success with using the maximum violation update with beam-search to train this. The code above also supports an inexact memoisation mechanism: a hash of each state is calculated, including the history, so that strictly dominated states can be pruned from the search.
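To make the maximum violation idea concrete, here is a small illustrative sketch. It is not taken from the linked spaCy code: it assumes you already have, for each transition step, the cumulative score of the gold prefix and of the best prefix in the beam, and it only picks the step at which the update would be made. The function name and inputs are hypothetical.

```python
def max_violation_step(gold_scores, beam_scores):
    """Return (step, violation) for the step where the beam's best prefix
    outscores the gold prefix by the largest margin.

    gold_scores[i]: cumulative model score of the gold prefix after step i.
    beam_scores[i]: cumulative model score of the best beam prefix after step i.
    """
    if not gold_scores or len(gold_scores) != len(beam_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    violations = [b - g for g, b in zip(gold_scores, beam_scores)]
    step = max(range(len(violations)), key=violations.__getitem__)
    return step, violations[step]


# The update would be computed from the two prefixes truncated at `step`.
step, violation = max_violation_step([1.0, 2.0, 2.5], [1.5, 4.0, 3.0])
print(step, violation)
```

With a dynamic oracle, the "gold" prefix would be the best-scoring candidate that the oracle still considers optimal, rather than a single fixed sequence; that part is left out of the sketch.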
Thanks, that resolved the issue!
I am very interested in getting cort ported to spaCy! I'll contact you.
When supplying spaCy with pre-computed tokenization, I sometimes get weird sentence segmentation results.
An example:
The output then is:
I'm running spaCy 1.6 with Python 3.5.2 on Debian GNU/Linux 8 (jessie) 64-bit.