Closed chrisjbryant closed 8 years ago
More generally, I'm wondering where to report these sort of issues so you can get a proper overview of them, eg https://api.spacy.io/displacy/index.html?full=for+another+great+trade
Hmm.
The sentence boundary question is very tricky. On the one hand, I think the annotation is sort of correct! spaCy is trained to handle a variety of text types, including less formal English, where you can't rely on punctuation. I think this is a case where it's hard to provide one model to rule them all.
We could go in a number of different directions here. I'd appreciate your thoughts on which sound appealing to you.
parser.train()
a few times on a few sentences that you're failing on.

I suggest 5) as a good approach. It's a little less explicit than 4), but it has the advantage of being a do-once sort of process. You can confine the ad hoc adjustments to a background task, and keep the logic from being littered around the rest of your application.
On the other hand, it might be a little more work, conceptually. In lines of code the solution would be a similar size, but the user has to understand a little bit more about how the system works to make good use of this approach.
Wow - thanks for such a detailed response!
I think I'd agree that, in retrospect, this is probably just a case that slips through the net, and there will always be some weird exceptions.
I'm actually working with 2 types of text: potentially ungrammatical texts and then their grammatically corrected versions, so it's no surprise that sometimes the sentence boundaries are different between the noisy/clean data.
I appreciate the offers to help write custom rule logic or add some extra training examples to the model, but, to be honest, I don't think I'd call this a "Russian politician Belieber" situation. I was mainly interested to hear if this was something that could be handled easily.
On further thought, it might just be easiest for me to check whether a given non-terminal Span ends with {.?!} (or similar) and use that to help prune these rare exceptions from the tokenised sentences. While you could add something similar to spaCy, I think you'd then limit your ability to sentence-tokenise badly punctuated text, so from your perspective, it's probably not worth doing. If I find some more systematic problems that really do break things, I'll be sure to get in touch though. :)
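For what it's worth, the pruning idea can be sketched in plain Python. The terminal-character set and the merge-forward strategy here are my assumptions, not anything spaCy provides:

```python
# Sketch of the pruning idea: merge any "sentence" that doesn't end in
# terminal punctuation into the sentence that follows it. The TERMINALS
# set and the merge-forward strategy are illustrative assumptions.
TERMINALS = (".", "?", "!", '"', "'", ")")

def merge_nonterminal_sents(sents):
    merged = []
    for sent in sents:
        if merged and not merged[-1].rstrip().endswith(TERMINALS):
            # Previous chunk had no terminal punctuation: glue this one on
            merged[-1] = merged[-1].rstrip() + " " + sent
        else:
            merged.append(sent)
    return merged
```

So a spurious split like the one around "so" gets merged back into a single sentence, while properly terminated sentences are left alone.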
P.S. "I think most ML and NLP libraries give you too many stupid fucking knobs to twiddle." made my day. Couldn't agree more; I really love the rationale behind SpaCy!
If you feel okay about having a little bit of Cython code, you could probably solve this elegantly by replacing the function that tells the parser whether the sentence-break action is valid. Your code will depend on spaCy internals, so you'll have to watch out for future versions breaking your code. But these details have been stable for some time, and updating probably won't be hard.
from spacy.en import English
from spacy.syntax.parser cimport Parser
from spacy.syntax.arc_eager cimport Transition
nlp = English()
# Get a C-typed reference to the parser, so we can access its C members
cdef Parser parser = nlp.parser
# Find the transition that applies BREAK
cdef Transition break_transition = parser.moves.lookup_transition('B')
parser.moves.c[break_transition.clas].is_valid = my_break_is_valid
The `spacy.syntax.parser.Parser` class holds an instance of `spacy.syntax.arc_eager.ArcEager`, at `.moves`. This instance controls the transition system: it tells the parser which actions are legal given the current parse state, and knows how to apply them. This is achieved by having an array of structs, with each struct holding function pointers. So you could write a replacement for this function:
https://github.com/honnibal/spaCy/blob/master/spacy/syntax/arc_eager.pyx#L228
(Don't be fooled by the fact that this is on a class --- it's just there to collect the functions together, because loose functions get messy sometimes. It's just a plain old C function.)
Long story short: `nlp.parser.moves.c[break_clas].is_valid` is just a function pointer. You can write another function and replace it, so long as you match the signature, write the function in Cython, and replace it in Cython. This way you'll get the best parse subject to the constraints you require.
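A rough pure-Python analogue of that pattern, with hypothetical names (in spaCy itself this is a C function pointer in a struct, not a dict of callables):

```python
# Hypothetical pure-Python analogue of swapping the BREAK validity check.
# A dict of callables stands in for the array of transition structs.
class ToyTransitionSystem:
    def __init__(self):
        # Default: BREAK is always valid
        self.is_valid = {"B": lambda state: True}

def my_break_is_valid(state):
    # Illustrative constraint: only allow a sentence break after
    # terminal punctuation.
    return state["prev_word"] in {".", "?", "!"}

moves = ToyTransitionSystem()
moves.is_valid["B"] = my_break_is_valid  # replace, keeping the same signature
```

The point is just that the replacement must keep the original signature; the parser then consults your function instead of the default whenever it considers the BREAK action.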
Here is another one: "Cool! Glows in the dark!"
It's similar because the punctuation is being ignored as a signal, but opposite because it's seeing 1 sentence where there are 2 (i.e. the user added incorrect punctuation) rather than 2 where there is 1.
Why it's hard:
Why it's easy:
I think the long-term solution is to estimate the text quality and make that part of the model.
By the way, it works if I add another space after '!' ("Cool! Glows in the dark!"). (So my short term solution is to replace '! ' with '! ' + ' ' before giving sentences to spaCy.)
A very unexpected result from my short term solution of adding an extra space: while it fixed that one case, in all the other cases (in a sample of 300, where most have "! "), where it had previously been working correctly, it failed to find the sentence boundary.
Example: They look so real! Wonderful seller to deal with, too! A++++
Hmm.
I think a trick I'm playing in the training might be hurting more than it helps.
I add a small amount of noise to the training data: on some texts I randomly add or delete punctuation, transform spaces into newlines, etc. The idea was to make the model more robust to these variations. But I never went the extra step of making the noise particularly realistic, so it might have the effect of making the model miss important clues, without learning the underlying signal of what's common to well and poorly punctuated texts.
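Something like this toy corrupter, where the character sets and rate are mine, not the actual `add_noise` code:

```python
import random

def add_noise_toy(text, rate=0.1, rng=None):
    # Toy version of the corruption described: randomly delete punctuation
    # and turn spaces into newlines. Not spaCy's actual add_noise.
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        if ch in ".,!?" and rng.random() < rate:
            continue          # drop punctuation
        if ch == " " and rng.random() < rate:
            out.append("\n")  # space -> newline
            continue
        out.append(ch)
    return "".join(out)
```

With purely random corruption like this, the model sees punctuation deleted in places where a careless human typist would never delete it, which is one way the augmentation could blur the signal rather than teach robustness.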
Data augmentation is always playing with fire. :-) We actually did something rather similar and now I'm wondering what problems it's causing.
I suppose that, because the way people type 'badly' is far from random, randomly lowercasing or uppercasing and removing casing, punctuation, characters and spaces makes sense, but adding case, punctuation, spaces and characters should be done with care.
That said, I very much appreciate the fundamental commitment to handling real-world, i.e. dirty, data, since as you see, we have to handle it too.
If you need our sample set, let me know. Also, is the dirtifier somewhere in this repo? Could be a valuable piece of code on its own.
https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L43
I had the idea to do a lot more with this than I have. It'd be nice to be running a service where you could noise the data how you like, and we'd train you a model.
The underlying mechanism here is super flexible. spaCy has a unique feature that no other current NLP system offers. We train on whole documents, and do a Levenshtein alignment of our tokenizer's tokens against the tokens in the gold standard. The labels of misaligned tokens are treated as latent variables in the structured prediction.
You can therefore mutate the underlying text however you like. But if it doesn't align, you don't get to supervise those tokens. This requirement could be relaxed a little --- we could allow you to specify how to infer the gold labels for the corrupted tokens.
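A rough sketch of the alignment idea using `difflib` (the real implementation does a Levenshtein alignment over token strings; treating misaligned tokens as unsupervised is modelled here as a `None` label):

```python
import difflib

def align_for_supervision(our_tokens, gold_tokens):
    # Tokens that align 1:1 with the gold standard get supervised; tokens
    # in misaligned regions get no label (None), i.e. they're latent.
    labels = [None] * len(our_tokens)
    matcher = difflib.SequenceMatcher(a=our_tokens, b=gold_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            for offset in range(i2 - i1):
                labels[i1 + offset] = gold_tokens[j1 + offset]
    return labels
```

So if your corrupted text tokenises differently from the gold standard, only the regions that still align contribute supervision; the rest is left for the model to infer.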
The CV people have shown that data augmentation can be a powerful technique. Of course, we need to get it right :). At the moment my data augmentation system results in ~0.1 to ~0.2% accuracy improvements, so I've kept it in. But it might be that there's some underlying problem.
I agree, the fundamental spaCy model is very sensible.
Looking at `add_noise` + `_corrupt`, what I observed still does not make sense to me, so I just want to make sure I described the behaviour clearly.
That is, I don't understand why spaCy behaves this way: I don't see how `add_noise` would make "!" + 2 spaces occur mid-sentence.
Also, one other point: I would not replace original lines with noisy ones, I would only add noisy lines. (Pardon me if my mental compiler interpreted the code incorrectly, but I think it's replacing only, no?)
I definitely think noise deserves a standalone lib. :-) Obviously the proper way to do is to have real data + parallel human-corrected versions and learn how to convert correct data back into real data.
There's a question too though whether things like subject dropping or multiple exclamation marks are really noise or if they're part of the language the way that don't or gotta is.
It's hard to say why the model would be making that prediction. Have a look here for an example of stepping through the parser's states:
https://github.com/honnibal/spaCy/blob/master/services/displacy.py#L195
When you get to the critical decision, you might not be able to interrogate the state so well from Python --- you might need a Cython module to get at the information you need. The interesting stuff will be in `thinc.api.Example.c.features`. The sparse dot product of the features and weights is here: https://github.com/honnibal/thinc/blob/master/thinc/model.pyx#L54 . To print or use Python from within a nogil function you have to use the `with gil:` context manager.
Re the replacement of the original lines: we make ~25 training iterations, and we noise the docs from within the loop. So we can expect that all documents will be trained on in their original form at least once, and that they'll never be noised in just the same way twice.
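Schematically, the schedule looks something like this (the iteration count, noise rate, and `corrupt` function are placeholders, not the real training code):

```python
import random

def train_loop(docs, n_iter=25, noise_rate=0.3):
    # Noising happens inside the iteration loop, so across the passes each
    # doc is almost surely seen un-noised at least once, and it is never
    # noised the same way twice. `corrupt` is a stand-in for the real noiser.
    rng = random.Random(0)

    def corrupt(doc):
        return doc.rstrip(".!?")  # toy corruption: strip final punctuation

    seen = []
    for _ in range(n_iter):
        for doc in docs:
            example = corrupt(doc) if rng.random() < noise_rate else doc
            seen.append(example)  # stand-in for the actual update step
    return seen
```

Because the coin is flipped per iteration rather than once up front, the original and corrupted forms of each document both appear in training.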
Now, what's noise and what's not is of course subjective :). I'm using the term pretty loosely. Actually I'd like to try transforming statements into questions and exclamations, and also training on noun phrases cut out of larger texts, to make models that are especially suited to conversational contexts.
I suppose when I have a chance I should try just turning add_noise off and see what it changes.
I would perhaps naively assume that at displacy.py#L195, after tokenisation, when we're stepping through the tokens, the existence of an extra space char is no longer represented.
By the way, can the parser see ahead? Or is the prediction about the current token being final made only based on the current token and the preceding tokens?
Re more data augmentation, covering those types of sentence-like strings is a goal of ours too. But rather than use those more risky operations, we will end up having humans annotate such cases, as those cases occur naturally anyway in the data that we will have annotated. The middle road is to machine generate and human review for sanity.
The parser has features that look two words forward. I didn't see accuracy benefit from looking further.
There's also a way for the parser to "change its mind", even though the transitions are selected greedily: transitions are allowed to over-rule previous decisions, allowing a form of back-tracking. See here: https://aclweb.org/anthology/D/D15/D15-1162.pdf
Oh, that's great. That's probably roughly how a human does it. :-) I wonder about training it and running it backwards too.
By the way, I realised the parse of for another great trade is fine, I just hadn't realised that displaCy gives a summarised view of an NN if it's not the head of the tree.
Is there any way to disable sentence tokenization, or to just split on whitespace?
Helpful discussion. Just wanted to note that I love spacy and use it for almost everything, but for sentence tokenization in particular, NLTK has been more accurate in every case I've dealt with, and it also allows you to specify abbreviations in your input data that may normally be rare. Sentence tokenization is definitely both something where full accuracy matters a lot, and something where a particular datasource will have very different patterns than other ones because of the nature of abbreviations. The ability to symbolically add rules would be much appreciated.
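The symbolic-rules idea might look something like this naive splitter (the abbreviation list and regex are purely illustrative; NLTK's Punkt tokenizer handles this properly via its abbreviation parameters):

```python
import re

ABBREVS = {"e.g.", "i.e.", "Dr.", "Mr.", "Mrs.", "etc."}

def split_sents(text):
    # Naive rule-based splitter: break after . ! ? followed by whitespace
    # (or end of text), unless the final token is a known abbreviation.
    sents, start = [], 0
    for m in re.finditer(r"[.!?](?:\s+|$)", text):
        chunk = text[start:m.end()].strip()
        if chunk.split()[-1] in ABBREVS:
            continue  # e.g. "Dr." -- don't split here
        sents.append(chunk)
        start = m.end()
    if start < len(text):
        sents.append(text[start:].strip())
    return sents
```

The appeal of this approach is exactly what the comment describes: the abbreviation set can be tuned per data source, since abbreviation patterns vary so much between domains.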
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hi,
I've run across a few cases where spaCy doesn't sentence tokenize as expected: E.g. "The party will finish quite late so we've decided to provide a bus which would take you to the hotel safely." ->
In the few examples I've observed, it always seems to happen around "so". I can see why it makes a split there, but it's otherwise kind of annoying!
Is there any work-around? Thanks.