explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Sentence Tokenization #231

Closed chrisjbryant closed 8 years ago

chrisjbryant commented 8 years ago

Hi,

I've run across a few cases where spaCy doesn't sentence tokenize as expected: E.g. "The party will finish quite late so we've decided to provide a bus which would take you to the hotel safely." ->

  1. The party will finish quite late
  2. so we've decided to provide a bus which would take you to the hotel safely.

In the few examples I've observed, it always seems to happen around "so". I can see why it makes a split there, but it's otherwise kind of annoying!

Is there any work-around? Thanks.

bittlingmayer commented 8 years ago

More generally, I'm wondering where to report these sorts of issues so you can get a proper overview of them, e.g. https://api.spacy.io/displacy/index.html?full=for+another+great+trade

(Screenshot: displaCy parse of "for another great trade".)

honnibal commented 8 years ago

Hmm.

The sentence boundary question is very tricky. On the one hand, I think the annotation is sort of correct! spaCy is trained to handle a variety of text types, including less formal English, where you can't rely on punctuation. I think this is a case where it's hard to provide one model to rule them all.

We could go in a number of different directions here. I'd appreciate your thoughts on which sound appealing to you.

  1. Build in more mode-switching logic, implicitly, based on document features. spaCy would look at the way the whole input text is capitalised, spelled, and punctuated, and use that to condition its expectations about the grammatical structure. With this strategy, it learns that in a text which has no capitalization or punctuation, 'apple' might mean 'Apple'. But in a text with otherwise rigorous orthography, 'apple' cannot mean 'Apple'. This is a little bit ahead of the current research, which is strictly sentence-by-sentence. But this approach is not difficult to do, given the current spaCy architecture. So we'll probably at least try this.
  2. Offer more models, some of which are better suited to less formal text, and others to formal written text. We firmly believe that custom models will help people a lot. But we want them to be good. We don't want you to have to experiment with 5 different models, only to find they all do about the same on your data.
  3. Add some run-time hyper-parameters. This looks simple, and it's what most similar software would do. I am against this. I think most ML and NLP libraries give you too many stupid fucking knobs to twiddle. It also chains the library to minute details. If you describe the knob in general terms, its action is completely opaque. But if you describe exactly what it does, you've tied the API to algorithmic details.
  4. Help you write custom rule logic, to pre- or post-process around the problem. There will always be weird failures, and some of them will be really big, super specific problems. If you're going to give a demo and your stupid system confidently predicts "Belieber" is a Russian politician, you need to fix that. We have this well set up for entities, but not so much for sentence boundaries and the parser. The capabilities are all there in the underlying system: you can constrain the parser to prevent certain decisions; as long as at least one action is available in each state, it's happy. I just need to make this more convenient in the API.
  5. Alternatively, we can help you solve the problem at the modelling level. Probably we can force the model to fit a few ad hoc examples without destroying its behaviour on the rest of the language. The model has millions of parameters, and the training is online. So we could simply call parser.train() a few times on a few sentences that you're failing on.

I suggest 5) as a good approach. It's a little less explicit than 4), but it has the advantage of being a do-once sort of process. You can confine the ad hoc adjustments to a background task, and not have the logic littered around the rest of your application.

On the other hand, it might be a little more work, conceptually. In lines of code the solution would be a similar size, but the user has to understand a little bit more about how the system works to make good use of this approach.
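
For concreteness, a rough sketch of what 5) might look like in user code. This is only a sketch: GoldParse, its keyword arguments, and the exact parser.train() signature are assumptions that vary between spaCy versions, and the head/label values below stand in for whatever your corrected annotation actually says.

from spacy.en import English
from spacy.gold import GoldParse  # name/location assumed; check your version

nlp = English()

# A sentence the parser currently splits at "so", annotated as one sentence.
text = u"The party will finish late so we will provide a bus."
doc = nlp.tokenizer(text)

# Hypothetical gold annotation: each token's head index and dependency label,
# keeping "provide" attached to "finish" so no sentence break is possible.
heads = [1, 3, 3, 3, 3, 8, 8, 8, 3, 10, 8, 3]
deps = ['det', 'nsubj', 'aux', 'ROOT', 'advmod', 'mark', 'nsubj', 'aux',
        'advcl', 'det', 'dobj', 'punct']
gold = GoldParse(doc, heads=heads, deps=deps)

# A few online updates on just this example, as described above.
for _ in range(5):
    doc = nlp.tokenizer(text)
    nlp.parser.train(doc, gold)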

chrisjbryant commented 8 years ago

Wow - thanks for such a detailed response!

I think I'd agree that, in retrospect, this is probably just a case that slips through the net, and there will always be some weird exceptions.

I'm actually working with 2 types of text: potentially ungrammatical texts and then their grammatically corrected versions, so it's no surprise that sometimes the sentence boundaries are different between the noisy/clean data.

I appreciate the offers to help write custom rule logic or add some extra training examples to the model, but, to be honest, I don't think I'd call this a "Russian politician Belieber" situation. I was mainly interested to hear if this was something that could be handled easily.

On further thought, it might just be easiest for me to check whether a given non-terminal Span ends with {.?!} (or similar) and use that to help prune these rare exceptions from the tokenised sentences. While you could add something similar to spaCy, I think you'd then limit your ability to sentence tokenise badly punctuated text, so from your perspective, it's probably not worth doing. If I find some more systematic problems that really do break things, I'll be sure to get in touch though. :)
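
For the record, here's roughly the post-processing I have in mind: a plain-Python sketch that glues a sentence onto the previous one when the previous one doesn't end in sentence-final punctuation. (The Span attribute name is an assumption; sent.text here, though older versions may expose sent.string instead.)

def merge_fragments(doc):
    # Post-process doc.sents: if a "sentence" doesn't end in . ! or ?,
    # assume the split was spurious and attach the next span to it.
    sent_end = (u'.', u'!', u'?')
    sentences = []
    for sent in doc.sents:
        text = sent.text.strip()
        if sentences and not sentences[-1].endswith(sent_end):
            sentences[-1] = sentences[-1] + u' ' + text
        else:
            sentences.append(text)
    return sentences

So for the example above, the two fragments ("The party will finish quite late" and "so we've decided...") would come back out as one string, while properly terminated sentences are left alone.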

P.S. "I think most ML and NLP libraries give you too many stupid fucking knobs to twiddle." made my day. Couldn't agree more; I really love the rationale behind SpaCy!

honnibal commented 8 years ago

If you feel okay about having a little bit of Cython code, you could probably solve this elegantly by replacing the function that tells the parser whether the sentence-break action is valid. Your code will depend on spaCy internals, so you'll have to watch out for future versions breaking your code. But these details have been stable for some time, and updating probably won't be hard.

from spacy.en import English

from spacy.syntax.parser cimport Parser
from spacy.syntax.transition_system cimport Transition

nlp = English()
# Get a C-typed reference to the parser, so we can access its C members
cdef Parser parser = nlp.parser
# Find the transition that applies BREAK
cdef Transition break_transition = parser.moves.lookup_transition('B')
# my_break_is_valid is your own Cython function; it must match the signature
# of the existing is_valid function pointer (see arc_eager.pyx, linked below)
parser.moves.c[break_transition.clas].is_valid = my_break_is_valid

The spacy.syntax.parser.Parser class holds an instance of spacy.syntax.arc_eager.ArcEager, at .moves. This instance controls the transition system: it tells the parser which actions are legal given the current parse state, and knows how to apply them. This is achieved by having an array of structs, with each struct holding function pointers. So you could write a replacement for this function:

https://github.com/honnibal/spaCy/blob/master/spacy/syntax/arc_eager.pyx#L228

(Don't be fooled by the fact that this is on a class --- it's just there to collect the functions together, because loose functions get messy sometimes. It's just a plain old C function.)

Long story short: nlp.parser.moves.c[break_clas].is_valid is just a function pointer. You can write another function and replace it, so long as you match the signature, write the function in Cython, and replace it in Cython. This way you'll get the best parse subject to the constraints you require.

bittlingmayer commented 8 years ago

Here is another one: "Cool! Glows in the dark!"

(Screenshot: displaCy parse of "Cool! Glows in the dark!".)

It's similar because the punctuation is being ignored as a signal, but opposite because it's seeing 1 sentence where there are 2 (i.e. as if the user had added incorrect punctuation) rather than 2 where there is 1.

Why it's hard:

  1. The NSUBJ of "Glows" is missing. (As it happens, I have a large dataset full of English where the NSUBJ is generally only implied ("Will use again!", "Love it!", "Worked great!", "Working great!", "Really recommend it!").)

Why it's easy:

  1. '!' is a strong signal of a sentence boundary, unlike '.'.
  2. "Cool!" presumably occurs in many datasets

I think the long-term solution is to estimate the text quality and make that part of the model.

By the way, it works if I add another space after '!' (i.e. "Cool!" followed by two spaces, then "Glows in the dark!"). (So my short-term solution is to replace '! ' with '! ' + ' ', i.e. two spaces after the '!', before giving sentences to spaCy.)

bittlingmayer commented 8 years ago

Very unexpected result from my short-term solution to add an extra space: while it fixed that one case, in all the other cases (in a sample of 300, where most have "! ") - where it had previously been working correctly - it failed to find the sentence boundary.

Example: They look so real! Wonderful seller to deal with, too! A++++

honnibal commented 8 years ago

Hmm.

I think a trick I'm playing in the training might be hurting more than it helps.

I add a small amount of noise to the training data: on some texts I randomly add or delete punctuation, transform spaces into newlines, etc. The idea was to make the model more robust to these variations. But I never went the extra step of making the noise particularly realistic, so it might have the effect of making the model miss important clues, without learning the underlying signal of what's common to well and poorly punctuated texts.
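
Schematically, the corruption is something like this (a simplified sketch, not the exact code): randomly drop trailing punctuation, strip capitalisation, and occasionally join on newlines instead of spaces.

import random

def add_noise_sketch(text, prob=0.1):
    words = text.split(u' ')
    out = []
    for word in words:
        if random.random() < prob and word.endswith((u'.', u',', u'!')):
            word = word[:-1]       # delete trailing punctuation
        if random.random() < prob:
            word = word.lower()    # strip capitalisation
        out.append(word)
    sep = u'\n' if random.random() < prob else u' '
    return sep.join(out)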

bittlingmayer commented 8 years ago

Data augmentation is always playing with fire. :-) We actually did something rather similar and now I'm wondering what problems it's causing.

I suppose that, because the way people type 'badly' is far from random, randomly lowercasing or uppercasing and removing casing, punctuation, characters and spaces makes sense, but adding case, punctuation, spaces and characters should be done with care.

That said, I very much appreciate the fundamental commitment to handling real-world (i.e. dirty) data, since, as you can see, we have to handle it too.

If you need our sample set, let me know. Also, is the dirtifier somewhere in this repo? Could be a valuable piece of code on its own.

honnibal commented 8 years ago

https://github.com/honnibal/spaCy/blob/master/bin/parser/train.py#L43

I had the idea to do a lot more with this than I have. It'd be nice to be running a service where you could noise the data how you like, and we'd train you a model.

The underlying mechanism here is super flexible. spaCy has a unique feature that no other current NLP system offers. We train on whole documents, and do a Levenshtein alignment of our tokenizer's tokens against the tokens in the gold standard. The labels of misaligned tokens are treated as latent variables in the structured prediction.

You can therefore mutate the underlying text however you like. But if it doesn't align, you don't get to supervise those tokens. This requirement could be relaxed a little --- we could allow you to specify how to infer the gold labels for the corrupted tokens.
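
To illustrate the alignment idea (a toy illustration using difflib, not the actual Levenshtein code spaCy uses): align the two token sequences, and any system token that doesn't line up one-to-one with a gold token simply receives no supervision.

import difflib

def align_tokens(system_tokens, gold_tokens):
    # Map system token index -> gold token index, for one-to-one matches only.
    matcher = difflib.SequenceMatcher(a=system_tokens, b=gold_tokens)
    aligned = {}
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == 'equal':
            for i in range(a1 - a0):
                aligned[a0 + i] = b0 + i
    return aligned

print(align_tokens([u'ca', u"n't", u'wait', u'!'], [u"can't", u'wait', u'!']))
# {2: 1, 3: 2}; the first two system tokens get no gold label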

The CV people have shown that data augmentation can be a powerful technique. Of course, we need to get it right :). At the moment my data augmentation system results in ~0.1 to ~0.2% accuracy improvements, so I've kept it in. But it might be that there's some underlying problem.

bittlingmayer commented 8 years ago

I agree, the fundamental spaCy model is very sensible.

Looking at add_noise + _corrupt, what I observed still does not make sense to me, so I just want to make sure I described the behaviour clearly.

That is, I don't see how add_noise would make "!" + 2 spaces occur mid-sentence.

Also, one other point: I would not replace the original lines with noisy ones, I would only add noisy lines. (Pardon me if my mental compiler interpreted the code incorrectly, but I think it's replacing only, no?)

I definitely think noise deserves a standalone lib. :-) Obviously the proper way to do it is to have real data + parallel human-corrected versions and learn how to convert correct data back into real data.

There's also a question, though, of whether things like subject dropping or multiple exclamation marks are really noise, or whether they're part of the language the way that "don't" or "gotta" is.

honnibal commented 8 years ago

It's hard to say why the model would be making that prediction. Have a look here for an example of stepping through the parser's states:

https://github.com/honnibal/spaCy/blob/master/services/displacy.py#L195

When you get to the critical decision, you might not be able to interrogate the state so well from Python --- you might need to have a Cython module to get at the information you need. The interesting stuff will be in thinc.api.Example.c.features. The sparse dot product of the features and weights is here: https://github.com/honnibal/thinc/blob/master/thinc/model.pyx#L54 . To print or use Python from within a nogil function you have to use the with gil: context manager.
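
For example, the with gil: pattern looks like this (a minimal Cython sketch; double is used for the score type purely for illustration, the real arrays use thinc's own weight type):

cdef void dump_scores(const double* scores, int n) nogil:
    # Printing needs the GIL, so re-acquire it temporarily inside the
    # otherwise nogil function.
    with gil:
        print([scores[i] for i in range(n)])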

Re the replacement of the original lines: we make ~25 training iterations, and we noise the docs from within the loop. So we can expect that all documents will be trained on in their original form at least once, and that they'll never be noised in just the same way twice.

Now, what's noise and what's not is of course subjective :). I'm using the term pretty loosely. Actually I'd like to try transforming statements into questions and exclamations, and also training on noun phrases cut out of larger texts, to make models that are especially suited to conversational contexts.

bittlingmayer commented 8 years ago

I suppose when I have a chance I should try just turning add_noise off and see what it changes.

I would perhaps naively assume that at displacy.py#L195, after tokenisation, when we're stepping through the tokens, the existence of an extra space char is no longer represented.

By the way, can the parser see ahead? Or is the prediction about the current token being final made only based on the current token and the preceding tokens?

Re more data augmentation, covering those types of sentence-like strings is a goal of ours too. But rather than use those more risky operations, we will end up having humans annotate such cases, as those cases occur naturally anyway in the data that we will have annotated. The middle road is to machine generate and human review for sanity.

honnibal commented 8 years ago

The parser has features that look two words forward. I didn't see accuracy benefit from looking further.

There's also a way for the parser to "change its mind", even though the transitions are selected greedily: transitions are allowed to over-rule previous decisions, allowing a form of back-tracking. See here: https://aclweb.org/anthology/D/D15/D15-1162.pdf

bittlingmayer commented 8 years ago

Oh, that's great. That's probably roughly how a human does it. :-) I wonder about training it and running it backwards too.

By the way, I realised the parse of for another great trade is fine, I just hadn't realised that displaCy gives a summarised view of an NN if it's not the head of the tree.

arendu-zz commented 8 years ago

Is there any way to disable sentence tokenization? Or just split on whitespace?

siddsach commented 7 years ago

Helpful discussion. Just wanted to note that I love spaCy and use it for almost everything, but for sentence tokenization in particular, NLTK has been more accurate in every case I've dealt with, and it also allows you to specify abbreviations in your input data that may normally be rare. Sentence tokenization is definitely something where full accuracy matters a lot, and also something where a particular data source will have very different patterns from others because of the nature of abbreviations. The ability to symbolically add rules would be much appreciated.
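
For reference, the NLTK pattern I mean is roughly this (Punkt with a user-supplied abbreviation list; the abbreviations themselves are just examples, given lowercase and without the trailing period):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

punkt_params = PunktParameters()
punkt_params.abbrev_types = set([u'dr', u'no', u'approx'])

tokenizer = PunktSentenceTokenizer(punkt_params)
print(tokenizer.tokenize(u"Dr. Smith ordered approx. 20 units. They arrived today."))
# should give: ['Dr. Smith ordered approx. 20 units.', 'They arrived today.']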

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.