explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Feature request: Language detection #1172

Closed mollerhoj closed 6 years ago

mollerhoj commented 7 years ago

Are there any plans to add language detection to spaCy? If the goal of the project is to be 'the Ruby on Rails of NLP' then I think it would make sense to include this feature.

(Just imagine how magical it will be - we don't even need to call spacy.load(...), this can be done lazily after the language has been detected ;-). Ok, maybe not a good idea, but a useful feature nonetheless)

A quick Google search has led me to believe the approach taken by the cld2 project is state-of-the-art in this field. Essentially, it uses naive Bayes on quadgrams.

I would like to hear your thoughts on implementing this feature. Would you want to use existing libraries, or should I try to train a classifier from scratch?

I'm about to develop a project where I need high accuracy on language detection for a corpus that has Danish and English sentences mixed together. So I need this feature, and I would love not to have to rely on external libraries.

honnibal commented 7 years ago

This is planned, yes :).

The simple solution can be derived from the .prob values, which give the unigram log probabilities. If each language has these set in the vocab, you should be able to do:

# 'languages' is a list of loaded nlp objects, one per candidate language.
docs = [nlp(text) for nlp in languages]
probs = [(sum(tok.prob for tok in doc), doc) for doc in docs]
# Pick the doc whose summed unigram log probability is highest.
prob, doc = max(probs, key=lambda pair: pair[0])

This selects the language under which the unigram probabilities are maximised --- i.e. it's unigram Naive Bayes. It's also important to include a prior on the languages. It's very useful to have this prior adapt to context. For instance, if you're processing a sequence of texts, you probably want to let the language prediction of the previous text influence the language prediction of the next text.
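
As a rough, runnable sketch of that idea (the model shortcuts and the prior values below are just placeholders, not recommendations):

import math
import spacy

# One loaded pipeline per candidate language; adjust to whatever models you have installed.
models = {'en': spacy.load('en'), 'de': spacy.load('de'), 'es': spacy.load('es')}
# Illustrative log-prior over languages; in practice you would adapt it to context,
# e.g. re-estimate it from the languages predicted for the previous texts.
log_prior = {'en': math.log(0.5), 'de': math.log(0.25), 'es': math.log(0.25)}

def guess_language(text):
    scores = {}
    for lang, nlp in models.items():
        doc = nlp(text)
        # tok.prob is a unigram log probability, so summing corresponds to
        # multiplying the unigram probabilities (Naive Bayes).
        scores[lang] = log_prior[lang] + sum(tok.prob for tok in doc)
    return max(scores, key=scores.get)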

Depending on the application, it might also be important to pay attention to pre-processing decisions. For instance, let's say you have a Faroese sample with poor HTML cleaning. Then when you process noisy text, suddenly you're tagging it all as Faroese!

In short: improved language detection usually flows from making smart decisions specific to your application. Building more complicated language models is usually counter-productive, because it increases the risk that your language model will unexpectedly produce a very confident decision, overwhelming your contextual priors.

mollerhoj commented 7 years ago

Thank you honnibal, that's a nice little hack - and definitely enough for my use case!

bittlingmayer commented 7 years ago

Lang ID is non-trivial: the existing libraries are not great and, as @honnibal said, it's very application-specific.

The list of probabilities that he suggested is great, and much better than a set of probabilities that add up to one or less, because often the question is whether the probability for a given language is greater than some threshold. For one, it deals with the case where the text is in none of the languages in the list.

But averaging or summing token probabilities has inherent drawbacks. (For comparing e.g. against a threshold you will want to average by token count - see the sketch at the end of this comment.)

  1. Most lang ID approaches use char-level probabilities for a reason: space and performance, but also dealing with out-of-vocabulary tokens.

  2. Parsing information is key. 'Conoce Jim el Google Cloud Platform en Python o JavaScript?' is Spanish, but 'You know Hoy Tengo Ganas De Ti by Miguel Gallardo?' is English.

It sounds like your use case could involve a lot of case 2; in any case, many non-English texts contain a lot of English. But no major lang ID lib uses parsing as an input, to my knowledge.

A simple hack could be to remove entities or boost stop words. Parse probability would be useful too.
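
As a sketch of the length-normalisation point above (the threshold value is purely illustrative and would need tuning per application):

def avg_logprob(doc):
    # Average unigram log probability per token, so texts of different
    # lengths are comparable.
    return sum(tok.prob for tok in doc) / max(len(doc), 1)

def plausibly_in_language(doc, threshold=-12.0):
    # Accept the language only if the averaged score clears the threshold;
    # this also covers the "none of the candidate languages" case.
    return avg_logprob(doc) >= threshold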

tiru1930 commented 6 years ago

Can you please provide the exact code for this use case? I am trying to load each language and find the probabilities, which is both time- and memory-consuming. Can you please help me?

honnibal commented 6 years ago

@tiru1930 At the moment I would recommend using an external language identification package, unfortunately. We still really want to provide this, but we don't have it yet.

tiru1930 commented 6 years ago

@honnibal thank you , will check on this

diegow88 commented 6 years ago

@honnibal it worked for the English, Spanish and German models, but I couldn't make it work for the French model. I always get zeros - apparently prob returns zero for every word. I've tried on Python 2.7 and 3.5. I am running spaCy v1.9.

Thanks!

ines commented 6 years ago

Update: This might be a good use case for the new custom pipeline components in spaCy v2.0! https://spacy.io/usage/processing-pipelines#custom-components
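
For example, a bare-bones component in v2.0 could look something like this (detect_language is a stand-in for whatever detection logic you plug in):

import spacy
from spacy.tokens import Doc

# Register a custom attribute to hold the language guess.
Doc.set_extension('language', default=None)

def language_detector(doc):
    # detect_language is a hypothetical helper - plug in any detector here.
    doc._.language = detect_language(doc.text)
    return doc

nlp = spacy.load('en')
nlp.add_pipe(language_detector, name='language_detector', first=True)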

bittlingmayer commented 6 years ago

Just to follow up on earlier comments about the drawbacks of simple averaging: one old-school approach is to use the stop words only.
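
Something like this, for instance (assuming your spaCy version ships stop word lists for the languages you care about):

from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP
from spacy.lang.da.stop_words import STOP_WORDS as DA_STOP

STOP_LISTS = {'en': EN_STOP, 'da': DA_STOP}

def stopword_guess(words):
    # 'words' is a list of lower-cased token strings, e.g. text.lower().split().
    # Score each candidate language by how many tokens appear in its stop
    # word list, and return the best-scoring language.
    counts = {lang: sum(w in stops for w in words)
              for lang, stops in STOP_LISTS.items()}
    return max(counts, key=counts.get)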

SandeepNaidu commented 6 years ago

Wouldn't loading all language models and iterating through them bloat the memory of the process? We already have #1600 for v2, which requires some coding effort and additional processing time to break up the document (paragraph-wise) and do the analysis.

nickdavidhaynes commented 6 years ago

Following up here - I wanted to play around with building an extension, so I put together a little pipeline component that integrates the CLD project (https://github.com/nickdavidhaynes/spacy-cld). Since its tied to the NLP pipeline, it won't work as magically as @mollerhoj originally envisoned. But if you need "good enough" language detection as a part of your processing pipeline, this should be relatively easy to incorporate.
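
Usage is roughly as follows (going from the README; check the repo for the exact extension attributes):

import spacy
from spacy_cld import LanguageDetector

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector())

doc = nlp('This is some English text.')
doc._.languages        # e.g. ['en']
doc._.language_scores  # e.g. {'en': 0.96}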

cc @ines

clesleycode commented 6 years ago

@nickdavidhaynes is this being incorporated into spaCy?

nickdavidhaynes commented 6 years ago

@lesley2958 Not as far as I know. It's fairly simple to use in conjunction with spaCy (although let me know if the README isn't clear), but there aren't any plans to bring that package in particular directly into the main spaCy codebase.

ines commented 6 years ago

@lesley2958 One of the main reasons we've decided to open up the processing pipeline API in v2.0 is to make it easier to implement features like this as plugins – like @nickdavidhaynes' package for example. Users who want to add those features to their pipeline can do so easily by installing the plugin and adding it via nlp.add_pipe. Developers who prefer a different approach, or integrating a different library or model can do so by writing their own plugin, without having to worry about the core library.

We also prefer features that we ship with spaCy to be self-contained within the library, instead of adding more third-party dependencies. We might want to add a language detection model to spaCy in the future – but if we do so, it will be its own implementation. In the meantime, we think the plugin ecosystem is a good solution to allow users to add any features they like using any other library – no matter how specific they are.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.