💫 Language data for Turkish, Croatian and Romanian

ines commented 6 years ago

To make it easier for the spaCy community to contribute to new languages, I've started adding language data skeletons, including the language class setup and the minimum amount of data required to make them available to spaCy. In particular, I've been focussing on adding languages that are available via Universal Dependencies and licensed under CC BY-SA (permitting commercial use). This will allow us – and other spaCy users – to train language models later on.

If you speak any of the languages listed below, feel free to contribute! 🎉 Unfortunately, I don't speak any of them, so I've only added stop words I found in an open-source collection (which can vary in quality – so it's always useful to proofread them) and a few abbreviations or contractions I found online. But I hope that those are useful to give you an idea of what could be included in the tokenizer exception patterns.
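To give a rough idea of what such language data can look like, here is a minimal, dependency-free sketch. The entries are illustrative only and are not taken from `spacy/lang/tr`. `ORTH` and `NORM` are plain-string stand-ins for the symbols spaCy provides in `spacy.symbols`, and `check_exceptions` is a hypothetical helper, not part of spaCy.

```python
# Stand-ins for spaCy's ORTH / NORM symbols (assumption: in real language
# data these are imported from spacy.symbols, not defined as strings).
ORTH = "ORTH"
NORM = "NORM"

# Stop words are conventionally kept as one whitespace-separated block,
# which makes the list easy to proofread and extend.
STOP_WORDS = set(
    """
ve ile bir bu için ama de da
""".split()
)

# Each exception maps a string to the list of tokens it should be split into.
# Abbreviations stay a single token, so the final period is not split off.
TOKENIZER_EXCEPTIONS = {
    "Dr.": [{ORTH: "Dr.", NORM: "doktor"}],
    "Prof.": [{ORTH: "Prof.", NORM: "profesör"}],
    "vb.": [{ORTH: "vb.", NORM: "ve benzeri"}],
}


def check_exceptions(exceptions):
    """Hypothetical helper: return keys whose token texts do not join back
    to the original string (the tokenizer must never change the text)."""
    return [
        string
        for string, tokens in exceptions.items()
        if "".join(token[ORTH] for token in tokens) != string
    ]
```

Running `check_exceptions(TOKENIZER_EXCEPTIONS)` on the data above returns an empty list, which is a cheap sanity check to run before opening a pull request.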

Language	Source	UD data license
Turkish	`spacy/lang/tr`	CC BY-NC-SA, CC BY-SA
Croatian	`spacy/lang/hr`	CC BY-SA
Romanian	`spacy/lang/ro`	CC BY-SA

Documentation

Adding languages
spaCy 101 guide for beginners

DuyguA commented 6 years ago

Hello @ines ,

Native Turkish speaker here :+1: I went over the stop words list and cleaned up some badly encoded entries. Here you can see the pull request: https://github.com/explosion/spaCy/pull/1564

As for the tokenizer exceptions, I'll have a closer look tonight. Do you need anything else? In case you need assistance, you can always ask me directly.

I'll be releasing an open-source morphological analyzer soon anyway; if you're interested, it can be integrated into spaCy.

ines commented 6 years ago

@DuyguA Ah, only saw your comment now (after merging and commenting on your PR). And omg, yes, this would be amazing 🎉 If you want to test your morphological analyser with spaCy, a nice way to implement this would probably be to start off with a custom pipeline component. This lets you test it in an isolated environment – and once it works well, it'll be super easy to integrate into the core library.

Some other ideas:

Are the norm exceptions at all relevant in Turkish? For example, are there different spellings of the same words that should be normalised to a common spelling? The norm exceptions don't actually modify the token and only set the token.norm attribute – but they will be used as a feature in the statistical model, so words with the same norm will receive similar representations. This could potentially be very powerful.
Lexical attribute getters like like_num can also be quite fun to implement – depending on how number words work in the language.
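As a rough illustration of the `like_num` idea for Turkish, here is a sketch that follows the general shape of the getters in other languages' data. The word list is illustrative and not exhaustive, and `turkish_lower` is a hypothetical helper added because Python's default `str.lower()` does not apply the Turkish dotted/dotless "i" rules. Wiring the function into spaCy's lexical attributes is left out.

```python
# Illustrative Turkish number words (assumption: not a complete list).
_num_words = {
    "sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz",
    "dokuz", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş",
    "seksen", "doksan", "yüz", "bin", "milyon", "milyar",
}


def turkish_lower(text):
    """Hypothetical helper: lower-case using Turkish casing rules.

    U+0130 (dotted capital I) maps to "i", and plain "I" maps to
    U+0131 (dotless i), before the generic lower() handles the rest.
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def like_num(text):
    """Return True if the string looks like a number."""
    # Drop a leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return turkish_lower(text) in _num_words
```

The explicit casing step matters for upper-case input: without it, "KIRK" would lower-case to "kirk" rather than "kırk" and would not be recognised.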

DuyguA commented 6 years ago

No norm exceptions come to mind immediately, but I'll think about it again.
As for like_num, why not... we have as many number words as other languages have: forty, thirty, ninety, thirteen etc. :)

OK, I decided to review and prepare the Turkish linguistic data altogether :100:

Lemmatization is not an easy task in the Turkish case, since it depends on morphological analysis. I can implement the easy cases first, then move on to the more complicated ones.

ines commented 6 years ago

Oh btw, forgot to add – also in case others come across this issue later and want to help out:

Another nice thing to have would be a few very simple tokenization tests. It doesn't have to be very sophisticated – something simple like this with input vs. expected output would be totally fine. Even if spaCy currently gets it wrong, we can simply xfail that test case for now and hopefully make it pass at some point in the future. Tests are always good as a little sanity check, to make sure we don't accidentally break things for a particular language when pushing new updates.
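A minimal sketch of the "input vs. expected output" idea, using only the standard library. spaCy's own tests are written with pytest and its xfail marker; here `unittest.expectedFailure` plays the same role. The `tokenize` function is a regex stand-in and not spaCy's tokenizer, and the test cases are made up for illustration.

```python
import re
import unittest


def tokenize(text):
    """Stand-in tokenizer (NOT spaCy's): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


class TestTurkishTokenizer(unittest.TestCase):
    def test_simple_sentence(self):
        # Input on the left, expected tokens on the right.
        self.assertEqual(tokenize("Merhaba dünya!"), ["Merhaba", "dünya", "!"])

    @unittest.expectedFailure
    def test_abbreviation_keeps_period(self):
        # Known failure: the stand-in splits "Dr." into "Dr" and ".".
        # Marking it as expected keeps the suite green until a tokenizer
        # exception for the abbreviation exists.
        self.assertEqual(
            tokenize("Dr. Ayşe geldi."), ["Dr.", "Ayşe", "geldi", "."]
        )
```

Once the abbreviation is handled, the second test starts passing unexpectedly, which is the signal to remove the marker.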

cbilgili commented 6 years ago

@ines Hello there, for the sake of getting things started for Turkish, I have adopted a Turkish lemmatizer. It is lookup-based, with around 1.3M words. See #1672

dejanmarich commented 6 years ago

Hi, I would like to help with the Croatian language. You can contact me; hope to hear from you.

ines commented 6 years ago

@dejanmarich Thanks, sounds great! 👍

A good place to start would be to proofread the stop words and check if they're all correct (I copied the list from an online resource – but those often include mistakes).

You could also add some tokenizer exceptions for abbreviations and contractions. Does Croatian have contractions (like do + not = don't or going + to = gonna in English) that the tokenizer should know about? I found this list online – are any of those relevant?

You can find more details on the language data components in the "Adding languages" guide. You can also take some inspiration from the other languages.

dejanmarich commented 6 years ago

@ines We have 7 cases in Croatian, so I added more stop words (in different cases). Also added new words.

ufukhurriyetoglu commented 6 years ago

Hi everyone !

I would also like to contribute to this amazing library. I have experience in Turkish and English language processing. I don't have much time, but I think I can spend 4 hours each week on a regular basis.

I don't have experience contributing to a project like this, so I need a bit of guidance.

I will be glad to hear from you.

Regards.

ines commented 6 years ago

Hi @ufukhurriyetoglu – thanks a lot for your interest in contributing to spaCy!

This page has a good overview of how spaCy's language data works, and the individual components: https://spacy.io/usage/adding-languages You can find inspiration and examples of the other language data in spacy/lang.

If you're new to spaCy, you might also want to check out the spaCy 101 guide, which explains the most important concepts and components of the library: https://spacy.io/usage/spacy-101

For more details on how to contribute to spaCy, and how to set up your local development environment, see our CONTRIBUTING.md.

@DuyguA has made some great contributions to Turkish in the past, so maybe she'll also have some ideas for what's still missing, or a nice and easy first project to get you started!

ufukhurriyetoglu commented 6 years ago

Hi @ines, I will start with the references you sent, and I will start contributing as soon as possible.

DuyguA commented 6 years ago

Sorry for the late reply to both of you, the notification email went into spam.

@ufukhurriyetoglu I'm more than happy to help! I'll think of a first task quickly, would that be OK?

gbrova commented 6 years ago

Hi @ines, I came across someone who wanted to use spaCy on Romanian text the other day, and I'm a native speaker so I'm happy to help.

I also see that @janimo is making some steady progress, so I don't want to duplicate work, but let me know if you need more hands.

vencaslac commented 6 years ago

hi i'd like to take a look at the stop words for Romanian (am a native speaker) as i noticed some duplicates and i think the list can be enhanced
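One way to spot duplicates in a whitespace-separated stop word block is sketched below. The sample words and the `find_duplicates` helper are hypothetical and not taken from `spacy/lang/ro`. The normalisation step is an assumption about one likely source of near-duplicates: Romanian text often mixes the legacy cedilla letters (ş, ţ) with the correct comma-below letters (ș, ț).

```python
from collections import Counter

# Map legacy cedilla letters to the comma-below forms used in Romanian.
_CEDILLA_TO_COMMA = str.maketrans(
    {
        "\u015f": "\u0219",  # ş -> ș
        "\u015e": "\u0218",  # Ş -> Ș
        "\u0163": "\u021b",  # ţ -> ț
        "\u0162": "\u021a",  # Ţ -> Ț
    }
)


def find_duplicates(raw_words):
    """Return the sorted words that occur more than once after normalisation."""
    words = raw_words.translate(_CEDILLA_TO_COMMA).split()
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n > 1)


# Made-up sample: "dar" appears twice, and "și" appears once with a
# comma-below letter and once with a cedilla letter.
SAMPLE = """
\u0219i sau dar \u015fi pentru cu dar
"""
```

Because stop words end up in a Python set, duplicates are harmless at runtime, but removing them makes the list easier to review.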

ju0gri commented 5 years ago

Hi @ines, I am interested in helping out with adding support for Romanian to spaCy, as I am going to do that for a project soon. I'm also a native speaker. Let me know how I can help.

ines commented 5 years ago

Sounds great 👍

I'm copying over part of my reply from #2580, which explains what's needed to go from alpha language data to a statistical model for a new language.

The process requires the following steps and components:

language data: shipped with spaCy, see here for Romanian. The tokenization should be reliable, and there should be a tag map that maps the tags used in the training data to coarse-grained tags like NOUN and optional morphological features.
training corpus: the model needs to be trained on a suitable corpus, e.g. an existing Universal Dependencies treebank. Commercial-friendly treebank licenses are always a plus. Data for tagging and parsing is usually easier to find than data for named entity recognition – in the long term, we want to do more data annotation ourselves using Prodigy, but that's obviously a much bigger project. In the meantime, we have to use other available resources (academic etc.).
data conversion: spaCy comes with a range of built-in converters via the spacy convert command that take .conllu files and output spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
training pipeline: if we have language data plus a suitable training corpus plus a conversion pipeline, we can run spacy train to train a new model.
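To make the tag map requirement more concrete, here is a small sketch that collects the fine-grained tags from CoNLL-U data and reports any that the tag map does not cover. The tags are English Penn Treebank tags used purely for illustration, since the Romanian treebank's tag inventory isn't listed here. The `"pos"` key is a plain-string stand-in for spaCy's `POS` symbol, and both helper functions are hypothetical.

```python
# Illustrative tag map: fine-grained tag -> coarse-grained tag plus
# optional morphological features.
TAG_MAP = {
    "NN": {"pos": "NOUN", "Number": "sing"},
    "NNS": {"pos": "NOUN", "Number": "plur"},
    "VBD": {"pos": "VERB", "Tense": "past"},
}

# CoNLL-U token lines have 10 tab-separated columns:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = "\n".join(
    [
        "# text = Dogs barked loudly",
        "\t".join(["1", "Dogs", "dog", "NOUN", "NNS", "_", "2", "nsubj", "_", "_"]),
        "\t".join(["2", "barked", "bark", "VERB", "VBD", "_", "0", "root", "_", "_"]),
        "\t".join(["3", "loudly", "loudly", "ADV", "RB", "_", "2", "advmod", "_", "_"]),
        "",
    ]
)


def xpos_tags(conllu_text):
    """Collect the set of fine-grained (XPOS) tags used in CoNLL-U text."""
    tags = set()
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 columns, got %d: %r" % (len(cols), line))
        if cols[4] != "_":
            tags.add(cols[4])
    return tags


def unmapped_tags(conllu_text, tag_map):
    """Return the sorted tags that appear in the data but not in the tag map."""
    return sorted(xpos_tags(conllu_text) - set(tag_map))
```

For the sample above, `unmapped_tags(SAMPLE, TAG_MAP)` reports `RB`, the kind of gap worth closing before converting a corpus and training on it.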

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models. In order to train and distribute "official" spaCy models, we need to be able to integrate and reproduce the full training pipeline whenever we release a new version of spaCy that requires new models (so we can't just upload a model trained by someone else).

ursachi commented 5 years ago

I will step in for the Romanian part too as I am a Romanian native speaker. @ines, @gbrova, @janimo - do you have any preferences for me to work on? If not, then I will take a deeper look at what is requested and try to bring updates.

ines commented 5 years ago

Merging this with the master thread in #3056!

@ursachi Thanks – I've added a few ideas for contributions to the master thread 🙂

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

💫 Language data for Turkish, Croatian and Romanian #1490