Closed by @ines 5 years ago
Hello @ines ,
Native Turkish speaker here :+1: I went over the stopwords list and cleaned up some bad encoding. Here you can see the pull request: https://github.com/explosion/spaCy/pull/1564
As for the tokenizer exceptions, I'll have a closer look tonight. Do you need anything else? In case you need assistance, you can always ask me directly.
I'll make an open-source morphological analyzer soon anyway; if you're interested, it could be integrated into spaCy.
@DuyguA Ah, only saw your comment now (after merging and commenting on your PR). And omg, yes, this would be amazing! If you want to test your morphological analyser with spaCy, a nice way to implement this would probably be to start off with a custom pipeline component. This lets you test it in an isolated environment, and once it works well, it'll be super easy to integrate into the core library.
Some other ideas:
Are the norm exceptions at all relevant in Turkish? For example, are there different spellings of the same words that should be normalised to a common spelling? The norm exceptions don't actually modify the token and only set the `token.norm` attribute, but they will be used as a feature in the statistical model, so words with the same norm will receive similar representations. This could potentially be very powerful.
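To illustrate the idea, here is a rough sketch of what a norm exceptions table looks like in spaCy 2.x (see e.g. `spacy/lang/en/norm_exceptions.py`). The entries below are hypothetical English-style examples, not real Turkish data: each key is a variant spelling, and each value is the canonical norm the token should receive.

```python
# Hypothetical sample entries; a real table would be curated per language.
# The token's text stays unchanged; only token.norm_ is affected.
NORM_EXCEPTIONS = {
    "cos": "because",
    "gonna": "going to",
    "favourite": "favorite",
}
```

In a language's `__init__.py`, this dict gets hooked into the lexical attribute getters, so `token.norm_` returns the normalised form while `token.text` is untouched.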
Lexical attribute getters like `like_num` can also be quite fun to implement, depending on how number words work in the language.
No norm exceptions come to mind immediately, but I'll think again.
Why not... we have number words just like other languages do: forty, thirty, ninety, thirteen, etc. :)
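A Turkish `like_num` getter in the style of spaCy's other languages could be sketched roughly like this (the number-word list below is a small illustrative sample, not a complete one):

```python
# Sample of Turkish number words (not exhaustive).
_num_words = [
    "sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi",
    "sekiz", "dokuz", "on", "yirmi", "otuz", "kırk", "elli",
    "altmış", "yetmiş", "seksen", "doksan", "yüz", "bin", "milyon",
]

def like_num(text):
    # Strip digit group separators and decimal marks first.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions like 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to the number-word list.
    return text.lower() in _num_words
```

In spaCy's layout this would live in the language's `lex_attrs.py` and be registered via `LEX_ATTRS = {LIKE_NUM: like_num}`.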
OK, I decided to do a review and prepare the Turkish linguistic data altogether :100:
The lemmatizer is not an easy task in the Turkish case; it really depends on morphological analysis. I can implement the easy catches first, then move on to more complicated cases.
Oh btw, forgot to add, also in case others come across this issue later and want to help out:
Another nice thing to have would be a few very simple tokenization tests. It doesn't have to be very sophisticated; something simple like this with input vs. expected output would be totally fine. Even if spaCy currently gets it wrong, we can simply `xfail` that test case for now and hopefully make it pass at some point in the future. Tests are always good as a little sanity check, to make sure we don't accidentally break things for a particular language when pushing new updates.
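A minimal sketch of such a test file, using pytest in the style of spaCy's test suite. The `tr_tokenizer` fixture name follows spaCy's convention of per-language tokenizer fixtures in `conftest.py`; here a naive regex tokenizer stands in for the real one so the example is self-contained.

```python
import re

import pytest


@pytest.fixture
def tr_tokenizer():
    # Toy stand-in for spaCy's real Turkish tokenizer fixture: splits on
    # whitespace, separates punctuation, and keeps apostrophe suffixes.
    return lambda text: re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


@pytest.mark.parametrize("text,expected", [
    ("Bu bir cümledir.", ["Bu", "bir", "cümledir", "."]),
    ("Ankara'ya gittim.", ["Ankara'ya", "gittim", "."]),
])
def test_tr_tokenizer_basic(tr_tokenizer, text, expected):
    assert tr_tokenizer(text) == expected


# A case the tokenizer currently gets wrong (abbreviations containing a
# period) is marked as expected-to-fail instead of being deleted:
@pytest.mark.xfail
def test_tr_tokenizer_abbreviations(tr_tokenizer):
    assert tr_tokenizer("Dr. Ali geldi.") == ["Dr.", "Ali", "geldi", "."]
```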
@ines Hello there! For the sake of getting some progress started on Turkish, I have added a Turkish lemmatizer. It is lookup-based, with around 1.3m words. #1672
Hi, I would like to help with the Croatian language. You can contact me; hope to hear from you.
@dejanmarich Thanks, sounds great!
A good place to start would be to proofread the stop words and check that they're all correct (I copied the list from an online resource, but those often include mistakes).
You could also add some tokenizer exceptions for abbreviations and contractions. Does Croatian have contractions (like do + not = `don't` or going + to = `gonna` in English) that the tokenizer should know about? I found this list online, are any of those relevant?
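For the abbreviations, a sketch in spaCy 2.x's tokenizer-exception format might look like this. A plain string stands in for `spacy.symbols.ORTH` so the example runs on its own, and the sample abbreviations would still need a native speaker's review.

```python
# ORTH would normally be imported from spacy.symbols; a plain string key
# keeps this sketch self-contained.
ORTH = "ORTH"

_exc = {}

# A few common Croatian abbreviations that end in a period and should be
# kept as single tokens: "npr." (for example), "itd." (and so on),
# "tj." (that is), "str." (page), "god." (year).
for abbr in ["npr.", "itd.", "tj.", "str.", "god."]:
    _exc[abbr] = [{ORTH: abbr}]

TOKENIZER_EXCEPTIONS = _exc
```

Each value is a list of token dicts, since an exception may also split a string into several tokens (the contraction case).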
You can find more details on the language data components in the "Adding languages" guide. You can also take some inspiration from the other languages.
@ines We have 7 cases in Croatian, so I added more stop words (in different cases). Also added new words.
Hi everyone!
I'd also like to contribute to this amazing library. I have experience in Turkish and English language processing. I don't have much time, but I think I can spend about four hours each week on a regular basis.
I don't have experience contributing to a project like this, so I need a bit of guidance.
I will be glad to hear from you.
Regards.
Hi @ufukhurriyetoglu, thanks a lot for your interest in contributing to spaCy!
This page has a good overview of how spaCy's language data works, and of the individual components: https://spacy.io/usage/adding-languages You can also find inspiration and examples in the other languages' data in `spacy/lang`.
If you're new to spaCy, you might also want to check out the spaCy 101 guide, which explains the most important concepts and components of the library: https://spacy.io/usage/spacy-101
For more details on how to contribute to spaCy and how to set up your local development environment, see our `CONTRIBUTING.md`.
@DuyguA has made some great contributions to Turkish in the past, so maybe she'll also have some ideas for what's still missing, or a nice and easy first project to get you started!
Hi @ines, I will start with the references you sent, and I will start contributing as soon as possible.
Sorry for the late answer to both of the above; the notification email went into spam.
@ufukhurriyetoglu I'm more than happy to help! I'll think of a first task shortly, would that be OK?
Hi @ines, I came across someone who wanted to use spaCy on Romanian text the other day, and I'm a native speaker so I'm happy to help.
I also see that @janimo is making some steady progress, so I don't want to duplicate work, but let me know if you need more hands.
Hi, I'd like to take a look at the stop words for Romanian (I'm a native speaker), as I noticed some duplicates and I think the list can be improved.
Hi @ines, I am interested in helping out with adding support for Romanian to spaCy, as I am going to do that for a project soon. I'm also a native speaker. Let me know how I can help.
Sounds great!
I'm copying over part of my reply from #2580, which explains what's needed to go from alpha language data to a statistical model for a new language.
The process requires the following steps and components:

- A corpus with a tag set that can be mapped to coarse-grained part-of-speech tags like `NOUN`, plus optional morphological features.
- The `spacy convert` command, which takes `.conllu` files and outputs spaCy's JSON format. See here for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
- Running `spacy train` to train a new model.

With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models. In order to train and distribute "official" spaCy models, we need to be able to integrate and reproduce the full training pipeline whenever we release a new version of spaCy that requires new models (so we can't just upload a model trained by someone else).
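For reference, `.conllu` files (the Universal Dependencies format) are tab-separated with ten columns per token: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC. A minimal sketch for the stock Romanian example sentence "Ana are mere." ("Ana has apples."), with the optional columns elided as `_`, looks like:

```
# text = Ana are mere.
1	Ana	Ana	PROPN	_	_	2	nsubj	_	_
2	are	avea	VERB	_	_	0	root	_	_
3	mere	măr	NOUN	_	_	2	obj	_	_
4	.	.	PUNCT	_	_	2	punct	_	_
```

Subtle differences (multi-word token ranges, comment lines, missing columns) are exactly the kind of thing that can trip up conversion, which is why checking the converted output matters.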
I will step in for the Romanian part too, as I am a Romanian native speaker. @ines, @gbrova, @janimo, do you have any preferences for what I should work on? If not, I will take a deeper look at what is requested and try to bring updates.
Merging this with the master thread in #3056!
@ursachi Thanks, I've added a few ideas for contributions to the master thread!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
To make it easier for the spaCy community to contribute to new languages, I've started adding language data skeletons, including the language class setup and the minimum amount of data required to make them available to spaCy. In particular, I've been focussing on adding languages that are available via Universal Dependencies and licensed under CC BY-SA (permitting commercial use). This will allow us, and other spaCy users, to train language models later on.
If you speak any of the languages listed below, feel free to contribute! Unfortunately, I don't speak any of them, so I've only added stop words I found in an open-source collection (which can vary in quality, so it's always useful to proofread them) and a few abbreviations or contractions I found online. But I hope that those are useful to give you an idea of what could be included in the tokenizer exception patterns.
- `spacy/lang/tr` (Turkish)
- `spacy/lang/hr` (Croatian)
- `spacy/lang/ro` (Romanian)
Documentation