explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Is it preferable to train models for NER, POS tagging and dependency parsing tasks on the same dataset? Or can they be trained on different datasets but with the same tagset? Also, could someone please explain how these trained models are then combined inside the package? #9178

Closed · kanayer closed this 3 years ago

kanayer commented 3 years ago

Discussed in https://github.com/explosion/spaCy/discussions/3056

Originally posted by **ines**, December 17, 2018

This thread bundles discussion around **adding pre-trained models for new languages** (and improving the existing language data). A lot of information and discussion has been spread over various different issues (usually specific to the language), which made it more difficult to get an overview. [See here](https://spacy.io/models) for the available pre-trained models, and [this page](https://spacy.io/usage/models#languages) for all languages currently available in spaCy. Languages marked as "alpha support" usually only include tokenization rules and various other rules and [language data](https://spacy.io/usage/adding-languages).

## How to go from alpha support to a pre-trained model

The process requires the following steps and components:

* **Language data:** shipped with spaCy, [see here](https://github.com/explosion/spaCy/tree/master/spacy/lang/). The tokenization should be reliable, and there should be a [tag map](https://spacy.io/usage/adding-languages#tag-map) that maps the tags used in the training data to coarse-grained tags like `NOUN` and optional morphological features.
* **Training corpus:** the model needs to be trained on a suitable corpus, e.g. an existing [Universal Dependencies treebank](https://github.com/UniversalDependencies). Commercial-friendly treebank licenses are always a plus. Data for tagging and parsing is usually easier to find than data for named entity recognition – in the long term, we want to do more data annotation ourselves using [Prodigy](https://prodi.gy), but that's obviously a much bigger project. In the meantime, we have to use other available resources (academic etc.).
* **Data conversion:** spaCy comes with a range of built-in converters via the `spacy convert` command that take `.conllu` files and output spaCy's JSON format. [See here](https://spacy.io/usage/training#spacy-train-cli) for an example of a training pipeline with data conversion. Corpora can have very subtle formatting differences, so it's important to check that they can be converted correctly.
* **Training pipeline:** if we have language data plus a suitable training corpus plus a conversion pipeline, we can run `spacy train` to train a new model (see the sketch below this list). With our new internal model training infrastructure, it's now much easier for us to integrate new pipelines and train new models.

> ⚠️ **Important note:** In order to train and distribute "official" spaCy models, we need to be able to integrate and reproduce the full training pipeline whenever we release a new version of spaCy that requires new models (so we can't just upload a model trained by someone else).
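As a rough illustration of the conversion and training steps above, here is a minimal sketch that drives the CLI from Python via `subprocess`. The file names, the language code `da` and the `--pipeline` flag are assumptions for illustration, and the exact options vary between spaCy versions, so treat this as a starting point rather than a canonical recipe:

```python
# Hedged sketch: convert UD .conllu data and train a tagger + parser.
# Paths, the language code and the flags are illustrative assumptions;
# check `python -m spacy convert --help` / `train --help` for your version.
import subprocess

# Convert Universal Dependencies .conllu files to spaCy's JSON format
for split in ("train", "dev"):
    subprocess.run(
        ["python", "-m", "spacy", "convert", f"da-{split}.conllu", "converted/"],
        check=True,
    )

# Train a tagger + parser (most treebanks don't include NER annotations)
subprocess.run(
    ["python", "-m", "spacy", "train", "da", "models/",
     "converted/da-train.json", "converted/da-dev.json",
     "--pipeline", "tagger,parser"],
    check=True,
)
```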
## Ideas for how to get involved

Contributing to the models isn't always easy, because there are a lot of different things to consider, and a big part of it comes down to sourcing suitable data and running experiments. But here are a few ideas for things that can move us forward:

### 1️⃣ Difficulty: good for beginners

* Proofread and correct the existing language data for [a language of your choice](https://github.com/explosion/spaCy/tree/master/spacy/lang/). There can always be typos or mistakes ported over from a different resource.
* Write tokenizer tests with expected input / output. It's always really helpful to have examples of how things should work, to ensure we don't accidentally introduce regressions. Tests should be "fair" and representative of what's common in general-purpose texts. While edge cases and "tricky" examples can be nice, they shouldn't be the focus of the tests. Otherwise, we won't actually get a realistic picture of what works and what doesn't. See the [English tests](https://github.com/explosion/spaCy/tree/master/spacy/tests/lang/en) for examples.

📖 **Relevant documentation:** [Adding languages](https://spacy.io/usage/adding-languages), [Tokenization](https://spacy.io/usage/linguistic-features#section-tokenization), [Test suite Readme](https://github.com/explosion/spaCy/tree/master/spacy/tests)

### 2️⃣ Difficulty: advanced

* Contribute a noun chunker for the language of your choice. This is a method that extracts base noun phrases from the parse – see the docs [here](https://spacy.io/usage/adding-languages#syntax-iterators).
* Add a [tag map](https://spacy.io/usage/adding-languages#tag-map) for a language and its treebank (e.g. [Universal Dependencies](https://github.com/UniversalDependencies)). The tag map is keyed by the fine-grained part-of-speech tag (`token.tag_`, e.g. `"NNS"`), mapped to the coarse-grained tag (`token.pos_`, e.g. `"NOUN"`) and other morphological features. The tags in the tag map should be the tags used by the treebank.
* Experiment with training a model. Convert the training and development data using `spacy convert` and run `spacy train` to train the model. [See here](https://spacy.io/usage/training#spacy-train-cli) for an example. (Note that most corpora don't come with NER annotations, so you'll usually only be able to train the tagger and parser.) It might work out-of-the-box straight away – or it might require some more formatting and pre-processing. Finding this out will be very helpful. You can share your results and the reproducible commands to use in this thread.
* Prepare a raw text corpus from the [CommonCrawl](http://commoncrawl.org/) or a similar resource for the language you want to work on. Raw unlabelled text can be used to train the word vectors, estimate the unigram probabilities and – coming in `v2.1.0` – pre-train a language model similar to BERT/ELMo/ULMFiT etc. (see #2931). We only need the cleaned, raw text – for example as a `.txt` or `.jsonl` file (see the sketch at the end of this post):

```json
{"text": "This is a paragraph of raw text in some language"}
```

When using other resources, make sure the data license is compatible with spaCy's MIT license and ideally allows commercial use (since many people use spaCy commercially). Examples of suitable licenses are CC, Apache, MIT. Examples of unsuitable licenses are CC BY-NC, CC BY-SA, (A)GPL.

📖 **Relevant documentation:** [Adding languages](https://spacy.io/usage/adding-languages), [Training via the CLI](https://spacy.io/usage/training#spacy-train-cli)

---

If you have questions, feel free to leave a comment here. We'll also be updating this post with more tasks and ideas as we go.

[EDIT, February 2021: since we have the discussions board on GitHub, there is a whole forum on [language support](https://github.com/explosion/spaCy/discussions/categories/language-support-models) where you can create a new thread to discuss language-specific collaborations, issues, progress, etc.]
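As a footnote to the corpus-preparation idea above, here is a small sketch of writing cleaned raw text in the `.jsonl` format shown in the post. The input file `raw_corpus.txt`, the blank-line paragraph splitting and the whitespace-only cleanup are assumptions for illustration; real web-crawl corpora will usually need far more aggressive cleaning:

```python
# Minimal sketch: write one {"text": ...} JSON object per line (.jsonl).
# The input path and the simple cleanup step are illustrative assumptions.
import json
from pathlib import Path

def write_jsonl(paragraphs, out_path):
    with Path(out_path).open("w", encoding="utf8") as f:
        for para in paragraphs:
            para = " ".join(para.split())  # collapse runs of whitespace
            if para:                       # skip empty paragraphs
                f.write(json.dumps({"text": para}, ensure_ascii=False) + "\n")

# Treat blank-line-separated blocks as paragraphs
raw = Path("raw_corpus.txt").read_text(encoding="utf8")
write_jsonl(raw.split("\n\n"), "corpus.jsonl")
```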
polm commented 3 years ago

Not a bug, moving to Discussions.