curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License
715 stars 73 forks

German POS tagging marks alphabet letters as NOUN #36

Closed nimasTT closed 2 years ago

nimasTT commented 4 years ago

In more than 20% of cases, the German POS tagger marks alphabet letters as NOUN in Twitter text. By comparison, CoreNLP does not make this mistake. I am using the online pre-trained models:

```csharp
public CatalystAnalyzer()
{
    // Download models on demand and cache them on disk.
    Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
}

public List<string> GetNouns(string text, string language)
{
    // Map the language name to Catalyst's Language enum.
    Language lang;
    switch (language.ToLower())
    {
        case "german":
            lang = Language.German;
            break;
        case "english":
            lang = Language.English;
            break;
        case "french":
            lang = Language.French;
            break;
        case "spanish":
            lang = Language.Spanish;
            break;
        default:
            lang = Language.Any;
            break;
    }

    // Reuse a cached pipeline per language. nlpSet is a field not shown in
    // this snippet; it is assumed here to be a Dictionary<Language, Pipeline>.
    if (!nlpSet.TryGetValue(lang, out Pipeline nlp))
    {
        nlp = Pipeline.For(lang);
        nlpSet.Add(lang, nlp);
    }

    var doc = new Catalyst.Document(text, lang);
    nlp.ProcessSingle(doc);

    // Collect every token tagged as NOUN.
    var tokens = doc.Spans.SelectMany(s => s.Tokens);
    var stopwords = new NLPToolsLib.StopWords();
    var aspects = tokens.Where(s => s.POS == PartOfSpeech.NOUN).Select(s => s.Value).ToList();
    List<string> result = new List<string>();

    // Drop stop words from the noun list.
    foreach (var aspect in aspects)
    {
        if (!stopwords.isStopWord(aspect, LanguageDetection.GetLangIsoCode(language)))
            result.Add(aspect);
    }
    return result;
}
```
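
Until the tagger itself is fixed, one possible stopgap is to drop single-letter values from the noun list after tagging. The following is only a rough sketch of that idea, not part of Catalyst: `NounFilter` and `NounsWithoutSingleLetters` are hypothetical names, and plain `(Value, Pos)` tuples stand in for Catalyst tokens so the snippet compiles without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a Catalyst type.
public static class NounFilter
{
    // Keeps the values of tokens tagged "NOUN", skipping any value that
    // consists of exactly one letter (the mis-tagged case reported above).
    public static List<string> NounsWithoutSingleLetters(
        IEnumerable<(string Value, string Pos)> tokens)
    {
        return tokens
            .Where(t => t.Pos == "NOUN")
            .Select(t => t.Value)
            .Where(v => !(v.Length == 1 && char.IsLetter(v[0])))
            .ToList();
    }
}
```

In `GetNouns`, the same length check could be applied to `aspects` before the stop-word loop. This only hides the symptom in the output; the tags themselves stay wrong.
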
theolivenbaum commented 4 years ago

Hi @nimasTT, thanks for the report. I will check what the issue could be with the training data or tokenization rules.

nimasTT commented 4 years ago

Thanks. I also found some failures for Spanish. In noun phrases, many word endings are cut off. It looks like a stemming or lemmatization failure. I am comparing the sentences against http://corenlp.run and would like to switch to Catalyst instead.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.