ahmetaa / zemberek-nlp

NLP tools for Turkish.

deserialize-ATN #230

Closed BeyzaSuna closed 4 years ago

BeyzaSuna commented 4 years ago

In the tokenization module I get this exception:

    Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with UUID 59627784-3be5-417a-b9eb-8131a7286089 (expected aadb8d7e-aeef-4415-ad2b-8204d6cf042e or a legacy UUID).
        at zemberek.tokenization.TurkishTokenizer.(TurkishTokenizer.java:23) ~[?:?]
        at org.antlr.v4.runtime.atn.ATNDeserializer.deserialize(ATNDeserializer.java:153) ~[?:?]
        at zemberek.tokenization.antlr.TurkishLexer.(TurkishLexer.java:441) ~[?:?]

ahmetaa commented 4 years ago

Do you use zemberek-full.jar, or do you add the dependencies to a Maven file? This looks like an Antlr class-loading conflict.
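One way to check for such a conflict is to ask the JVM where it actually loaded the ANTLR runtime from, and whether more than one copy is visible. The following is a minimal sketch using only the standard library; the class name `ClassOrigin` and its methods are hypothetical helpers, not part of Zemberek, Solr, or ANTLR. It only reports what the class loader that loaded the helper itself can see, so it would have to run inside the same environment (for example, packaged with the Solr plugin) to be meaningful.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper (not part of any library mentioned in this thread).
public class ClassOrigin {

    // Use the loader that loaded this helper; fall back to the system loader.
    private static ClassLoader loader() {
        ClassLoader cl = ClassOrigin.class.getClassLoader();
        return cl != null ? cl : ClassLoader.getSystemClassLoader();
    }

    /** Returns the location the named class is loaded from, or a note if it is unavailable. */
    static String loadedFrom(String className) {
        try {
            // initialize=false: look the class up without running its static initializers
            Class<?> c = Class.forName(className, false, loader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "no code source (JDK class)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found: " + className;
        }
    }

    /** Lists every copy of the class file that this class loader can see. */
    static List<URL> allCopies(String className) {
        String resource = className.replace('.', '/') + ".class";
        try {
            return Collections.list(loader().getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.antlr.v4.runtime.atn.ATN";
        System.out.println("loaded from: " + loadedFrom(name));
        for (URL copy : allCopies(name)) {
            System.out.println("visible copy: " + copy);
        }
    }
}
```

If the reported location is an ANTLR jar older than the one Zemberek was built against, or if two copies are listed, that would be consistent with the UUID mismatch in the exception above.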

ahmetaa commented 4 years ago

I remember you were using Zemberek with Solr. Can you find out which version of Antlr Solr is using? Feel free to write in Turkish if you prefer.

BeyzaSuna commented 4 years ago

Thanks for your feedback. I added the dependencies to a Maven file and use Antlr version 4.7.2. I also tried version 4.7. Both give the same error. What do you mean by 'Solr using Antlr'? I didn't understand that. Does Solr have an Antlr library?

ahmetaa commented 4 years ago

Solr itself probably has an Antlr dependency. Which version of Solr are you using?

BeyzaSuna commented 4 years ago

I use 8.2.0

ahmetaa commented 4 years ago

Solr core seems to use Antlr. Go here and search for Antlr: https://mvnrepository.com/artifact/org.apache.solr/solr-core/8.2.0

It uses version 4.5.1-1

I don't know; there may be an easy solution for this. But I can think of two painful solutions:

1- Compile solr-core and Solr with a newer Antlr version (4.7).
2- Compile Zemberek with Antlr 4.5.1-1 (this is not trivial).

I would try compiling Solr first. Or ask about it in the Solr forums.

Let me know if it does not work.

ahmetaa commented 4 years ago

Another solution is to use the full Zemberek jar. It hides its dependencies, so there should not be a conflict.

BeyzaSuna commented 4 years ago

Thank you for your interest. I will try it and let you know. Sincerely

BeyzaSuna commented 4 years ago

Hello again. I know this is a Solr problem, not a Zemberek one, but I hope you might know the answer. I compiled solr-core with Antlr version 4.7, and now I get this exception:

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core search: Cannot parse lines [[, [P:Punc], ; [P:Punc], : [P:Punc], ! [P:Punc], ? [P:Punc], / [P:Punc], . [P:Punc], ' [P:Punc], " [P:Punc], ( [P:Punc], ) [P:Punc], [ [P:Punc], - [P:Punc], ] [P:Punc], { [P:Punc], } [P:Punc], $ [P:Punc], € [P:Punc], £ [P:Punc], ¥ [P:Punc], ₺ [P:Punc; Pr:tele], % [P:Punc], + [P:Punc], ... [P:Punc], … [P:Punc], ‘ [P:Punc], ’ [P:Punc], ” [P:Punc], “ [P:Punc], » [P:Punc], « [P:Punc], a [P:Interj], ab, aba, aba [P:Adj], abadî, abajur, abaküs, abandone, abanî, abanmak, abanoz, abanoz [P:Adj], abartı, abartmak, abaşo, abaşo [P:Adj], abat [P:Adj; A:NoVoicing], abazan [P:Adj], abd, abdal, abdest [A:NoVoicing], abdestbozan, abdesthane, abdiâciz, abdülleziz, abe [P:Interj], abece, aberasyon, abes [P:Adj], abes [P:Adv], abıhayat [A:NoVoicing], abıkevser, abi, abide [P:Adj], abidevî, abis, abiye, abla, ablak [P:Adj; A:NoVoicing], ablatif, ablatya, ablavut [P:Adj], abli, abluka, abo [P:Interj], abone, abone [P:Adj], abonman, aborda, aborjin, abosa [P:Interj], abra, abrakadabra, abrakadabralamak, abramak, abraş, abraş [P:Adj], absorbe, abstraksiyon, abstraksiyonizm, abstre [P:Adj], absürt [P:Adj; A:NoVoicing], abu [P:Interj], abuhava, abuk [P:Adj], abuk [P:Dup;A:NoVoicing, NoSuffix], abuli, abullabut [P:Adj; A:NoVoicing], abur [P:Dup], abus [P:Adj], acaba, acaba [P:Adv], acar [P:Adj], acayip [P:Adj], accelerando [P:Adv], acele, acele [P:Adj], acele [P:Adv], aceleten [P:Adv], acem, acemaşiran, acemborusu [A:CompoundP3sg; Roots:acem-boru], acembuselik, acemi, acemi [P:Adj], acemkürdî, acente, acep [P:Adv], aceze, acı, acı [P:Adj], acıkara, acıklı [P:Adj], acıkmak, acımak, acımasız [P:Adj], acımasız [P:Adv], acımık, acımtırak [P:Adj; A:NoVoicing], acınmak, acırak [P:Adj;A:NoVoicing], acırga, acibe, acil [P:Adj], acilen [P:Adv], aciliyet [A:NoVoicing], aciz [A:LastVowelDrop], âciz [P:Adj], âciz [P:Adv], âcizane [P:Adv], acube, acul [P:Adj], acun, acur, acuze, acyo, aç [P:Adj], aç [P:Adv], açacak, açar, açelya, açgöz [P:Adj], açgözlü [P:Adj], açı, açık, açık [P:Adj], açık [P:Adv], açıkağız, açıkçası [P:Adv], açıkgöz [P:Adj], açıklama, açıklamak, açıklıkölçer, açıktan [P:Adv], açılamak, açılım, açım, açımlamak, açınım, açınmak, açınsamak, açıortay, açıölçer, açıt [A:NoVoicing], açkı, açkılamak, açma, açmak, açmaz, ad, ad [A:Doubling, InverseHarmony; Index:1], ada, adabımuaşeret [A:NoVoicing], adacyo, adak, adaklamak, adale, adalet [A:NoVoicing], adalî, adam, adamak, adamakıllı [P:Adv], adamcağız, adamcıl [P:Adj], adamı, adamkökü [A:CompoundP3sg; Roots:adam-kök], adamotu [A:CompoundP3sg; Roots:adam-ot], adap, adaptasyon, adapte [P:Adj], adaptör, adaş, adavet [A:NoVoicing], aday, adayavrusu [A:CompoundP3sg; Roots:ada-yavru], addetmek [A:Voicing, Aorist_A], addolmak, adedî [P:Adv], adedimürettep, adem, âdem, âdemelması [A:CompoundP3sg; Roots:âdem-elma], ademimerkeziyet [A:NoVoicing], ademiyet [A:NoVoicing], âdemiyet [A:NoVoicing], âdemoğlu [A:CompoundP3sg; Roots:âdem-oğul], âdemotu, adenit [A:NoVoicing], adese, adet, âdet [A:NoVoicing], âdeta [P:Adv], adetimürettep, adıl, adım, adımlamak, adımsayar, adına [P:Adv], adi [P:Adj], adil [P:Adj], adilane [P:Adv], adisyon, adli [P:Adj], adliye, adrenalin, adres, adreslemek, aerobik, aerodinamik, aerodinamik [P:Adj], aeroloji, aerolojik [P:Adj; A:NoVoicing], af [A:Doubling], afacan [P:Adj], afak [A:NoVoicing], afakan, afaki [P:Adj], afal [P:Adj], afal [P:Dup], afallamak, afat [A:NoVoicing], afazi, aferin, aferin [P:Interj], aferist [A:NoVoicing], afet [A:NoVoicing], afet [P:Adj; A:NoVoicing], afetzede, affetmek [A:Voicing, Aorist_A], affettuoso [P:Adv], affeyleme, affeylemek, affolmak, afi, afif [P:Adj] ..... and so on.

What do you think might be causing this error? I would really appreciate it if you could help.

ahmetaa commented 4 years ago

There seems to be good news and bad news. The good news is that you seem to have gotten past the tokenizer error caused by the Antlr mismatch, because that huge line indicates an error during dictionary loading.

The bad news is that this error is puzzling. "Cannot parse lines" points to the method

    public static RootLexicon load(String... dictionaryLines)

in TurkishDictionaryLoader. However, I was quite sure that this method is not called while loading the default dictionary, because the default dictionary is loaded from a binary protocol buffers file. You can see that file in the morphology module's resources: resources/tr/lexicon.bin.

Is it possible to copy the end of that huge exception here? There should be some useful information there, such as class names and line numbers.

Also, how are you using Zemberek in the code? Can you post some related lines from your code?

Lastly, are you using Windows or another OS?
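When the outer exception message is as long as the one above, the useful part is usually the innermost "Caused by" entry. As a rough sketch (the class `RootCause` and its methods are hypothetical helpers written for illustration, using only the standard library), the cause chain can be walked to its end and only the root exception, a truncated message, and its first few stack frames printed:

```java
// Hypothetical helper for trimming a long exception down to its root cause.
public class RootCause {

    /** Follows getCause() until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Returns the root cause's class, a message cut to 200 characters, and up to maxFrames frames. */
    static String summarize(Throwable t, int maxFrames) {
        Throwable root = rootCause(t);
        StringBuilder sb = new StringBuilder(root.getClass().getName());
        String msg = root.getMessage();
        if (msg != null) {
            // Very long messages (such as a dumped dictionary) are shortened.
            sb.append(": ").append(msg.length() > 200 ? msg.substring(0, 200) + "..." : msg);
        }
        StackTraceElement[] frames = root.getStackTrace();
        for (int i = 0; i < Math.min(maxFrames, frames.length); i++) {
            sb.append(System.lineSeparator()).append("    at ").append(frames[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in exceptions that mimic the nesting seen in this thread.
        Exception inner = new IllegalStateException("Cannot parse lines");
        Exception outer = new RuntimeException("Could not load conf for core", inner);
        System.out.println(summarize(outer, 5));
    }
}
```

The printed frames show which class and line raised the innermost error, which is the information asked for above.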

BeyzaSuna commented 4 years ago

I use Linux, and I use the Zemberek3StemFilterFactory class from the 'lucene-solr-analysis' project: https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek3StemFilterFactory.java

This is the end of the huge exception:

    Caused by: zemberek.morphology.lexicon.LexiconException: Cannot parse lines ..
        at zemberek.morphology.lexicon.tr.TurkishDictionaryLoader.load(TurkishDictionaryLoader.java:98) ~[?:?]
        at zemberek.morphology.lexicon.RootLexicon$Builder.addDictionaryLines(RootLexicon.java:211) ~[?:?]
        at zemberek.morphology.lexicon.RootLexicon.fromLines(RootLexicon.java:138) ~[?:?]
        at zemberek.morphology.TurkishMorphology$Builder.setLexicon(TurkishMorphology.java:285) ~[?:?]
        at org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory.inform(Zemberek3StemFilterFactory.java:92) ~[?:?]

What should I do?

mdakin commented 4 years ago

Unfortunately, this exception message is not very helpful. When I cloned that project, I could run the main method of Zemberek3StemFilterFactory.java with no issues. The project seems to depend on the latest Zemberek, 0.17.1.

@BeyzaSuna How do you use this lucene-solr-analysis-turkish library? As a Maven dependency? Do you have your code open somewhere so we can have a look?

@ahmetaa

When supplied with a parameter it might try to load a different dictionary: https://github.com/iorixxx/lucene-solr-analysis-turkish/blob/master/src/main/java/org/apache/lucene/analysis/tr/Zemberek3StemFilterFactory.java#L91 but I am not sure if this is the case.

Also note that the Solr plugin seems to bundle the old Zemberek (Zemberek2), but I don't think that should cause any issues: https://github.com/iorixxx/lucene-solr-analysis-turkish/tree/master/solr/lib

Also @iorixxx might help us here?
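To see which bundled jars actually carry a given class (for example an old Zemberek or ANTLR class), the jar files in a lib directory can be scanned directly on disk. This is only a sketch under assumptions: `JarScan` is a hypothetical helper, and the directory and class name passed to it are whatever applies to the local Solr setup; nothing here is taken from the plugin itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds the jars under a directory that contain a given class.
public class JarScan {

    static List<Path> jarsContaining(Path libDir, String className) {
        String entry = className.replace('.', '/') + ".class";
        List<Path> jars;
        // Collect all *.jar files below libDir (recursively), in sorted order.
        try (Stream<Path> files = Files.walk(libDir)) {
            jars = files.filter(p -> p.toString().endsWith(".jar"))
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<Path> hits = new ArrayList<>();
        for (Path jar : jars) {
            try (JarFile jf = new JarFile(jar.toFile())) {
                if (jf.getEntry(entry) != null) {
                    hits.add(jar);
                }
            } catch (IOException e) {
                // Unreadable or corrupt jar: skip it in this sketch.
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: JarScan <libDir> <fully.qualified.ClassName>");
            return;
        }
        for (Path hit : jarsContaining(Paths.get(args[0]), args[1])) {
            System.out.println(hit);
        }
    }
}
```

More than one hit for the same class name would suggest that two versions of a library are present side by side.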

BeyzaSuna commented 4 years ago

Thank you very much for your answer. When I run the main method of Zemberek3StemFilterFactory.java, I get no errors either. But with Solr it fails, and I don't know why. I just want to integrate Zemberek into Solr. I didn't make any additions to the 'lucene-solr-analysis' project, so unfortunately I don't have any code to show you. I just deleted the Zemberek2 classes. Then I compiled the Maven project and added the jar files to the Solr home. Then the errors started. I'm sorry for asking you so many questions, but I really want to get this working.

ahmetaa commented 4 years ago

@BeyzaSuna It is ok to ask questions. This seems to be a pesky issue after all. @mdakin probably pinpointed the problematic place. Please try this, if possible:

Temporarily change the method inform in Zemberek3StemFilterFactory like this:

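    @Override
    public void inform(ResourceLoader loader) throws IOException {
        // Temporary diagnostic change: skip any custom dictionary loading
        // and always use Zemberek's default lexicon.
        this.morphology = TurkishMorphology.createWithDefaults();
    }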

Then compile and use the jar again. Let us know about the result.

BeyzaSuna commented 4 years ago

It worked. I'm so glad. This was my first project. Now I'm going to try the other classes. You are great. Thank you very, very much for everything. You have been very helpful. Best regards

ahmetaa commented 4 years ago

No problem. But you will probably want to find the underlying problem at some point in the future. Keep up the good work, and good luck.