bioinformatics-ua / gimli

Gimli is now part of Neji.
https://github.com/BMDSoftware/neji

problem with parser.launch() #2

Closed Blashyrkh closed 11 years ago

Blashyrkh commented 11 years ago

Hello, I tried the code from the Gimli documentation page (the last example, "on-demand annotation of raw sentences").

I am able to load the model, but I get an error at the line parser.launch() because it cannot find the gdep_gimli file, as shown below:

run:
[INFO] Loading model from file: resources/models/gimli/bc2gm_bw_o2.gz
Exception in thread "main" java.io.IOException: Cannot run program "resources/tools/gdep/gdep_gimli": CreateProcess error=2, Impossibile trovare il file specificato
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    at pt.ua.tm.gimli.external.wrapper.ProcessConnector.create(ProcessConnector.java:64)
    at pt.ua.tm.gimli.external.wrapper.Parser.launch(Parser.java:68)
    at gimli2.Gimli2.main(Gimli2.java:58)
Caused by: java.io.IOException: CreateProcess error=2, Impossibile trovare il file specificato
    at java.lang.ProcessImpl.create(Native Method)
    at java.lang.ProcessImpl.<init>(ProcessImpl.java:189)
    at java.lang.ProcessImpl.start(ProcessImpl.java:133)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1021)
    ... 3 more
Java Result: 1
BUILD SUCCESSFUL (total time: 7 seconds)

[Translation of the Italian message: "The system cannot find the file specified".]

I put the file in the right directory, "resources/tools/gdep/", but the program cannot see it. I'm using NetBeans on Windows.

It's strange, because the models can be loaded but this file cannot, and both are under the same path, "resources/".

Can anyone help me? Thank you!
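
Editorial note: loading a model only requires that the file can be read, whereas parser.launch() asks the operating system to execute a file, so the two can behave differently even under the same folder. The following is a minimal sketch, in plain Java and not part of Gimli, for ruling out a simple path problem first. The class and method names are hypothetical; only the path is taken from the error message above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Gimli: reports how the JVM sees a path
// that is about to be handed to ProcessBuilder.
final class LaunchPathCheckSketch {

    // Resolves the path against the current working directory and describes it.
    static String describe(String relativePath) {
        Path absolute = Paths.get(relativePath).toAbsolutePath().normalize();
        if (!Files.exists(absolute)) {
            return "missing: " + absolute;
        }
        if (!Files.isRegularFile(absolute)) {
            return "not a regular file: " + absolute;
        }
        if (!Files.isExecutable(absolute)) {
            return "present but not executable: " + absolute;
        }
        return "present and executable: " + absolute;
    }

    public static void main(String[] args) {
        // The IDE may start the program in a different working directory
        // than expected, so print it before checking the relative path.
        System.out.println("working directory: " + Paths.get("").toAbsolutePath());
        System.out.println(describe("resources/tools/gdep/gdep_gimli"));
    }
}
```

If the file is reported as present, the path is probably fine and the cause lies in how the platform launches it, which this check cannot diagnose.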

davidcampos commented 11 years ago

Hi, unfortunately the integration of Gimli with GDep does not support Windows yet, since we need to use a BAT file. Such integration is scheduled for the next minor release. I will give you feedback when such integration is implemented.

Best regards, David Campos
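
Editorial note: the idea of using a BAT file on Windows can be illustrated with a small sketch of platform-dependent command selection. This is not Gimli's implementation. The class name, the ".bat" file name and the use of "cmd.exe /c" are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Gimli code: chooses the command line to pass to
// ProcessBuilder depending on the operating system name.
final class GdepCommandSketch {

    static boolean isWindows(String osName) {
        return osName != null && osName.toLowerCase(Locale.ROOT).startsWith("windows");
    }

    // launcherPath is the extension-less path used on Linux and Mac.
    static List<String> commandFor(String osName, String launcherPath) {
        if (isWindows(osName)) {
            // Assumption: a batch wrapper named <launcherPath>.bat sits next to
            // the native launcher. Batch files are run through the command shell.
            String batchFile = launcherPath.replace('/', '\\') + ".bat";
            return Arrays.asList("cmd.exe", "/c", batchFile);
        }
        return Arrays.asList(launcherPath);
    }

    public static void main(String[] args) {
        List<String> command =
                commandFor(System.getProperty("os.name"), "resources/tools/gdep/gdep_gimli");
        System.out.println("Command that would be passed to ProcessBuilder: " + command);
    }
}
```

The resulting list could then be given to new ProcessBuilder(command); the sketch only prints it, so it runs on any platform.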

Blashyrkh commented 11 years ago

Thanks for your answer!

I am doing a project for university, so I can't wait for Windows support. Is there another way to do the pre-processing in a Windows environment? I need to pre-process a raw text document (like the one in the example) and do tokenization, sentence splitting, POS tagging, lemmatization and NER on Windows. Is there a way to do so? Any advice or code examples would be very much appreciated. Thank you again!

davidcampos commented 11 years ago

Hi, I have already started implementing the integration with Windows. However, I'm having some issues that I hope I can solve soon. I'm going to do that as soon as possible. I will update the repository when such integration is available. Code examples and documentation are available at http://bioinformatics.ua.pt/support/gimli/doc/index.html. Please tell me if you need any further help.

Best regards, David Campos

Blashyrkh commented 11 years ago

Thanks, I did read the documentation page, and from what I understood there is no way to parse raw text without GDep. Am I wrong? The corpora in the other examples are already tokenized, and I cannot use, for example, this code with a corpus composed of raw text:

String corpus = "corpus.gz";
String[] models = {"crf1.gz", "crf2.gz"};
String[] features = {"bc1.config", "bc2.config"};
Parsing[] parsing = {Parsing.FW, Parsing.BW};
EntityType entity = EntityType.protein;
String output = "output.txt";

// Load model configurations
ModelConfig[] mc = new ModelConfig[features.length];
for (int i = 0; i < features.length; i++) {
    mc[i] = new ModelConfig(features[i]);
}

// Load corpus
Corpus c = null;
try {
    c = new Corpus(LabelFormat.BIO, entity, corpus);
} catch (GimliException ex) {
    logger.error("There was a problem loading the corpus", ex);
    return;
}

// Load models
CRFModel[] crfmodels = new CRFModel[models.length];
try {
    for (int i = 0; i < models.length; i++) {
        crfmodels[i] = new CRFModel(mc[i], parsing[i], models[i]);
    }
} catch (GimliException ex) {
    logger.error("There was a problem loading the model(s)", ex);
    return;
}

// Annotate corpus
Annotator a = new Annotator(c);
a.annotate(crfmodels);

// Post-processing
Parentheses.processRemoving(c);
Abbreviation.process(c);

davidcampos commented 11 years ago

If you do not want to perform tokenization, you must implement a new Reader that instantiates GDep with the option not to perform tokenization (i.e., just white space tokenization), and adapt the reader to your specific input format. For instance, you can look at the JNLPBAReader code, which performs just white space tokenization. Once you have your reader, you can load the models and annotate the corpus just as you wrote in your code. Cheers,
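
Editorial note: "white space tokenization" here means splitting a sentence on runs of white space and nothing else. The sketch below shows only that idea in plain Java. It is not the JNLPBAReader code and does not use any Gimli class; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Gimli code: splits text that is already tokenized
// (tokens separated by spaces or tabs) without any further segmentation.
final class WhitespaceTokenizerSketch {

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        // "\\s+" treats any run of white space as a single separator.
        for (String piece : sentence.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("IL-2 gene expression requires NF-kappa B ."));
    }
}
```

Punctuation stays attached to a word unless the input already separates it, which is why this approach suits pre-tokenized corpora rather than raw text.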

Blashyrkh commented 11 years ago

OK, thank you very much. Just for information: roughly how much time do you need to solve the GDep problem on Windows?

Best Regards Daniele

davidcampos commented 11 years ago

Hi Daniele, good news! I have just solved the problems, the release supporting Windows will be out soon!

Best Regards, David

Blashyrkh commented 11 years ago

Great job, David! Let me know as soon as you update the tools on the web page. Thanks!

Best Regards Daniele

davidcampos commented 11 years ago

Hi Daniele, I have just released version 1.0.2 of Gimli, which supports running GDep on Windows. Please update the Java library or Maven dependency accordingly. You do not need to update the GDep tool. Please tell me if you need any further help.

Best regards, David

Blashyrkh commented 11 years ago

Thanks! I'll try it as soon as possible.

Just another question: is it possible to integrate an external gazetteer (thesaurus) into the NER step? I need to add some entity names from external sources.

Thanks

davidcampos commented 11 years ago

Hi Daniele, you can do it in two different ways:

Using MALLET's dictionary matching feature generator is faster, but Gimli's approach adds name variants on demand. If you decide to use MALLET's approach, remember to remove Gimli's dictionary matching from your reader.

Cheers, David

Blashyrkh commented 11 years ago

Sorry, but I am a beginner and I don't fully understand. I have a file with entity names that I want to add to Gimli's NER. What should I do? Please explain it step by step. Thank you.

Best Regards Daniele

davidcampos commented 11 years ago

Hi Daniele, sorry, but I do not understand exactly what you want. Do you want to use dictionary matching as features of the ML model, or use the dictionary to recognize one entity type and an ML model for another?

Cheers, David

Blashyrkh commented 11 years ago

Hello David, sorry for not being clear, but as I told you I am a newbie, so I may not be using the right terminology (plus I am Italian and English isn't my first language). What I need is to add a file with a list of other entities that have to be recognized in the NER process.

I used the "on-demand annotation of raw sentences" example, and from what I've seen it annotates the tokens with the PRGE lexicon (using only the letters I, B and O in the annotation, without a suffix such as DNA, Protein and others). I want to add to this example a file for recognizing some entities that are not recognized by PRGE, for example other DNA names or totally different entities. Is that clearer now? Sorry for asking so many questions!

PS: GDep now works flawlessly on Windows, thank you very much ;)

davidcampos commented 11 years ago

Hi Daniele, Gimli was developed with ML-based NER in mind, with all its features focused on obtaining the best possible results. If you want to combine names of various biomedical concepts recognized with ML-based approaches, you can do it automatically using Gimli: please look at the code for the JNLPBA corpus, which combines various models for various biomedical concepts. Based on that code, and using the provided dictionary matching features, you can also combine concepts from ML and dictionaries. However, for that purpose I would recommend another project of mine. Neji (http://bioinformatics.ua.pt/neji/) is a framework and tool to automatically extract dozens of heterogeneous biomedical concepts using the most appropriate and optimized techniques. Thus, you can use both ML models and dictionary matching at the same time. Please go to the website to better understand Neji's advantages and features. If you want to use it, please tell me.

Hope I have helped. Cheers, David

Blashyrkh commented 11 years ago

Hi David, I think I'm starting to understand. The ML models are these: String model = "resources/models/gimli/bc2gm_bw_o2.gz";

So I need a model to create the new crfModel used in the annotate(crfModel) function, right?

Then I need to create a MALLET dictionary from the txt file I have with all the entity names, in order to create a CRF file to use with Gimli and do the annotation. Is that correct?

--- Regarding dictionary matching, I wrote this:

System.out.println("Start Dictionary Matching");
InputStream stopwords = new FileInputStream("stopwords.txt");
InputStream dictionary = new FileInputStream("dictionary.txt");
DictionaryType type = DictionaryType.PRGE;
boolean withVariations = false;

DictionaryMatcher dc = new DictionaryMatcher(stopwords, dictionary, type, withVariations);
dc.match(corpus);
System.out.println("End of Dictionary Matching");

This should do the dictionary matching, but what exactly is this dictionary matching? I provided a dictionary (dog, cat) and a stopwords file (is, not) and gave it the sentence "the cat is not a dog, damn". The output was:

the LEMMA=the POS=DT CHUNK=B-NP NMOD_OF=cat O
cat LEMMA=cat POS=NN CHUNK=I-NP SUB=be NMOD_BY=the LEXICON=CONCEPT O
is LEMMA=be POS=VBZ CHUNK=B-VP O
not LEMMA=not POS=RB CHUNK=O O
a LEMMA=a POS=DT CHUNK=B-NP NMOD_OF=dog O
dog LEMMA=dog POS=NN CHUNK=I-NP NMOD_OF=damn NMOD_BY=a LEXICON=PRGE LEXICON=CONCEPT O
, LEMMA=, POS=, CHUNK=O O
damn LEMMA=damn POS=NN CHUNK=B-NP NMOD_BY=dog O

"dog" and "cat" were not tagged with the letters B or I, but just with the "LEXICON=" feature. Is that right? How can I annotate these names as ANIMALS? Is there a way to do it? This was just an example.

--- About Neji, I'll take a look and talk to my teacher about it. Then I'll let you know whether I'm allowed to use it. Thanks!

Best Regards, Daniele
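
Editorial note: in the output above, each token entry consists of the token, a series of NAME=value features, and a final label (here always O). The following plain-Java sketch reads one such entry, which makes it easy to see that the dictionary match is stored as a LEXICON feature while the label is left untouched. It is not part of Gimli; the class and method names are hypothetical and the layout is inferred only from the output shown in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Gimli code: parses one entry of the form
// "<token> NAME=value ... <label>" as it appears in the output above.
final class TokenLineSketch {
    final String token;
    final Map<String, List<String>> features;
    final String label;

    private TokenLineSketch(String token, Map<String, List<String>> features, String label) {
        this.token = token;
        this.features = features;
        this.label = label;
    }

    static TokenLineSketch parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least a token and a label: " + line);
        }
        Map<String, List<String>> features = new LinkedHashMap<>();
        // The first field is the token and the last one is the label.
        for (int i = 1; i < fields.length - 1; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // not in NAME=value form, ignored in this sketch
            }
            // A feature name such as LEXICON may occur more than once.
            features.computeIfAbsent(fields[i].substring(0, eq), k -> new ArrayList<>())
                    .add(fields[i].substring(eq + 1));
        }
        return new TokenLineSketch(fields[0], features, fields[fields.length - 1]);
    }

    boolean inLexicon(String name) {
        return features.getOrDefault("LEXICON", Collections.emptyList()).contains(name);
    }

    public static void main(String[] args) {
        TokenLineSketch dog = parse(
                "dog LEMMA=dog POS=NN CHUNK=I-NP NMOD_OF=damn NMOD_BY=a LEXICON=PRGE LEXICON=CONCEPT O");
        System.out.println(dog.token + " label=" + dog.label + " lexicons=" + dog.features.get("LEXICON"));
    }
}
```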

Blashyrkh commented 11 years ago

In the meantime I modified the DictionaryMatcher class. I rewrote it to put the name of the dictionary in the "LEXICON=" feature, so it now displays the name of the dictionary from which the entity has been recognized, but I am not able to activate the IOB tags. How can I do that when I do dictionary matching only?

davidcampos commented 11 years ago

Hi Daniele, that dictionary matching utility is used to provide its output as features of the ML model. It was not implemented to provide the resulting annotations as output of Gimli. If you want to combine an ML model with dictionary matching, or use only dictionary matching, I suggest that you use Neji (http://bioinformatics.ua.pt/neji).

Best regards, David
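
Editorial note: the difference between dictionary matches used as features and dictionary matches emitted as annotations can be illustrated with a small stand-alone tagger. The sketch below is plain Java, assumes case-insensitive longest-match lookup over already tokenized text, and is neither Gimli's DictionaryMatcher nor Neji's implementation. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Gimli or Neji code: turns dictionary matches
// directly into B/I/O labels using greedy longest-match lookup.
final class DictionaryBioTaggerSketch {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxLength = 0;

    // Each name may consist of several words separated by white space.
    DictionaryBioTaggerSketch(List<String> names) {
        for (String name : names) {
            String normalized = name.trim().toLowerCase(Locale.ROOT);
            if (normalized.isEmpty()) {
                continue;
            }
            List<String> words = Arrays.asList(normalized.split("\\s+"));
            entries.add(words);
            maxLength = Math.max(maxLength, words.size());
        }
    }

    // Returns one label per token: B starts a match, I continues it, O is outside.
    List<String> tag(List<String> tokens) {
        List<String> lower = new ArrayList<>();
        for (String token : tokens) {
            lower.add(token.toLowerCase(Locale.ROOT));
        }
        List<String> tags = new ArrayList<>();
        int i = 0;
        while (i < lower.size()) {
            int matched = 0;
            // Try the longest candidate first so multi-word names win.
            for (int len = Math.min(maxLength, lower.size() - i); len >= 1; len--) {
                if (entries.contains(lower.subList(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                tags.add("O");
                i++;
            } else {
                tags.add("B");
                for (int k = 1; k < matched; k++) {
                    tags.add("I");
                }
                i += matched;
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        DictionaryBioTaggerSketch tagger = new DictionaryBioTaggerSketch(Arrays.asList("cat", "dog"));
        System.out.println(tagger.tag(Arrays.asList("the", "cat", "is", "not", "a", "dog", ",", "damn")));
    }
}
```

A tagger of this kind has no stopword handling, name variants or disambiguation, which is part of what a dedicated tool adds.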

Blashyrkh commented 11 years ago

Hi David, Gimli is fine for what I need to do, so I don't think I'm going to use Neji, thanks.

Just another question: I'm using Gimli with external raw text, but I see that the parser is very slow. I need to give hundreds of abstracts as input to the parser, one abstract at a time, but it takes minutes for just one of them, which is too slow for me. Is there a way to speed up the process? My PC is not slow: it has an i5 ULV processor and 4 GB of RAM.

This is the code. I use a loop to take each abstract and call sentence.parse(). I need to process one abstract at a time:

while (rs.next()) {
    Corpus corpus = new Corpus(format, entity);
    Sentence sentence = new Sentence(corpus);

    columnValue = rs.getString(1);
    logArea1.append(columnValue + "\n");

    if (!columnValue.isEmpty()) {
        sentence.parse(parser, columnValue);
        corpus.addSentence(sentence);

        // Annotate corpus
        Annotator annotator = new Annotator(corpus);
        annotator.annotate(crfModel);
        // Post-process: remove annotations with an odd number of parentheses
        Parentheses.processRemoving(corpus);
        // Post-process: add abbreviation annotations
        Abbreviation.process(corpus);

        logArea1.append(sentence.toExportFormat());
    }
}

davidcampos commented 11 years ago

Hi Daniele, GDep on Windows is somewhat slower than the Linux and Mac versions, since it does not use hash maps, a problem related to C and the associated dependency libraries. Another way to speed up processing a little is to change the way dictionary matching is performed. By changing the method "sentence.parse(parser, text)", you can remove the dictionary matching algorithm. However, remember to add that same matching to the model pipeline, keeping the same feature format.

Best regards, David
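
Editorial note: before changing the pipeline, it can help to measure which stage of the loop actually dominates (parsing, annotation or post-processing). The following is a minimal timing sketch in plain Java. It is not part of Gimli, and the class name and stage names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Gimli code: accumulates wall-clock time per
// named stage across many iterations of a processing loop.
final class StageTimerSketch {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    // Runs the work and adds its duration to the total for the given stage.
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    Map<String, Long> totals() {
        return totalNanos;
    }

    // One line per stage, in the order stages were first seen.
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            sb.append(e.getKey()).append(": ")
              .append(e.getValue() / 1_000_000).append(" ms\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StageTimerSketch timer = new StageTimerSketch();
        for (int i = 0; i < 3; i++) {
            timer.time("parse", () -> Math.sqrt(12345.0));    // stand-in for the parsing call
            timer.time("annotate", () -> Math.sqrt(54321.0)); // stand-in for the annotation call
        }
        System.out.print(timer.report());
    }
}
```

In the loop above, the stand-ins would be replaced by the real calls, for example the sentence.parse(parser, columnValue) line; if a call throws a checked exception, it needs to be wrapped inside the lambda.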

Blashyrkh commented 11 years ago

Hi David, I solved the issue by using Linux. I have another question for you that doesn't concern Gimli. I need to use the Gimli-annotated documents to create a PASBio pattern to fill templates of the form |verb|arg1|arg2|arg3 etc. Do you know any program capable of doing that, possibly in Java with an API? Thank you very much.

Best Regards, Dan

davidcampos commented 11 years ago

@Blashyrkh Since Gimli uses GDep, which performs chunking and dependency parsing, by accessing the corpus structure, you can infer PASs. However, if you prefer a library that already does almost everything, you probably should explore biomedical event extraction solutions, such as Turku, if I understood the problem correctly. I hope I have helped. Best regards, David

amelsheikh commented 11 years ago

Hello All,

I am trying to find a good example on how to use Gimli to perform biomedical text mining on some external text file. Any help is really appreciated.

Thanks a lot in advance

Blashyrkh commented 11 years ago

Thanks David!!

@amelsheikh I used the examples in the tutorial on the website

davidcampos commented 11 years ago

@amelsheikh Code examples are available at http://bioinformatics.ua.pt/support/gimli/doc/index.html.

TimmyStorms commented 10 years ago

Sorry to reopen this old issue, but I'm having exactly the same problem as described in the initial post. I've tried to run Neji from the root folder with:

java -Xmx1G -Dfile.encoding=UTF-8 -cp target/neji-1.0-SNAPSHOT-jar-with-dependencies.jar pt.ua.tm.neji.cli.Main -i example/corpus/in/ -if RAW -o example/corpus/out/ -of NEJI -d example/dictionaries/ -m example/models/ -t 4

I'm on Windows 7 and I'm using the latest Neji snapshot version.

davidcampos commented 10 years ago

@TimmyStorms Are you using Neji or Gimli distribution?

TimmyStorms commented 10 years ago

@davidcampos I had built a jar file from the source code provided in your neji distribution.

davidcampos commented 10 years ago

@TimmyStorms The problem is that the Neji distribution does not contain the Windows version of GDep. If required, I can arrange a way to generate it. Nevertheless, I would recommend using Linux or Mac OS, since the Windows implementation of GDep is considerably slower than the others, due to the absence of a C library on Windows.