bioinformatics-ua / gimli

Gimli is now part of Neji.
https://github.com/BMDSoftware/neji

NullPointerException on `annotate JNLPBA` w/o document IDs in data #3

Open spyysalo opened 11 years ago

spyysalo commented 11 years ago

When running

./gimli.sh annotate JNLPBA -c JNLPBA-test.gz -o output.txt -m protein.gz,protein,fw,jnlpba_protein.config

after creating JNLPBA-test.gz and protein.gz from the JNLPBA data distribution files Genia4ERtask1.iob2 and Genia4EReval1.raw (resp.), gimli 1.0.2 crashes on

Exception in thread "main" java.lang.NullPointerException
    at pt.ua.tm.gimli.writer.JNLPBAWriter.write(JNLPBAWriter.java:92)
    at pt.ua.tm.gimli.writer.JNLPBAWriter.main(JNLPBAWriter.java:409)

i.e. the last line of

            for (int i = 0; i < corpus.size(); i++) {
                s = corpus.getSentence(i);
                medline = s.getId();

                if (!medline.equals(lastmedline)) {

If Genia4ERtask2.iob2 and Genia4EReval2.raw are used instead of Genia4ERtask1.iob2 and Genia4EReval1.raw, the system works as expected, further indicating that the issue is the absence of ID lines of the form ###MEDLINE:95385995. (I did not check the effect of switching only one of the two files.)

At a minimum, the system should fail gracefully if medline IDs are not included, and preferably work normally without them. This is a valid variant of the JNLPBA format, and comparable ids are not available for all inputs.
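As a rough illustration of what tolerating missing IDs could look like, here is a minimal sketch that detects document boundaries with a null-safe comparison. It is not Gimli's actual code: the class `DocumentBoundarySketch` and the method `documentStarts` are hypothetical, and a plain list of strings stands in for the values returned by `corpus.getSentence(i).getId()`, which are assumed to be null when the input has no ID lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch, not part of Gimli.
class DocumentBoundarySketch {

    // Returns the indices of the sentences that start a new document.
    // Each entry of ids stands in for s.getId() and may be null.
    static List<Integer> documentStarts(List<String> ids) {
        List<Integer> starts = new ArrayList<>();
        String lastId = null;
        for (int i = 0; i < ids.size(); i++) {
            String id = ids.get(i);
            // Objects.equals handles null on either side,
            // whereas id.equals(lastId) throws when id is null.
            if (i == 0 || !Objects.equals(id, lastId)) {
                starts.add(i);
            }
            lastId = id;
        }
        return starts;
    }

    public static void main(String[] args) {
        // Input without any IDs: everything is treated as one document.
        System.out.println(documentStarts(Arrays.asList(null, null, null)));
        // Input with IDs: a new document starts whenever the ID changes.
        System.out.println(documentStarts(Arrays.asList("95385995", "95385995", "95238477")));
    }
}
```

With this approach, input lacking IDs is simply handled as a single document rather than causing a crash.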

davidcampos commented 11 years ago

Hi Sampo, thanks for your help. You are correct, Gimli fails because it does not find the MEDLINE identifier. However, it should not fail when no MEDLINE identifier is provided, since that format is also valid.

That problem will be corrected in the next version of Gimli (milestone 1.0.3).

Best regards, David

spyysalo commented 11 years ago

@davidcampos: Good to know! If I may make an additional suggestion, it would be helpful for applying the tagger to resources other than PubMed if the matching of ID lines was relaxed from requiring "MEDLINE" to something more generic, like ^###[^ ]+$, to allow e.g. ###PMC:1234567-sec-01-Introduction.
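To make the suggestion concrete, here is a small sketch of the relaxed matching using `java.util.regex`. The class `IdLineSketch` and the method `extractId` are hypothetical names used only for illustration and do not reflect how Gimli's reader is actually structured; the pattern is the one proposed above.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Gimli.
class IdLineSketch {

    // "###" followed by one or more non-space characters, spanning the whole line.
    static final Pattern ID_LINE = Pattern.compile("^###[^ ]+$");

    // Returns the identifier after the "###" prefix, or null if the line
    // is not an ID line (for example, a token line or an empty line).
    static String extractId(String line) {
        if (line != null && ID_LINE.matcher(line).matches()) {
            return line.substring(3);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("###MEDLINE:95385995"));
        System.out.println(extractId("###PMC:1234567-sec-01-Introduction"));
        System.out.println(extractId("IL-2\tB-DNA"));
    }
}
```

Under this pattern, both the existing MEDLINE lines and the PMC-style example are accepted, while ordinary token lines are not.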

davidcampos commented 11 years ago

@spyysalo Thanks for the suggestion. Gimli will allow that in the next version.

jasonsu123 commented 8 years ago

Hello, I have a similar problem running Gimli. I use Cygwin on Windows 7 to provide a Linux-like environment. The commands mvn -v and javac -version run normally. I followed the tutorial at http://bioinformatics.ua.pt/support/gimli/doc/index.html. When I enter the command

./gimli.sh convert JNLPBA -c Genia4ERtask2.iob2 -e protein -g gdep.gz -o corpus.gz

it outputs an error saying that it cannot find or import pt.ua.tm.gimli.reader.JNLPBAReader.

What is the problem?

Thank you