RNA NER

juanmirocks commented 9 years ago

[x] Train model with GENIA corpus
[x] Test performance (on a test set, or otherwise with cross-validation)
[ ] Double-check that training code for GNormPlus is not available
[ ] Try training with Moara

ashishbaghudana commented 9 years ago

Converting a corpus with GIMLI fails with the following error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:2694)
    at java.lang.String.<init>(String.java:203)
    at com.aliasi.chunk.ChunkingImpl.<init>(ChunkingImpl.java:73)
    at com.aliasi.dict.ExactDictionaryChunker.chunk(ExactDictionaryChunker.java:275)
    at com.aliasi.dict.ExactDictionaryChunker.chunk(ExactDictionaryChunker.java:251)
    at pt.ua.tm.gimli.dictionary.DictionaryMatcher.isStopword(DictionaryMatcher.java:146)
    at pt.ua.tm.gimli.dictionary.DictionaryMatcher.loadDictionaryChunker(DictionaryMatcher.java:121)
    at pt.ua.tm.gimli.dictionary.DictionaryMatcher.<init>(DictionaryMatcher.java:170)
    at pt.ua.tm.gimli.reader.JNLPBAReader.read(JNLPBAReader.java:142)
    at pt.ua.tm.gimli.reader.JNLPBAReader.main(JNLPBAReader.java:425)

I am currently allocating 1024m of heap space on the cluster and still get an out-of-memory error. Any ideas on how to fix this?

juanmirocks commented 9 years ago

1024m = 1G is not much. Start with more than 4G.
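
The heap ceiling is normally raised with the standard JVM option `-Xmx` (for example `java -Xmx4g ...`). On a cluster, wrapper scripts sometimes drop such options, so it can help to confirm what the JVM actually received. A minimal sketch, assuming a hypothetical helper class `HeapCheck` that is not part of GIMLI:

```java
public class HeapCheck {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will try to use,
        // which reflects -Xmx when that option is set.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxBytes) + " MiB");
    }
}
```

Running it with the same options as the conversion job (e.g. `java -Xmx4g HeapCheck`) should print a value close to the requested limit; the exact number can be slightly lower depending on the garbage collector.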

ashishbaghudana commented 9 years ago

[<RNA> Identification Performance]
# of OBJECTs: 118,   ANSWERs: 135.

# (recall / precision / f-score) of ...
FULLY CORRECT answer with class info: 65 (0.5508 / 0.4815 / 0.5138),
correct LEFT boundary with class info: 74 (0.6271 / 0.5481 / 0.5850),
correct RIGHT boundary with class info: 81 (0.6864 / 0.6000 / 0.6403),

ashishbaghudana commented 9 years ago

GNormPlus ships pre-trained on its own corpus and doesn't support training on other corpora. Moreover, it distinguishes between gene names, gene families and protein domains.

The download itself is about 2.3 GB. I can try it out, but is it okay to download?

ashishbaghudana commented 9 years ago

Updated RNA NER:

 [<RNA> Identification Performance]
 # of OBJECTs: 118,  ANSWERs: 115.

 # (recall / precision / f-score) of ...
 FULLY CORRECT answer with class info: 81 (0.6864 / 0.7043 / 0.6953),
 correct LEFT boundary with class info: 84 (0.7119 / 0.7304 / 0.7210),
 correct RIGHT boundary with class info: 89 (0.7542 / 0.7739 / 0.7639),
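
For reference, the three figures in parentheses can be reproduced from the counts in the reports above. This is a sketch under the assumption that OBJECTs is the number of gold entities and ANSWERs the number of predicted entities, which is consistent with the reported values; the class `NerScores` and its method names are illustrative and not taken from the evaluation script.

```java
import java.util.Locale;

public class NerScores {
    // Recall: correct predictions divided by gold entities (OBJECTs).
    static double recall(int correct, int gold) {
        return gold == 0 ? 0.0 : (double) correct / gold;
    }

    // Precision: correct predictions divided by predicted entities (ANSWERs).
    static double precision(int correct, int predicted) {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    // F-score: harmonic mean of precision and recall.
    static double fScore(int correct, int gold, int predicted) {
        double r = recall(correct, gold);
        double p = precision(correct, predicted);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    // Formats the triple in the same order as the report: recall / precision / f-score.
    static String report(int correct, int gold, int predicted) {
        return String.format(Locale.ROOT, "%.4f / %.4f / %.4f",
                recall(correct, gold),
                precision(correct, predicted),
                fScore(correct, gold, predicted));
    }

    public static void main(String[] args) {
        System.out.println(report(65, 118, 135)); // first run, fully correct
        System.out.println(report(81, 118, 115)); // updated run, fully correct
    }
}
```

With the fully correct counts, this gives 0.5508 / 0.4815 / 0.5138 for the first run and 0.6864 / 0.7043 / 0.6953 for the updated one, matching the reports.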

ashishbaghudana / mthesis-ashish

RNA NER #8