dmis-lab / BERN2

BERN2: an advanced neural biomedical named entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

nohup_multi_ner.out: list index out of range #54

Closed mjeensung closed 1 year ago

mjeensung commented 1 year ago

An input for the reproduction:

The invention claimed is*., 2 ,.**., 1 ,.* *., 1 ,.1. A fusion protein selected from the group consisting of SEQ ID NO. 148, SEQ ID NO. 150, SEQ ID NO. 194, SEQ ID NO. 196, SEQ ID NO. 198, SEQ ID NO. 200, SEQ ID NO. 202 and SEQ ID NO. 204.., 1 ,.*  *., 1 ,.2. A fusion protein of claim 1 wherein the signal peptide has been removed.., 1 ,.*  *., 1 ,.3. The fusion protein of SEQ ID NO. 148 wherein the signal peptide has been removed, SEQ ID NO. 177 ., 3 ,.QIQKAEQN DVKLAPPTDV RSGYIRLVKN VNYYIDSESI WVDNQEPQIV HFDAVVNLDK GLYVYPEPKR YARSVRQYKI LNCANYHLTQ VRTDFYDEFW GQGLRAAPKK QKKHTLSLTP DTTLYNAAQI ICANYGEAFS VDKKGGTKKA AVSELLQASA PYKADVELCV YSTNETTNCT GGKNGIAADI TTAKGYVKSV TTSNGAITVK GDGTLANMEY ILQATGNAAT GVTWTTTCKG TDASLFPANF CGSVTQ., 4 ,..., 1 ,.*  *., 1 ,.4. The fusion protein of SEQ ID NO. 194 wherein the signal peptide has been removed, SEQ ID NO. 219 ., 3 ,.IQKAEQND VKLAPPTDVR SGYIRLVKNV NYYIDSESIW VDNQEPQIVH FDAVVNLDKG LYVYPEPKRY ARSVRQYKIL NCANYHLTQV RTDFYDEFWG QGLRAAPKKQ KKHTLSLTPD TTLYNAAQII CANYGEAFSV DKKGGTKKAA VSELLQASAP YKADVELCVY STNETTNCTG GKNGIAADIT TAKGYVKSVT TSNGAITVKG DGTLANMEYI LQATGNAATG VTWTTTCKGT DASLFPANFC GSVTQ., 4 ,..., 1 ,.*  *., 1 ,.5. An immunogenic composition comprising the fusion protein of SEQ ID NO. 177 ., 3 ,.QIQKAEQN DVKLAPPTDV RSGYIRLVKN VNYYIDSESI WVDNQEPQIV HFDAVVNLDK GLYVYPEPKR YARSVRQYKI LNCANYHLTQ VRTDFYDEFW GQGLRAAPKK QKKHTLSLTP DTTLYNAAQI ICANYGEAFS VDKKGGTKKA AVSELLQASA PYKADVELCV YSTNETTNCT GGKNGIAADI TTAKGYVKSV TTSNGAITVK GDGTLANMEY ILQATGNAAT GVTWTTTCKG TDASLFPANF CGSVTQ., 4 ,..., 1 ,.*  *., 1 ,.6. An immunogenic composition comprising the fusion protein of SEQ ID NO.219 ., 3 ,.IQKAEQND VKLAPPTDVR SGYIRLVKNV NYYIDSESIW VDNQEPQIVH FDAVVNLDKG LYVYPEPKRY ARSVRQYKIL NCANYHLTQV RTDFYDEFWG QGLRAAPKKQ KKHTLSLTPD TTLYNAAQII CANYGEAFSV DKKGGTKKAA VSELLQASAP YKADVELCVY STNETTNCTG GKNGIAADIT TAKGYVKSVT TSNGAITVKG DGTLANMEYI LQATGNAATG VTWTTTCKGT DASLFPANFC GSVTQ., 4 ,..., 1 ,.* ., 1 ,.

The error message in logs/nohup_multi_ner.out:

Found an error: list index out of range

mjeensung commented 1 year ago

Closed by https://github.com/dmis-lab/BERN2/commit/ce6c9824526372944efe36b9d2c14cbdda96cf70

minstar commented 1 year ago

This error was caused by an NER detokenization issue that occurs when a sentence is longer than the max sequence length (128 in this case).

NER preprocessing truncated such inputs to the max sequence length (lines 302-306 in multi_ner/main.py), which could trigger the error reported above.
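To make the failure mode concrete, here is a minimal sketch of how truncation can lead to this kind of error. It is not the BERN2 code: `truncate`, `fake_predict`, and `align_labels` are hypothetical stand-ins, and the assumption is simply that detokenization walks over the original tokens while the model only produced labels for the truncated ones.

```python
MAX_SEQ_LEN = 128  # the max sequence length mentioned in this thread


def truncate(tokens, max_len=MAX_SEQ_LEN):
    """Hypothetical stand-in for truncation: keep only the first max_len tokens."""
    return tokens[:max_len]


def fake_predict(tokens):
    """Hypothetical model stub: returns one label per token it actually received."""
    return ["O"] * len(tokens)


def align_labels(original_tokens, labels):
    """Naive detokenization that assumes one label exists per original token."""
    return [(tok, labels[i]) for i, tok in enumerate(original_tokens)]


tokens = [f"tok{i}" for i in range(200)]   # longer than MAX_SEQ_LEN
labels = fake_predict(truncate(tokens))    # only 128 labels come back

try:
    align_labels(tokens, labels)           # indexes labels[128] and fails
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)  # prints: list index out of range
```

With 200 input tokens and only 128 labels, the alignment step runs past the end of the label list, which matches the message seen in logs/nohup_multi_ner.out.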

We resolved this by replacing truncation with a sliding window in the preprocessing part (lines 308-414 in multi_ner/main.py) and updating the postprocessing part accordingly (lines 236-238 in multi_ner/ops.py).
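As a rough illustration of the sliding-window idea (not the code from the commit), the sketch below splits a long token list into consecutive windows of at most 128 tokens, runs a predictor on each window, and concatenates the labels so that every original token gets one. The function names and the non-overlapping stride are assumptions made for this example; the actual implementation may use a different stride or merge strategy.

```python
def sliding_windows(tokens, max_len=128):
    """Split tokens into consecutive windows of at most max_len tokens.

    Assumption for this sketch: windows do not overlap (stride == max_len).
    """
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def predict_all(tokens, predict, max_len=128):
    """Run `predict` on each window and concatenate the per-token labels."""
    labels = []
    for window in sliding_windows(tokens, max_len):
        labels.extend(predict(window))
    return labels


def fake_predict(window):
    """Hypothetical model stub: one label per token in the window."""
    return ["O"] * len(window)


tokens = [f"tok{i}" for i in range(300)]
window_sizes = [len(w) for w in sliding_windows(tokens)]
labels = predict_all(tokens, fake_predict)

print(window_sizes)                # [128, 128, 44]
print(len(labels) == len(tokens))  # True
```

Because the label list now has the same length as the token list, a postprocessing step that indexes labels by token position no longer runs out of range.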

If there are any other problems, please let me (@minstar) know and reopen this issue!