cboulanger / excite-docker

Docker image with tools for the annotation of ML training docs for reference extraction based on the EXparser tools
https://cboulanger.github.io/excite-docker
GNU General Public License v3.0

IndexError: string index out of range during segmentation #8

Open cboulanger opened 2 years ago

cboulanger commented 2 years ago
  File "/app/run-main.py", line 174, in <module>
    call_segmentation_training(sys.argv[2])
  File "/app/run-main.py", line 125, in call_segmentation_training
    os.path.join(model_dir, get_version(), model_name))
  File "/app/EXparser/Training_Seg.py", line 55, in train_segmentation
    train_feat[len(train_feat) - 1].extend([word2feat(a, stopw, 2, len(ln), b1, b2, b3, b4, b5, b6)])
  File "/app/EXparser/src/gle_fun_seg.py", line 378, in word2feat
    feat.update(get_last(w))
  File "/app/EXparser/src/gle_fun_seg.py", line 281, in get_last
    c = w[-1] * 2
IndexError: string index out of range
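
The last frame fails on `w[-1]`, which raises `IndexError` whenever `w` is an empty string. The following is a minimal sketch of that failure mode, not the actual EXparser code: `get_last` here is a stand-in that reproduces only the one line visible in the traceback, and `get_last_guarded` is a hypothetical variant that returns no features for an empty token.

```python
def get_last(w):
    # Stand-in for the failing line in gle_fun_seg.py (only this line is
    # known from the traceback; the rest of the real function is not shown).
    c = w[-1] * 2  # raises IndexError if w == ""
    return {"last": c}


def get_last_guarded(w):
    # Hypothetical guard: an empty token yields an empty feature dict
    # instead of raising.
    if not w:
        return {}
    return get_last(w)


if __name__ == "__main__":
    try:
        get_last("")
    except IndexError as err:
        print("IndexError:", err)
    print(get_last_guarded(""))
    print(get_last_guarded("Weber"))
```

Whether an empty feature dict is acceptable downstream depends on how the training code consumes the features, so this is an illustration of the cause rather than a proposed patch.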
cboulanger commented 2 years ago

I added a try/except to work around this issue. The resulting output shows that the error is caused by malformed annotations (see the log below). The workaround simply skips the malformed lines, which may be the only appropriate solution.

Segmentation training [###.............................] 35/320: 0:00:24 remaining...
16563.xml: problem parsing <author><surname>Weber <author><given-names>Max </surname></author></given-names></author>(</author><year>1988</year><author>c/ Orig. </author><year>1920</year><author>) <title>Gesammelte Aufsätze zur Religionssoziologie I</author>. <other>Tübingen</title>.</other>
Segmentation training [#######.........................] 71/320: 0:00:23 remaining...
20786.xml: problem parsing <author><surname>Schnell</surname>,<given-names> R.</given-names></author>, <year>1997</year>: <title>Nonresponse in Bevölkerungsumfragen. Ausmaß, Entwicklung und Ursachen</title>. <other>Opladen<other>: <publisher>Leske + Budrich.</publisher></other></other>
Segmentation training [#######.........................] 77/320: 0:00:14 remaining...
21690.xml: problem parsing <source>Working Brief</source> <volume>15</volume>: <author><given-names>Diego</given-names> <surname>Compagna / <author><given-names>Stefan</surname> <surname>Derpmann</surname></author></given-names></author> / <author><given-names>Kathrin</given-names> <surname>Mauz</surname></author> / <author><given-names>Karen</given-names> <surname>Shire</surname></author> (<year>2009</year>): <title>Förderung des Wissenstransfers für eine aktive Mitgestaltung des Pflegesektors durch Mikrosystemtechnik (WiMi-Care)</title>, <source>Working Brief</source> <volume>15</volume>: <title>Die Einstellung von Pflegekräften gegenüber technischen Neuerungen</title>. In: <url>http://www.wimi-care.de/outputs.html#Briefs</url> (letzter Abruf: <other>02.12.2009</other>).
Segmentation training [##################..............] 188/320: 0:00:13 remaining...
36684.xml: problem parsing <title>Stellungnahmen geladener Sachverständiger vor dem Bundestag zum Thema Fiskalpakt und ESM</title>, <other>7.5.</other><year>2012</year>: <url><www. bundestag.de/bundestag/ausschuesse17/a08/anhoerungen/fiskalpakt_und_esm/stellungnahmen/index.html/></url>.
Segmentation training [######################..........] 225/320: 0:00:10 remaining...
40723.xml: problem parsing <author><surname>Koskinas</surname></author>, <author><given-names>Ioannis </given-names></author>(<year>2014</year>),<title> The Only Choice Left for Afghanistan</title>, online: <url>htp://southasia.foreign-policy.com/posts/2014/09/11/the_only_choice_ left_for_afghanistan></url> (<other>27 October 2014</other>).
Segmentation training [##########################......] 260/320: 0:00:05 remaining...
45841.xml: problem parsing <editor>Folha Online</editor> (<year>2012</year>), <url><www1.folha.uol.com.br/fsp/brasil/></url> (<other>12. November 2012</other>).
45841.xml: problem parsing <author><surname>Patarra</surname>, <given-names>Ivo</given-names></author> (<year>2010</year>), <title>O chefe</title>, online: <url><www.escandalodomensalao.com.br></url> (<other>2. November 2012</other>).
45841.xml: problem parsing <editor>Veja</editor> (<year>2012</year>), <title>O Julgamento do Mensalão. A hora da Sentença</title>, online: <url><htp://veja.abril.com.br/o-jul - gamento-do-mensalao/hora-da-sentenca/></url> (<other>13. November 2012</other>).
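
For reference, a try/except workaround of the kind described in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the code committed to the repository: `extract_features` is a hypothetical stand-in for the per-line feature extraction in `Training_Seg.py`, `collect_features` is an invented helper name, and the message format only mirrors the log lines above.

```python
import sys


def extract_features(line):
    # Hypothetical stand-in for the real per-line feature extraction.
    # Splitting on single spaces yields empty tokens for consecutive
    # spaces, and tok[-1] then raises IndexError.
    tokens = line.split(" ")
    return [{"last": tok[-1] * 2} for tok in tokens]


def collect_features(file_name, lines):
    """Extract features per line, skipping lines that raise IndexError."""
    features = []
    skipped = []
    for line in lines:
        try:
            features.append(extract_features(line))
        except IndexError:
            # Report the offending line and continue with the next one.
            sys.stderr.write(f"{file_name}: problem parsing {line}\n")
            skipped.append(line)
    return features, skipped


if __name__ == "__main__":
    feats, bad = collect_features("example.xml", ["a b", "a  b"])
    print(len(feats), "lines kept;", len(bad), "lines skipped")
```

Returning the skipped lines alongside the features keeps the malformed annotations easy to collect and correct in the source files later, instead of only discarding them during training.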