dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

bugfix/#78- Remove `File:` comments, not just `Image:` comments. #79

Open benldr opened 3 years ago

benldr commented 3 years ago

This solves issue #78 (I have re-parsed a Wiktionary dump using the updated code and no longer get the issues explained in #78).

I am assuming that `[[File:` markup on a Wiktionary page is not used anywhere by the jwktl parser (I am not very familiar with the bulk of jwktl). Obviously, if it is used elsewhere, then my code change should not be approved!

Apologies if I have not followed the correct convention for contributing to the project; I am new to this. If I have not, feel free to delete my pull request and make the edit yourself.

jberkel commented 2 years ago

Hello,

you're right, the parser should remove `File:` tags as well. However, I've just noticed another problem with the image removal: it fails when the image has nested links, for example:

[[File:foo.png|thumb|Bla bla [[foo]]  bar]]

Currently, this results in `bar]]`.
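One way to handle nested links is to count bracket depth instead of stopping at the first `]]`. The following is only a minimal sketch of that idea, not the existing jwktl code: the class `FileLinkStripper` and its method `stripFileLinks` are hypothetical names, and it assumes that only the `[[File:` and `[[Image:` prefixes need to be removed and that they should be matched case-insensitively.

```java
// Hypothetical helper, not part of jwktl: removes [[File:...]] and [[Image:...]]
// links, including any [[nested]] links inside the caption.
final class FileLinkStripper {
    // Assumption: these are the only prefixes that should be stripped.
    private static final String[] PREFIXES = {"[[File:", "[[Image:"};

    private FileLinkStripper() {}

    static String stripFileLinks(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (startsWithPrefix(text, i)) {
                int end = findMatchingClose(text, i);
                if (end >= 0) {
                    i = end; // skip the whole link, nested links included
                    continue;
                }
                // Unbalanced brackets: fall through and keep the text as-is.
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }

    private static boolean startsWithPrefix(String text, int offset) {
        for (String prefix : PREFIXES) {
            if (text.regionMatches(true, offset, prefix, 0, prefix.length())) {
                return true;
            }
        }
        return false;
    }

    // Returns the index just past the "]]" that closes the "[[" at start,
    // or -1 if the brackets never balance.
    private static int findMatchingClose(String text, int start) {
        int depth = 0;
        int i = start;
        while (i + 1 < text.length()) {
            if (text.startsWith("[[", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("]]", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        return -1;
    }
}
```

With this sketch, `FileLinkStripper.stripFileLinks("[[File:foo.png|thumb|Bla bla [[foo]]  bar]]")` returns an empty string, while ordinary links such as `[[foo]]` outside a file link are left untouched.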