dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

Etymology paragraph stripped of word hyperlinks #11

Closed chmeyer closed 9 years ago

chmeyer commented 9 years ago

Originally reported on Google Code with ID 11

What steps will reproduce the problem?

1. Output the etymological information for a word.

What is the expected output? What do you see instead?

I did not know what to expect, so I output the etymology information for all words
in the db. These are the first lines from the file created (most entries are like this):

English:dictionary::    , from , from , from , perfect past participle of + .
English:dictionary::    , from , from , from , perfect past participle of + .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German , Danish
.The verb comes from .
English:free::  From , from , , from , . Compare West Frisian , Dutch , German , Danish
.The verb comes from .

What version of the product are you using? On what operating system?

I am using jwktl-1.0.1 as a Maven artifact on an Ubuntu 12.04 machine, with the Wiktionary
dump enwiktionary-20141004-pages-articles.xml.

Please provide any additional information below.

The process runs smoothly, all other output information seems fine. This looks like
an "overly eager" clean-up of the paragraphs, since etymological information is given
in a slightly non-standard format. I am not sure if this format changed over time,
or the etymological information was always provided like this.

Reported by nastase@fbk.eu on 2015-01-26 11:28:43

chmeyer commented 9 years ago
Hi. If you access the etymologies using IWiktionaryEntry.getWordEtymology(), you will
obtain an IWikiString representation of the etymology. This class provides both a getPlainText()
and a getText() method to obtain a string representation of the etymology. I assume
that your code used the former (or an implicit toString(), which also uses getPlainText()).
The latter, however, lets you work with the full markup encoded in Wiktionary.

And yes: getPlainText() is too eager for etymology strings. I'm not sure a plain-text
representation is necessary at all if you have the markup version.

I have been experimenting with an EtymologyTemplateHandler for a while; you can find it in the
api.util.TemplateParser file. With this approach, it should be possible to analyze
etymology strings. It's far from perfect, but probably a good starting point. If you
make interesting changes to the JWKTL source code, I'm happy to integrate them. Just
reopen this ticket or start a new one. Best wishes!
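
To illustrate why the plain-text output loses the linked words, here is a minimal, self-contained sketch. It does not use JWKTL and is not how getPlainText() or the EtymologyTemplateHandler is implemented; the class name, the method names, and the sample markup are all assumptions made for illustration. It assumes the etymology markup wraps each cited word in a template such as {{term|fre|lang=enm}}: dropping the templates leaves only the connecting punctuation, while keeping the first positional argument preserves the word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of JWKTL.
public class EtymologyMarkupSketch {
    // Matches a non-nested template such as {{term|fre|lang=enm}}.
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{([^{}]*)\\}\\}");

    /** Drops every template, reproducing the kind of output seen in the report. */
    public static String stripTemplates(String markup) {
        return TEMPLATE.matcher(markup).replaceAll("");
    }

    /** Replaces each template with its first positional (unnamed) argument. */
    public static String renderTemplates(String markup) {
        Matcher m = TEMPLATE.matcher(markup);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(firstTerm(m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // parts[0] is the template name; named arguments contain '=' and are skipped.
    private static String firstTerm(String templateBody) {
        String[] parts = templateBody.split("\\|");
        for (int i = 1; i < parts.length; i++) {
            if (!parts[i].contains("=")) {
                return parts[i].trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Assumed sample markup, simplified from the style of a 2014 dump.
        String markup = "From {{term|fre|lang=enm}}, from {{term|freo|lang=ang}}.";
        System.out.println(stripTemplates(markup));
        System.out.println(renderTemplates(markup));
    }
}
```

This sketch treats every template alike and ignores nesting, so real etymology sections (which also use language-name templates) would need per-template handling on top of the string returned by getText().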

Reported by chmeyer.de on 2015-02-04 15:07:36

chmeyer commented 9 years ago
Thanks Christian! I wasn't planning to make changes, I was just trying to
see how much of the info in Wiktionary is formalized. I am allergic to
databases :)

all the best,
Vivi

Reported by nastase@fbk.eu on 2015-02-04 15:16:12