dkpro / dkpro-uby

Framework for creating and accessing UBY resources – sense-linked lexical resources in standard UBY-LMF format
https://dkpro.github.io/dkpro-uby

OmegaWiki: Integrate additional POS tags / remove multibyte chars / region labels / sense index #109

Open judithek opened 9 years ago

judithek commented 9 years ago
Tried the conversion of a newer OmegaWiki dump.

1. There are UTF-16 multibyte characters which cannot be handled by the XML components
we use. Remove the affected values for now.

2. There are additional part of speech tags that are currently not part of the POS
mapping (e.g., for the lemma "but"). Add more mapping entries.

3. The converter creates semantic labels of type "regionofUsage". While this field
often contains information on the diatopic variety, it is a free-text field and also
includes other label types and longer explanations that do not correspond to our
definition of a label. Examples are:
* http://www.omegawiki.org/Expression:Mietze - label: verniedlichend
* http://www.omegawiki.org/Expression:usw. - label: Vor usw. steht in Aufzählungen
kein Komma. Es heißt also nicht "Bananen, Äpfel, usw.", sondern "Bananen,
Äpfel usw."
** cf. http://uby:8080/uby-browser/entry/OW_deu_LexicalEntry_24644

4. Currently Sense.index contains internal OmegaWiki IDs which should be encoded in
MonolingualExtRef. The index should be a running number corresponding to the natural
sense order defined by the Lexicon.
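
Item 4 can be sketched roughly as follows. This is only an illustration under stated assumptions: `SenseStub` is a hypothetical stand-in rather than the UBY `Sense` class, the external system label is a placeholder, and starting the running number at 1 is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual converter code.
final class SenseIndexSketch {

    // Placeholder label; the value used by the real converter is not given here.
    static final String EXTERNAL_SYSTEM = "OmegaWiki_ID_placeholder";

    // Minimal stand-in for a sense with an external reference.
    static final class SenseStub {
        final int index;                // running number in natural sense order
        final String externalSystem;    // which system the reference belongs to
        final String externalReference; // the original OmegaWiki ID

        SenseStub(int index, String externalSystem, String externalReference) {
            this.index = index;
            this.externalSystem = externalSystem;
            this.externalReference = externalReference;
        }
    }

    // Takes the OmegaWiki IDs of one entry in their natural sense order and
    // returns senses whose index is a running number; the ID moves to the
    // external reference instead of being stored in the index.
    static List<SenseStub> numberSenses(List<String> omegaWikiIdsInSenseOrder) {
        List<SenseStub> senses = new ArrayList<>();
        int running = 1; // assumption: numbering starts at 1
        for (String id : omegaWikiIdsInSenseOrder) {
            senses.add(new SenseStub(running++, EXTERNAL_SYSTEM, id));
        }
        return senses;
    }
}
```

With this layout, code that needs the OmegaWiki ID has to read the external reference, not the index.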

Original issue reported on code.google.com by chmeyer.de on 2014-10-09 09:01:03

judithek commented 9 years ago
I can take care of some or all of these issues. Please let me know if you already started
working on any of these.

Original issue reported on code.google.com by matu011235 on 2014-10-09 11:28:27

judithek commented 9 years ago
Committed my changes. Feel free to review. Regarding the labels, it might be interesting
to manually classify them (possibly restricting the selection by length and/or frequency)
- there are valuable, but untyped semantic labels hidden in the annotations. Since
this sounds like labor-intensive work, I'll leave it for future work ;-)
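
A pre-selection by length and frequency could look roughly like the sketch below. The class name, the method name and both thresholds are hypothetical; suitable cut-offs would have to be chosen after looking at the actual label distribution.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for picking label candidates for manual classification.
final class LabelCandidateSketch {

    // Returns the distinct labels that are at most maxLength characters long
    // and occur at least minFrequency times, in order of first occurrence.
    static List<String> selectCandidates(List<String> labels, int maxLength, int minFrequency) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : labels) {
            String label = raw.trim();
            if (label.isEmpty() || label.length() > maxLength) {
                continue; // skip empty values and long free-text explanations
            }
            counts.merge(label, 1, Integer::sum);
        }
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minFrequency) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }
}
```

Short, frequent values such as "verniedlichend" would survive such a filter, while sentence-length usage notes would be dropped before anyone has to look at them.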

Original issue reported on code.google.com by chmeyer.de on 2014-10-09 13:12:38

judithek commented 9 years ago
Changes look good, I agree that handling the labels can be delayed for now. However, I'm
not quite happy with excluding UTF-16 characters - according to the XML specification,
any XML processor should be able to handle them: http://www.w3.org/TR/xml/#charsets
Maybe we can look into that again later on, it's not an urgent issue I guess.

Original issue reported on code.google.com by matu011235 on 2014-10-10 05:41:06

judithek commented 9 years ago
Agreed. To clarify: not all UTF-16 characters are removed, only values that contain
a UTF-16 multibyte character (i.e., a character that needs 32 bits, a surrogate pair,
to encode). I assume that some UTF-8/UTF-16/UTF-8 conversion is missing somewhere in the
process (reading from the DB, processing in Java, writing to the XML file). This should
be looked into. So far, I removed the values, as the converter otherwise fails with an
exception.
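
For illustration, detecting such values in Java could look like the sketch below. `SupplementaryCharFilter` is a hypothetical helper and not part of the converter; it only shows how code points outside the Basic Multilingual Plane (stored as surrogate pairs in a Java `String`) can be recognized, and it does not explain why the XML components fail on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the UBY code base.
final class SupplementaryCharFilter {

    // True if the value contains at least one code point outside the Basic
    // Multilingual Plane, i.e. one that Java stores as a surrogate pair.
    static boolean containsSupplementary(String value) {
        return value.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Keeps only the values that consist entirely of BMP characters.
    static List<String> dropValuesWithSupplementary(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (!containsSupplementary(v)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String plain = "Mietze";
        String withPair = "cat \uD83D\uDC31"; // U+1F431 written as a surrogate pair
        System.out.println(containsSupplementary(plain));    // false
        System.out.println(containsSupplementary(withPair)); // true
    }
}
```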

Original issue reported on code.google.com by chmeyer.de on 2014-10-10 07:35:29

judithek commented 9 years ago
Unfortunately, changing this:
4. Currently Sense.index contains internal OmegaWiki IDs which should be encoded in
MonolingualExtRef. The index should be a running number corresponding to the natural
sense order defined by the Lexicon.

broke the import classes where OW alignments are imported, e.g.
OmegaWikiCrossLingualAlignment and

@Michael:
Would it be much effort to rewrite the problematic line, and could you do that?
Otherwise the import cannot continue.

List<Sense> first = ubySource.getSensesByOWSynTransId(""+source.getSyntransid());

getSensesByOWSynTransId does not work any more, because the index attribute no
longer contains the required value.
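
One way to think about a replacement lookup is sketched below: senses are found via their external reference, not via the index. This is an assumption about the general approach, not the change that was actually committed; `SenseStub` and both method names are hypothetical and do not belong to the UBY API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the UBY API.
final class ExternalRefLookupSketch {

    // Minimal stand-in for a sense that carries its OmegaWiki ID as an
    // external reference while the index is just a running number.
    static final class SenseStub {
        final String id;
        final int index;
        final String externalReference;

        SenseStub(String id, int index, String externalReference) {
            this.id = id;
            this.index = index;
            this.externalReference = externalReference;
        }
    }

    // Groups senses by their external reference so that repeated lookups
    // do not have to scan the full sense list.
    static Map<String, List<SenseStub>> indexByExternalReference(List<SenseStub> senses) {
        Map<String, List<SenseStub>> byRef = new HashMap<>();
        for (SenseStub s : senses) {
            byRef.computeIfAbsent(s.externalReference, k -> new ArrayList<>()).add(s);
        }
        return byRef;
    }

    // Returns all senses whose external reference equals the given ID,
    // or an empty list if none is known.
    static List<SenseStub> sensesBySynTransId(Map<String, List<SenseStub>> byRef, String synTransId) {
        return byRef.getOrDefault(synTransId, Collections.emptyList());
    }
}
```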

Original issue reported on code.google.com by eckle.kohler on 2014-10-16 09:10:33

judithek commented 9 years ago
I don't know how much effort it is, but I will look into it.

Original issue reported on code.google.com by matu011235 on 2014-10-16 10:51:40

judithek commented 9 years ago
I committed the changes. Please check and close the bug if the issue is resolved.

Original issue reported on code.google.com by matu011235 on 2014-10-16 11:16:03

judithek commented 9 years ago
thanks! I will check it tomorrow (first item on the agenda ;)

Original issue reported on code.google.com by eckle.kohler on 2014-10-16 18:28:33

judithek commented 9 years ago
updated OmegaWikiCrossLingualAlignment to new externalSystem value

Original issue reported on code.google.com by eckle.kohler on 2014-10-17 07:49:08

judithek commented 9 years ago
is fixed for the "lite" import
changes might still be necessary for the import of Wikipedia - OW alignments

Original issue reported on code.google.com by eckle.kohler on 2014-10-17 11:41:37

judithek commented 9 years ago
I updated the "medium import" as well - new externalSystem Value in OmegaWikiWiktionaryAlignment

Original issue reported on code.google.com by eckle.kohler on 2014-10-20 13:30:49

judithek commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by eckle.kohler on 2014-11-07 09:29:49

judithek commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2015-02-18 21:11:45

judithek commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by chmeyer.de on 2015-04-10 08:57:50