encoding issue in trigrams - different serialization in meta?

dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.

https://dkpro.github.io/dkpro-tc/


encoding issue in trigrams - different serialization in meta? #32

Closed daxenberger closed 9 years ago

daxenberger commented 9 years ago

Originally reported on Google Code with ID 32

What steps will reproduce the problem?
1. Use the demo dataset from the workshop
2. Use triple feature extractor, or ngram or posngram FE
3. View the arff in Weka explorer

What is the expected output? What do you see instead?

Slashes, dashes and similar characters end up in feature names as text such as "u44" and "u45".
It is not clear yet whether this affects the feature values or only the feature names; it remains
to be explored whether it comes from the ser file or from feature extraction itself.
The problem already appears in Friday's runs (using textreader on the sentiment
experiment), so it is not caused by today's changes.


Reported by l.flekova on 2013-07-15 18:10:00

daxenberger commented 9 years ago

The issue is independent of the dataset. See the attached screenshot for Twentynewsgroup.

Reported by l.flekova on 2013-07-16 10:02:29

Attachment: encodingissue.png

daxenberger commented 9 years ago

This was necessary due to some encoding bugs in ClearTK. It should not impair the results,
but it should also no longer be necessary, since we can now simply write correct UTF-8.

Reported by oliver.ferschke on 2013-07-16 10:09:17

daxenberger commented 9 years ago

Only feature names are affected; the measured values are correct. Do stopword lists ignore
the escaped characters?

Reported by l.flekova on 2013-07-16 10:11:43

daxenberger commented 9 years ago

Stopword lists don't work on the ARFF. They work on the CAS, as far as I remember. So
they should not be affected. 
The conversion is done when the ARFF is written.

Reported by oliver.ferschke on 2013-07-16 10:19:00

daxenberger commented 9 years ago

Ok, good then

Reported by l.flekova on 2013-07-16 10:29:03

Labels added: Priority-Low
Labels removed: Priority-Medium

daxenberger commented 9 years ago

This is a feature not a bug :)

ARFF files seem to be sensitive to certain special characters and stop working when they occur.
As a workaround, I replace all special characters with their code number, except for
some that are known to be safe.
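
For readers following along, here is a minimal sketch of what such a replacement could look like. The class name, the method name and the set of "safe" characters are all assumptions made for illustration; this is not the actual DKPro TC code.

```java
// Hypothetical helper; NOT the actual DKPro TC implementation.
final class FeatureNameEscaper {

    // Assumed set of characters kept as-is in addition to letters and digits.
    private static final String SAFE = "_";

    // Replaces every character outside the safe set with "u" + its decimal code point.
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || SAFE.indexOf(cp) >= 0) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('u').append(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("and/or"));      // andu47or
        System.out.println(escape("rock-n-roll")); // rocku45nu45roll
    }
}
```

In this sketch the mapping is lossy: a literal "u45" and a "-" both come out as "u45", so the original name cannot always be recovered from the escaped one.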

Reported by torsten.zesch on 2013-07-18 07:10:40

daxenberger commented 9 years ago

I think this is clarified now. Please re-open if problems should arise.

Reported by torsten.zesch on 2013-09-06 11:07:14

Status changed: Invalid

daxenberger commented 9 years ago

Suggestion from Richard: use URLCodec.encode() and URLCodec.decode() instead of the
"u" + c.codePointAt(0)

That way it's reversible.
Could someone please comment on whether ARFF files could handle urlencoding?

http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/URLCodec.html
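
A rough sketch of the round trip, to make the "reversible" point concrete. It uses the JDK's java.net.URLEncoder and URLDecoder as a stand-in for commons-codec's URLCodec (an assumption made to keep the example dependency-free); the class name and the sample feature name are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Sketch only: JDK classes used as a stand-in for commons-codec's URLCodec.
final class UrlRoundTrip {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String name = "and/or_caf\u00e9";              // hypothetical feature name
        String encoded = encode(name);
        System.out.println(encoded);                    // and%2For_caf%C3%A9
        System.out.println(decode(encoded).equals(name)); // true
    }
}
```

The encoded form is plain ASCII and decodes back to the original name. Whether Weka accepts "%" inside ARFF attribute names is a separate question that this sketch does not answer.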

Reported by EmilyKJamison on 2014-01-31 13:56:07

daxenberger commented 9 years ago

I would say, just give it a try :)

Reported by torsten.zesch on 2014-01-31 14:33:16

daxenberger commented 9 years ago

According to http://weka.wikispaces.com/ARFF+%28stable+version%29, ARFF files need to
be all ASCII. Therefore, I guess, urlencoding works.

Reported by daxenberger.j on 2014-01-31 14:51:50

daxenberger commented 9 years ago

Well, some of the problematic characters were "/" and "-" which are also ASCII.
AFAIK urlencode will not change "-", so this probably doesn't solve the problem.
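
This is easy to check. The sketch below again uses the JDK's java.net.URLEncoder as a stand-in for URLCodec (an assumption; the two may differ in details), and the class name is made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: prints how a few of the characters discussed here are URL-encoded.
final class UrlEncodeCheck {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        String[] samples = {"/", "-", "_", ".", ","};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> \"" + encode(s) + "\"");
        }
        // "/" and "," are percent-encoded (%2F, %2C); "-", "_" and "." pass through unchanged.
    }
}
```

So "/" would be rewritten, but "-" would still reach the ARFF file as-is.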

Reported by torsten.zesch on 2014-01-31 15:54:21