encoding issue in trigrams - different serialization in meta?

ashokpant / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

encoding issue in trigrams - different serialization in meta? #32

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use the demo dataset from the workshop
2. Use triple feature extractor, or ngram or posngram FE
3. View the arff in Weka explorer

What is the expected output? What do you see instead?

Slashes, dashes, and similar characters show up in feature names as text such 
as "u44" and "u45". I am not sure whether this affects the feature values or 
only the feature names; it still needs to be checked whether it comes from the 
ser file or from feature extraction itself.
The problem already appears in the Friday runs (using the text reader on the 
sentiment experiment), so it is not caused by today's changes.

Original issue reported on code.google.com by l.flek...@gmail.com on 15 Jul 2013 at 6:10

GoogleCodeExporter commented 9 years ago

Dataset-independent issue. See screenshot for Twentynewsgroup.

Original comment by l.flek...@gmail.com on 16 Jul 2013 at 10:02

Attachments:

encodingissue.png

GoogleCodeExporter commented 9 years ago

This escaping was necessary because of some encoding bugs in ClearTk. It should 
not impair the results, but it should no longer be necessary either, since we 
can now simply write correct UTF-8.

Original comment by oliver.ferschke on 16 Jul 2013 at 10:09

GoogleCodeExporter commented 9 years ago

Only the feature names are impacted; the measured values are correct. Do 
stopword lists ignore escaped characters?

Original comment by l.flek...@gmail.com on 16 Jul 2013 at 10:11

GoogleCodeExporter commented 9 years ago

Stopword lists don't work on the ARFF. They work on the CAS, as far as I 
remember. So they should not be affected. 
The conversion is done when the ARFF is written.

Original comment by oliver.ferschke on 16 Jul 2013 at 10:19

GoogleCodeExporter commented 9 years ago

Ok, good then

Original comment by l.flek...@gmail.com on 16 Jul 2013 at 10:29

Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

This is a feature, not a bug :)

ARFF files seem to be sensitive to certain special characters and stop working 
when feature names contain them.
As a workaround, I replace all special characters with their code number, 
except for some that are known to be safe.

Original comment by torsten....@gmail.com on 18 Jul 2013 at 7:10
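
For readers following along, the workaround described above could look roughly like the sketch below. It is not the actual DKPro TC code: the class name `FeatureNameEscaper`, the method `escape`, and the `SAFE` whitelist are all hypothetical, and the thread does not say which characters were really treated as safe. Only the `"u"` + code point scheme is taken from the discussion.

```java
public class FeatureNameEscaper {

    // Hypothetical whitelist of non-alphanumeric characters treated as safe.
    // The thread does not list the characters that were actually exempted.
    private static final String SAFE = "_";

    /**
     * Replaces every code point that is neither a letter, a digit, nor
     * whitelisted with "u" followed by its decimal code point.
     */
    public static String escape(String featureName) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < featureName.length(); ) {
            int cp = featureName.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || SAFE.indexOf(cp) >= 0) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('u').append(cp); // append(int) writes the decimal value
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("and/or"));     // prints andu47or
        System.out.println(escape("well-known")); // prints wellu45known
        System.out.println(escape("a,b"));        // prints au44b
    }
}
```

This reproduces the "u44", "u45" style names from the report: only the attribute names change, while the values computed for each feature are untouched.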

GoogleCodeExporter commented 9 years ago

I think this is clarified now. Please re-open if problems should arise.

Original comment by torsten....@gmail.com on 6 Sep 2013 at 11:07

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

Suggestion from Richard: use URLCodec.encode() and URLCodec.decode() instead of 
the "u" + c.codePointAt(0)

That way it's reversible.
Could someone please comment on whether ARFF files could handle urlencoding?

http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/URLCodec.html

Original comment by EmilyKJa...@gmail.com on 31 Jan 2014 at 1:56

GoogleCodeExporter commented 9 years ago

I would say, just give it a try :)

Original comment by torsten....@gmail.com on 31 Jan 2014 at 2:33

GoogleCodeExporter commented 9 years ago

According to http://weka.wikispaces.com/ARFF+%28stable+version%29, ARFF files 
need to be all ASCII. URL encoding produces only ASCII output, so I guess it 
works.

Original comment by daxenber...@gmail.com on 31 Jan 2014 at 2:51

GoogleCodeExporter commented 9 years ago

Well, some of the problematic characters were "/" and "-", which are also ASCII.
AFAIK urlencode will not change "-", so this probably doesn't solve the problem.

Original comment by torsten....@gmail.com on 31 Jan 2014 at 3:54
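
The point about "-" can be checked with a small experiment. The sketch below uses the JDK's `java.net.URLEncoder` and `java.net.URLDecoder` as a stand-in for the commons-codec `URLCodec` suggested earlier, so that it runs without extra dependencies; the two are assumed to behave alike for these characters, but that is worth verifying against commons-codec itself. The class name `UrlEncodingCheck` and its helper methods are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlEncodingCheck {

    /** URL-encodes a feature name as UTF-8 (JDK stand-in for URLCodec.encode). */
    public static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Reverses encode(), which the "u" + code point scheme cannot do reliably. */
    public static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "/" is percent-encoded, but "-" passes through unchanged.
        System.out.println(encode("and/or"));      // prints and%2For
        System.out.println(encode("well-known"));  // prints well-known
        // The round trip restores the original name.
        System.out.println(decode(encode("and/or"))); // prints and/or
    }
}
```

So URL encoding is reversible, but a name containing "-" would still reach the ARFF file as is. Whether the "%" introduced by the encoding is itself acceptable in ARFF attribute names is not settled in this thread.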