Closed by daxenberger 9 years ago
Dataset-independent issue. See screenshot for Twentynewsgroup.
Reported by l.flekova
on 2013-07-16 10:02:29
This was necessary due to some encoding bugs in ClearTk. It should not impair the results,
but it should no longer be necessary either, since we can now just write correct
UTF-8.
Reported by oliver.ferschke
on 2013-07-16 10:09:17
Only feature names are affected; the measured values are correct. Do stopword lists
ignore escaped characters?
Reported by l.flekova
on 2013-07-16 10:11:43
Stopword lists don't work on the ARFF. They work on the CAS, as far as I remember. So
they should not be affected.
The conversion is done when the ARFF is written.
Reported by oliver.ferschke
on 2013-07-16 10:19:00
Ok, good then
Reported by l.flekova
on 2013-07-16 10:29:03
This is a feature not a bug :)
ARFF files seem to be sensitive to certain special characters and stop working when
feature names contain them. As a workaround, I replace all special characters with
their code number, except for some that are known to be safe.
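
A minimal sketch of this kind of workaround, not the actual DKPro TC code. The class name FeatureNameEscaper and the whitelist (ASCII letters, digits and underscore) are assumptions for illustration; the comment above does not say which characters are treated as safe.

```java
final class FeatureNameEscaper {

    // Hypothetical whitelist: the thread does not list the "known safe" characters.
    private static boolean isSafe(int cp) {
        return (cp >= 'a' && cp <= 'z')
            || (cp >= 'A' && cp <= 'Z')
            || (cp >= '0' && cp <= '9')
            || cp == '_';
    }

    // Replaces every character outside the whitelist with "u" + its decimal code point.
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        name.codePoints().forEach(cp -> {
            if (isSafe(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('u').append(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a/b-c")); // prints "au47bu45c"
    }
}
```

With this scheme the literal name "u47" and the escaped "/" both come out as "u47", so the original feature name cannot be recovered from the escaped one.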
Reported by torsten.zesch
on 2013-07-18 07:10:40
I think this is clarified now. Please re-open if problems should arise.
Reported by torsten.zesch
on 2013-09-06 11:07:14
Invalid
Suggestion from Richard: use URLCodec.encode() and URLCodec.decode() instead of the
"u" + c.codePointAt(0)
That way it's reversible.
Could someone please comment on whether ARFF files can handle URL encoding?
http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/URLCodec.html
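
The reversibility argument can be tried out with a short sketch. It uses the JDK's java.net.URLEncoder and URLDecoder (Java 10+ overloads) as stand-ins for the Commons Codec URLCodec, so that it runs without extra dependencies; both produce www-form-urlencoded output, but that they agree on every input is an assumption. The class name UrlEscapeDemo and the sample feature name are made up for illustration.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UrlEscapeDemo {

    // Percent-encodes the name; the result contains only ASCII characters.
    static String encode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8);
    }

    // Inverse of encode(): restores the original name.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "café/au";
        String encoded = encode(original);
        System.out.println(encoded);                           // prints "caf%C3%A9%2Fau"
        System.out.println(decode(encoded).equals(original));  // prints "true"
    }
}
```

Whether Weka's ARFF reader accepts "%" inside an unquoted attribute name is not settled by this sketch and would still need to be tested.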
Reported by EmilyKJamison
on 2014-01-31 13:56:07
I would say, just give it a try :)
Reported by torsten.zesch
on 2014-01-31 14:33:16
According to http://weka.wikispaces.com/ARFF+%28stable+version%29, ARFF files need to
be all ASCII. Therefore, I guess, URL encoding should work.
Reported by daxenberger.j
on 2014-01-31 14:51:50
Well, some of the problematic characters were "/" and "-" which are also ASCII.
AFAIK urlencode will not change "-", so this probably doesn't solve the problem.
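
One possible way around this, sketched under assumptions: URL-encode first, then percent-encode "-" by hand, since a standard URL decoder still turns "%2D" back into "-". The class name StrictUrlEscaper is hypothetical, the JDK's URLEncoder/URLDecoder (Java 10+ overloads) stand in for URLCodec, and the other characters that URL encoding leaves alone (".", "*", "_") are not handled here.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class StrictUrlEscaper {

    // URL-encodes the name, then also escapes "-", which URLEncoder leaves as is.
    static String encode(String name) {
        return URLEncoder.encode(name, StandardCharsets.UTF_8).replace("-", "%2D");
    }

    // A plain URL decode is enough: "%2D" decodes back to "-".
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = encode("a/b-c");
        System.out.println(encoded);          // prints "a%2Fb%2Dc"
        System.out.println(decode(encoded));  // prints "a/b-c"
    }
}
```

This keeps the mapping reversible while removing both "/" and "-" from the written feature names.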
Reported by torsten.zesch
on 2014-01-31 15:54:21
Originally reported on Google Code with ID 32
Reported by l.flekova
on 2013-07-15 18:10:00