Dataset-independent issue. See screenshot for Twentynewsgroup.
Original comment by l.flek...@gmail.com
on 16 Jul 2013 at 10:02
This was necessary due to some encoding bugs in ClearTk. It should not impair
the results, but it should also not be necessary any more, since we can just
write correct UTF8 now.
Original comment by oliver.ferschke
on 16 Jul 2013 at 10:09
Only feature names are impacted; the measured values are correct. Do stopword
lists ignore escaped characters?
Original comment by l.flek...@gmail.com
on 16 Jul 2013 at 10:11
Stopword lists don't work on the ARFF. They work on the CAS, as far as I
remember. So they should not be affected.
The conversion is done when the ARFF is written.
Original comment by oliver.ferschke
on 16 Jul 2013 at 10:19
Ok, good then
Original comment by l.flek...@gmail.com
on 16 Jul 2013 at 10:29
This is a feature not a bug :)
ARFF files seem to be sensitive to certain special characters and stop working
when feature names contain them.
As a workaround, I replace all special characters with their code number,
except for some that are known to be safe.
Original comment by torsten....@gmail.com
on 18 Jul 2013 at 7:10
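A minimal sketch of the workaround described above, for illustration only. The class name FeatureNameEscaper and the set of "safe" characters (ASCII letters, digits and underscore) are assumptions; the thread does not list which characters the actual code treats as safe. The "u" + code point form follows the expression quoted later in this thread.

```java
final class FeatureNameEscaper {
    private FeatureNameEscaper() {}

    // Assumption: only ASCII letters, digits and '_' count as safe.
    static boolean isSafe(int codePoint) {
        return (codePoint >= 'a' && codePoint <= 'z')
            || (codePoint >= 'A' && codePoint <= 'Z')
            || (codePoint >= '0' && codePoint <= '9')
            || codePoint == '_';
    }

    // Replaces every unsafe character with "u" followed by its decimal code point.
    static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (isSafe(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append('u').append(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With this sketch, escape("a-b") and escape("au45b") both yield "au45b", so the original name cannot be recovered from the escaped one.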
I think this is clarified now. Please re-open if problems should arise.
Original comment by torsten....@gmail.com
on 6 Sep 2013 at 11:07
Suggestion from Richard: use URLCodec.encode() and URLCodec.decode() instead of
the "u" + c.codePointAt(0)
That way it's reversible.
Could someone please comment on whether ARFF files could handle urlencoding?
http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/URLCodec.html
Original comment by EmilyKJa...@gmail.com
on 31 Jan 2014 at 1:56
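A rough sketch of the reversibility argument, using java.net.URLEncoder and java.net.URLDecoder from the JDK as a stand-in for the commons-codec URLCodec so that it runs without extra dependencies. The class name UrlRoundTrip and the sample feature name are hypothetical. Whether ARFF readers accept the resulting "%" and "+" characters is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlRoundTrip {
    private UrlRoundTrip() {}

    // Percent-encodes a feature name as UTF-8 bytes; the result is pure ASCII.
    static String encode(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Reverses encode(), restoring the original feature name.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // True if every character in s is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }
}
```

For a name such as "Gr\u00F6\u00DFe/cm", decode(encode(name)) gives back the original string, and isAscii(encode(name)) holds.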
I would say, just give it a try :)
Original comment by torsten....@gmail.com
on 31 Jan 2014 at 2:33
According to http://weka.wikispaces.com/ARFF+%28stable+version%29, ARFF files
need to be all ASCII. Therefore, I guess, urlencoding works.
Original comment by daxenber...@gmail.com
on 31 Jan 2014 at 2:51
Well, some of the problematic characters were "/" and "-" which are also ASCII.
AFAIK urlencode will not change "-", so this probably doesn't solve the problem.
Original comment by torsten....@gmail.com
on 31 Jan 2014 at 3:54
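The point about "-" can be checked with a small sketch. It uses the JDK's java.net.URLEncoder rather than URLCodec; as far as I know both follow the same form-urlencoding rules, but that equivalence is an assumption. The class name UrlEncodeCheck is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodeCheck {
    private UrlEncodeCheck() {}

    // Form-urlencodes s as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // True if encoding leaves s exactly as it was.
    static boolean isUnchanged(String s) {
        return encode(s).equals(s);
    }
}
```

Here isUnchanged("-") is true while isUnchanged("/") is false, so "/" would be rewritten but "-" would still reach the ARFF file as-is.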
Original issue reported on code.google.com by
l.flek...@gmail.com
on 15 Jul 2013 at 6:10