character encoding problems

NikhilMIT2013 / java-libpst

Automatically exported from code.google.com/p/java-libpst


character encoding problems #1

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Open any item containing non-ASCII characters.

What is the expected output? What do you see instead?
I see weird special chars instead of umlauts (for example).

What version of the product are you using? On what operating system?
java-libpst 0.2 with Java 1.6.0_18 on Linux

Please provide any additional information below.
Is there any way of solving this issue? I tried to modify getStringValue in
PSTTableItem to be able to read the raw bytes with the cp1252 encoding, but
unfortunately the byte representation is really strange.

Original issue reported on code.google.com by Hannes.K...@gmail.com on 18 Mar 2010 at 5:00

GoogleCodeExporter commented 8 years ago

While looking at the raw data output I noticed that I only have to ignore every
second byte in order to get a latin1 or cp1252 encoded string. At least with the
pst-file I'm currently testing with, this works.

In PSTTableItem I replaced the if case "(stringType == VALUE_TYPE_PT_UNICODE)" 
with:

// treat it like a cp1252 encoded text separated by 0-chars
ByteArrayOutputStream baos = new ByteArrayOutputStream(this.data.length / 2);
for (int x = 0; x < this.data.length - 1; x = x + 2) {
    baos.write(this.data[x]);
}
if (this.data.length % 2 == 1) {
    baos.write(this.data[this.data.length - 1]);
}
baos.close();
String s = new String(baos.toByteArray(), Charset.forName("cp1252"));
outputBuffer.append(s);

I guess this only works if the emails are encoded this way, right? If there is
a UTF-8 encoded email it could possibly break the special chars...
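
To illustrate the concern raised above: the raw bytes look like UTF-16LE text, and dropping every second byte only survives for characters whose code point fits in one byte and matches cp1252. The sketch below is a minimal demonstration under that assumption. The class name `LowByteHackDemo` and the helper `decodeLowBytes` are hypothetical names introduced here, and `StandardCharsets` assumes Java 7 or later rather than the Java 1.6 mentioned in the report.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LowByteHackDemo {
    // Hypothetical helper mirroring the workaround above:
    // keep every second byte and decode the result as cp1252.
    static String decodeLowBytes(byte[] data) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(data.length / 2 + 1);
        for (int x = 0; x < data.length - 1; x += 2) {
            baos.write(data[x]);
        }
        if (data.length % 2 == 1) {
            baos.write(data[data.length - 1]);
        }
        return new String(baos.toByteArray(), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // "für": every code point is below 0x100, so the high bytes are all zero.
        byte[] umlaut = "f\u00fcr".getBytes(StandardCharsets.UTF_16LE);
        // "5 €": the euro sign is U+20AC, so its high byte (0x20) carries information.
        byte[] euro = "5 \u20ac".getBytes(StandardCharsets.UTF_16LE);

        System.out.println(decodeLowBytes(umlaut).equals("f\u00fcr")); // true
        System.out.println(decodeLowBytes(euro).equals("5 \u20ac"));   // false
        // Decoding the same bytes as UTF-16LE keeps the euro sign intact.
        System.out.println(new String(euro, StandardCharsets.UTF_16LE).equals("5 \u20ac")); // true
    }
}
```

So the workaround is fine for umlauts but loses any character outside the single-byte range, which suggests the data should be decoded as UTF-16LE instead.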

Original comment by Hannes.K...@gmail.com on 19 Mar 2010 at 11:31

GoogleCodeExporter commented 8 years ago

Hi, I need help searching for messages by subject, name and sent time using
the java-libpst API.

Original comment by ram...@gmail.com on 30 Apr 2010 at 9:00

GoogleCodeExporter commented 8 years ago

PSTTableItem doesn't read Unicode strings correctly. Replace the
"if (stringType == VALUE_TYPE_PT_UNICODE) { .... }" block with the code below
to fix this problem.

        if (stringType == VALUE_TYPE_PT_UNICODE) {
            // we are a nice little-endian unicode string.
            // New code - use String class built-in decoding of little-endian unicode.
            try {
                return new String(data, "UTF-16LE");
            } catch (UnsupportedEncodingException e) {
                System.out.println("Error decoding string: " + data.toString());
                return "";
            }
        }

The original code failed to AND each byte with 0xFF when converting them to
unicode chars.
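
For anyone wondering why the missing AND with 0xFF matters: Java bytes are signed, so a byte such as 0xFC is widened to a negative int and its sign bits spill into the high half of the char. The sketch below is only a hypothetical reconstruction of that kind of bug, not the library's actual code. `SignExtensionDemo`, `combineUnmasked` and `combineMasked` are names made up for this illustration, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class SignExtensionDemo {
    // Hypothetical buggy variant: bytes are widened with sign extension,
    // so a low byte >= 0x80 sets all the upper bits of the result.
    static char combineUnmasked(byte lo, byte hi) {
        return (char) ((hi << 8) | lo);
    }

    // Masking each byte with 0xFF treats it as an unsigned value 0..255.
    static char combineMasked(byte lo, byte hi) {
        return (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
    }

    public static void main(String[] args) {
        // U+00FC (u with umlaut) in UTF-16LE is the byte pair FC 00.
        byte[] data = "\u00fc".getBytes(StandardCharsets.UTF_16LE);

        char bad = combineUnmasked(data[0], data[1]);
        char good = combineMasked(data[0], data[1]);

        System.out.println(Integer.toHexString(bad));  // fffc
        System.out.println(Integer.toHexString(good)); // fc
    }
}
```

Using `new String(data, "UTF-16LE")` as in the fix sidesteps the manual byte arithmetic entirely.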

Orin.

Original comment by orin.e...@gmail.com on 20 May 2010 at 11:36

GoogleCodeExporter commented 8 years ago

Sweet, thanks Orin. Being an arrogant Aussie, I didn't have much in the way of
multi-byte messages to test with.

I've added this code to SVN and created a new release.

Original comment by rjohnson...@gmail.com on 18 Jun 2010 at 3:22

GoogleCodeExporter commented 8 years ago

Marking as fixed.

Original comment by rjohnson...@gmail.com on 18 Jun 2010 at 3:25

Changed state: Fixed