wjcarpenter opened this issue 4 years ago
I've narrowed this down a little bit. It looks like the data is already UTF-8 encoded by the time DumpTransaction writes this entry:
[2020-03-15 12:51:19,099][pool-6-thread-1] DEBUG DumpTransaction - [DUMP] Begin Post Search Entry - Filter=(&(|(owner=--ELIDED--)(shareto=--ELIDED--))(&(objectclass=inetOrgPerson)(|(cn=wy*)(mail=wy*)(sn=wy*))));Base=ou=contacts,o=gcontact;Scope=2;Attributes=[LDAPAttribute: {type='cn'}, LDAPAttribute: {type='commonname'}, LDAPAttribute: {type='mail'}, LDAPAttribute: {type='objectClass'}]
dn : uid=--ELIDED--,ou=contacts,o=gcontact
uid : --ELIDED--
owner : --ELIDED--
mail : --ELIDED--
commonname : --ELIDED--
cn : --ELIDED--
sn : --ELIDED--
telephonenumber : --ELIDED--
givenname : --ELIDED--
objectClass : inetOrgPerson
jpegphoto : \303\277\303\230\303\277\303\240^@^PJFIF^@^A^A....
myVdReturnEntry: true
That's looking at the log from an Emacs buffer, so the representation of non-printable characters may be unfamiliar (the octal escapes \303\277 are the bytes 0xC3 0xBF in hex). This later log entry confirms it in a more conventional format:
[2020-03-15 12:51:19,119][pool-6-thread-1] DEBUG CODEC_LOG - Encoded message
MessageType : SEARCH_RESULT_ENTRY
Message ID : 2
Search Result Entry
Entry
dn: uid=--ELIDED--,ou=contacts,o=gcontact
objectClass: 0x69 0x6E 0x65 0x74 0x4F 0x72 0x67 0x50 0x65 0x72 0x73 0x6F 0x6E
owner: --ELIDED--
mail: --ELIDED--
telephonenumber: --ELIDED--
jpegphoto: 0xC3 0xBF 0xC3 0x98 0xC3 0xBF 0xC3 0xA0 0x00 0x10 0x4A 0x46 0x49 0x46 0x00 0x01 ...
cn: --ELIDED--
uid: --ELIDED--
commonname: --ELIDED--
givenname: --ELIDED--
sn: --ELIDED--
I'm using MyVD 1.0.6. I tracked the problem down to net.sourceforge.myvd.inserts.jdbc.JdbcEntrySet.hasMore(). It assumes that values coming from the database are strings:
HashSet<String> attrib = attribs.get(ldapField);
if (attrib == null) {
    attrib = new HashSet<String>();
    attribs.put(ldapField, attrib);
}
String value = rs.getString(dbField);   // binary columns come through here as Strings, too
if (!attrib.contains(value)) {
    attrib.add(value);
}
Later, when it converts things to LDAP-ish, it calls LDAPAttribute.addValue(String). It's probably somewhere inside the com.novell.ldap code that the UTF-8 encoding happens, when the String value is converted to a byte[].
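As a quick sanity check on that theory: the UTF-8 encoding of a character like U+00FF is exactly the 0xC3 0xBF pair showing up in the CODEC_LOG dump above. A standalone demonstration (plain JDK, nothing MyVD-specific):

import java.nio.charset.StandardCharsets;

public class Utf8EncodeDemo {
    public static void main(String[] args) {
        // U+00FF is what the byte 0xFF becomes if some upstream layer
        // decoded it as single-byte text.
        byte[] encoded = "\u00FF".getBytes(StandardCharsets.UTF_8);
        for (byte b : encoded) {
            System.out.printf("0x%02X ", b & 0xFF);   // prints: 0xC3 0xBF
        }
        System.out.println();
    }
}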
HashSet vals = attribs.get(attribName);
if (vals.size() > 0) {
    LDAPAttribute attrib = new LDAPAttribute(attribName);
    ldapAttribs.add(attrib);
    Iterator<String> valIt = vals.iterator();
    while (valIt.hasNext()) {
        String nextValue = valIt.next();
        if (nextValue != null) {
            attrib.addValue(nextValue);
        } else {
            if (attrib.size() == 0) {
                ldapAttribs.remove(attrib);
            }
        }
    }
}
I was able to test this by special-casing "jpegPhoto" in the above code. However, I don't know the proper way to decide which attributes ought to be binary and which ought to be strings (or numbers or whatever else LDAP allows).
I'm guessing that the LDAP schema info read by MyVD tells what the target type should be. The JDBC ResultSetMetaData would give the DB type. A complete and generic implementation would have a matrix of mappings of various DB types to various LDAP types, but maybe a good heuristic would be just using the DB type to decide binary vs string. The LDAPAttribute class only provides for byte[] and String attribute values anyhow.
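Something like the following is what I have in mind. It's only a sketch: the surrounding JdbcEntrySet plumbing is elided, and the set of JDBC types treated as binary is my guess:

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
import com.novell.ldap.LDAPAttribute;

public class BinaryAwareMapping {
    // Heuristic: let the JDBC column type decide binary vs string, since
    // LDAPAttribute only distinguishes byte[] and String values anyhow.
    static boolean isBinaryColumn(ResultSetMetaData md, int column) throws SQLException {
        switch (md.getColumnType(column)) {
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
            case Types.BLOB:
                return true;
            default:
                return false;
        }
    }

    static void addColumnValue(LDAPAttribute attrib, ResultSet rs, int column) throws SQLException {
        if (isBinaryColumn(rs.getMetaData(), column)) {
            attrib.addValue(rs.getBytes(column));   // raw bytes, no charset decoding
        } else {
            attrib.addValue(rs.getString(column));
        }
    }
}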
After a little more poking around ....
Since JdbcInsert seems to read all DB values using getString(), I think it's up to the JDBC driver to decide how to produce that String. In my case, it's MySQL. After some limited looking at the MySQL driver source, I believe it tries to use the collation character encoding for the column. For the binary datatypes, there is no collation character encoding, so it falls back to the JVM default charset. In my case, that's UTF-8. In other words, I think the JDBC driver is taking the contents of my binary column and converting it to a Java String on the assumption that it's UTF-8 encoded. Of course, it's not UTF-8 encoded, so the interpretation is faulty and might even end up with whatever a UTF-8 decoder produces when it sees bytes it doesn't understand.
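One way to check that theory (the connection URL, table, and column names here are placeholders, not my real schema):

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;

public class DriverDecodeCheck {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://localhost/contacts", "user", "password");
             Statement st = c.createStatement();
             // Select the column twice so each copy is only read once per row.
             ResultSet rs = st.executeQuery(
                 "SELECT jpegphoto, jpegphoto FROM contact LIMIT 1")) {
            if (rs.next()) {
                byte[] raw = rs.getBytes(1);
                byte[] viaString = rs.getString(2).getBytes(StandardCharsets.UTF_8);
                // If the driver charset-decoded the binary column, these differ
                // wherever the original bytes had the high bit set.
                System.out.println("identical? " + Arrays.equals(raw, viaString));
            }
        }
    }
}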
It's the result of that first UTF-8 decode that then gets converted back to a byte[] when the string is handed to the LDAPAttribute class. If my original binary data were actually UTF-8 encoded text, I think that decode/encode round trip would probably give back the correct bytes. That wouldn't help, though, since it would still put the attribute value on the wire UTF-8 encoded.
I haven't yet been able to duplicate in standalone Java the series of decoding and encoding that I theorize is going on. But even if I did, the situation seems a bit fragile. My plan now is to switch to storing the jpegPhoto data in my database as something safe (base64 or hex strings or something) that can be carried around as an ASCII-compatible string. I'll then use an insert to convert that ASCII-compatible string to the binary value that I need for the LDAP clients. I'm pretty sure that will work (Insert.postSearchEntry() seems to behave the way I want it to).
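The heart of that conversion is just a base64 decode. A minimal sketch (assuming a Java 8+ runtime for java.util.Base64; the MyVD Insert plumbing around it is elided):

import java.util.Base64;
import com.novell.ldap.LDAPAttribute;

public class JpegPhotoDecode {
    // Turn the ASCII-safe base64 text stored in the database back into
    // the real JPEG bytes for the outgoing entry.
    static LDAPAttribute decode(String base64Value) {
        byte[] jpegBytes = Base64.getDecoder().decode(base64Value);
        return new LDAPAttribute("jpegPhoto", jpegBytes);
    }
}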
If all that works out, then the JdbcInsert handling of binary data still looks like a bug to me, but one that will no longer be important to my use case.
Thanks for the great details. Need to add a check against the binary attributes list in the JDBC insert and load as a blob when needed.
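Roughly along these lines (a sketch only; the real change would need to hook into the insert's configuration):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Set;

public class BinaryAttributeCheck {
    // Attribute names configured as binary for this insert, e.g. "jpegphoto".
    private final Set<String> binaryAttributes;

    public BinaryAttributeCheck(Set<String> binaryAttributes) {
        this.binaryAttributes = binaryAttributes;
    }

    // Fetch the column as raw bytes when the LDAP attribute is on the
    // binary list; otherwise fall back to the existing String path.
    Object fetch(ResultSet rs, String dbField, String ldapAttribute) throws SQLException {
        if (binaryAttributes.contains(ldapAttribute.toLowerCase())) {
            return rs.getBytes(dbField);
        }
        return rs.getString(dbField);
    }
}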
The technique of storing the JPEGs as base64 and converting the values with an insert worked out well. The new insert is in the collection of inserts mentioned in issue #78.
I'm having problems supplying jpegPhoto attributes in LDAP results. I'd be happy to find out I was just doing something wrong.
The JPEG bytes are stored as a column in a MySQL table. I've tried VARBINARY and MEDIUMBLOB with similar (bad) results. If I save the column values to a file, various applications are happy to display it as JPEG.
When I return the results as the "jpegPhoto" attribute of an inetOrgPerson, the values the client receives are not correct (I'm looking at the bytes on the wire via Wireshark). It looks kind of like MyVD is taking that binary blob and delivering it UTF-8 encoded. Here is an example of the beginning of one of those JPEGs:
Database: (hex dump screenshot elided)
Wireshark: (hex dump screenshot elided)
The bytes with the high bit set look like they have been translated into UTF-8 multi-byte sequences.
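As a concrete illustration of that transformation, here is a standalone round trip that reproduces the pattern (the ISO-8859-1 decode is only an assumption for the demo, not a claim about where in the stack it happens):

import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        // First four bytes of a JPEG: the SOI marker plus the start of APP0.
        byte[] jpegStart = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0 };

        // Decode the bytes as if they were single-byte text, then re-encode
        // as UTF-8 -- the round trip a charset-unaware conversion would do.
        String asText = new String(jpegStart, StandardCharsets.ISO_8859_1);
        byte[] onTheWire = asText.getBytes(StandardCharsets.UTF_8);

        for (byte b : onTheWire) {
            // prints: 0xC3 0xBF 0xC3 0x98 0xC3 0xBF 0xC3 0xA0
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
    }
}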
Any ideas for things I should do or try?