TremoloSecurity / MyVirtualDirectory

Open Source LDAP Virtual Directory
Apache License 2.0

what is the correct way to do binary attributes like jpegPhoto? #82

Open wjcarpenter opened 4 years ago

wjcarpenter commented 4 years ago

I'm having problems supplying jpegPhoto attributes in LDAP results. I'd be happy to find out I was just doing something wrong.

The JPEG bytes are stored as a column in a MySQL table. I've tried VARBINARY and MEDIUMBLOB with similar (bad) results. If I save the column values to files, various applications are happy to display them as JPEGs.

When I return the results as the "jpegPhoto" attribute in an inetOrgPerson, the values the client receives are not correct (I'm looking at the bytes on the wire via Wireshark). It looks kind of like MyVD is taking that binary blob and delivering it UTF-8 encoded. Here is an example of the beginning of one of those JPEGs:

Database:

00000000: ffd8 ffe0 0010 4a46 4946 0001 0100 0001  ......JFIF......
00000010: 0001 0000 ffe1 002a 4578 6966 0000 4949  .......*Exif..II
00000020: 2a00 0800 0000 0100 3101 0200 0700 0000  *.......1.......
00000030: 1a00 0000 0000 0000 476f 6f67 6c65 0000  ........Google..
00000040: ffdb 0084 0003 0202 0808 0708 0809 0807  ................
00000050: 0707 0808 0708 0808 0708 0709 0807 0808  ................
00000060: 0808 0806 0808 0807 0807 0707 0707 0807  ................
00000070: 0708 0a07 0707 0809 0909 0707 0d0d 0a08  ................
00000080: 0d07 0809 0801 0304 0406 0506 0a06 060a  ................
00000090: 0f0d 0a0d 0f0d 0d0d 0d0d 0d0d 0d0d 0d0d  ................
000000a0: 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d  ................
000000b0: 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d 0d0d  ................

Wireshark:

00000000: c3bf c398 c3bf c3a0 0010 4a46 4946 0001  ..........JFIF..
00000010: 0100 0001 0001 0000 c3bf c3a1 002a 4578  .............*Ex
00000020: 6966 0000 4949 2a00 0800 0000 0100 3101  if..II*.......1.
00000030: 0200 0700 0000 1a00 0000 0000 0000 476f  ..............Go
00000040: 6f67 6c65 0000 c3bf c39b 00e2 809e 0003  ogle............
00000050: 0202 0808 0708 0809 0807 0707 0808 0708  ................
00000060: 0808 0708 0709 0807 0808 0808 0806 0808  ................
00000070: 0807 0807 0707 0707 0807 0708 0a07 0707  ................
00000080: 0809 0909 0707 0d0d 0a08 0d07 0809 0801  ................
00000090: 0304 0406 0506 0a06 060a 0f0d 0a0d 0f0d  ................

The bytes with the high bit set look like they have been translated to their UTF-8 encodings.
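
For comparison, decoding raw bytes with a single-byte charset and then re-encoding the resulting String as UTF-8 produces exactly this kind of doubling. Here is a minimal standalone illustration; the ISO-8859-1 decode is only an assumption that happens to reproduce the leading bytes of the capture, and I don't know where in the stack the translation actually happens:

import java.nio.charset.StandardCharsets;

public class JpegRoundTrip {
    public static void main(String[] args) {
        // First four bytes of the stored JPEG: FF D8 FF E0
        byte[] original = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};

        // Decode the raw bytes with a single-byte charset (assumption), then
        // re-encode the resulting String as UTF-8.
        String misread = new String(original, StandardCharsets.ISO_8859_1);
        byte[] reencoded = misread.getBytes(StandardCharsets.UTF_8);

        // Prints: C3 BF C3 98 C3 BF C3 A0 -- the same leading bytes in the capture.
        for (byte b : reencoded) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}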

Any ideas for things I should do or try?

wjcarpenter commented 4 years ago

I've narrowed this down a little bit. It looks like the data is already UTF-8 encoded by the time DumpTransaction writes this entry:

[2020-03-15 12:51:19,099][pool-6-thread-1] DEBUG DumpTransaction - [DUMP] Begin Post Search Entry - Filter=(&(|(owner=--ELIDED--)(shareto=--ELIDED--))(&(objectclass=inetOrgPerson)(|(cn=wy*)(mail=wy*)(sn=wy*))));Base=ou=contacts,o=gcontact;Scope=2;Attributes=[LDAPAttribute: {type='cn'}, LDAPAttribute: {type='commonname'}, LDAPAttribute: {type='mail'}, LDAPAttribute: {type='objectClass'}]
dn : uid=--ELIDED--,ou=contacts,o=gcontact
uid : --ELIDED--
owner : --ELIDED--
mail : --ELIDED--
commonname : --ELIDED--
cn : --ELIDED--
sn : --ELIDED--
telephonenumber : --ELIDED--
givenname : --ELIDED--
objectClass : inetOrgPerson
jpegphoto : \303\277\303\230\303\277\303\240^@^PJFIF^@^A^A....
myVdReturnEntry: true

That's from viewing the log in an Emacs buffer, so the representation of non-printable characters may look unfamiliar. This later log entry confirms it in a more conventional format:

[2020-03-15 12:51:19,119][pool-6-thread-1] DEBUG CODEC_LOG - Encoded message 
 MessageType : SEARCH_RESULT_ENTRY
Message ID : 2
    Search Result Entry
Entry
    dn: uid=--ELIDED--,ou=contacts,o=gcontact
    objectClass: 0x69 0x6E 0x65 0x74 0x4F 0x72 0x67 0x50 0x65 0x72 0x73 0x6F 0x6E 
    owner: --ELIDED--
    mail: --ELIDED--
    telephonenumber: --ELIDED--
    jpegphoto: 0xC3 0xBF 0xC3 0x98 0xC3 0xBF 0xC3 0xA0 0x00 0x10 0x4A 0x46 0x49 0x46 0x00 0x01 ...
    cn: --ELIDED--
    uid: --ELIDED--
    commonname: --ELIDED--
    givenname: --ELIDED--
    sn: --ELIDED--

wjcarpenter commented 4 years ago

I'm using MyVD 1.0.6. I tracked the problem down to net.sourceforge.myvd.inserts.jdbc.JdbcEntrySet.hasMore(). It assumes that values coming from the database are strings:

HashSet<String> attrib = attribs.get(ldapField);
if (attrib == null) {
    attrib = new HashSet<String>();
    attribs.put(ldapField,attrib);
}
String value = rs.getString(dbField);
if (! attrib.contains(value)) {
    attrib.add(value);
}

Later, when it converts things into LDAP attributes, it calls LDAPAttribute.addValue(String). It's probably somewhere inside the com.novell.ldap code that the UTF-8 encoding happens when the String is converted to a byte[].

HashSet vals = attribs.get(attribName);

if (vals.size() > 0) {
    LDAPAttribute attrib = new LDAPAttribute(attribName);
    ldapAttribs.add(attrib);
    Iterator<String> valIt = vals.iterator();
    while (valIt.hasNext()) {
        String nextValue = valIt.next();
        if (nextValue != null) {
            attrib.addValue(nextValue);
        } else {
            if (attrib.size() == 0) {
                ldapAttribs.remove(attrib);
            }
        }
    }
}
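
One way to confirm that the conversion happens inside com.novell.ldap would be to hand a mis-decoded String straight to LDAPAttribute and dump the bytes it stores. A small sketch, untested on my side; the expectation that the library keeps String values as their UTF-8 bytes is an assumption:

import com.novell.ldap.LDAPAttribute;

public class AddValueCheck {
    public static void main(String[] args) {
        // "\u00FF\u00D8" is what the first two JPEG bytes (FF D8) look like
        // once they have been mis-decoded into a Java String.
        LDAPAttribute attrib = new LDAPAttribute("jpegPhoto");
        attrib.addValue("\u00FF\u00D8");

        // If the library stores String values as their UTF-8 bytes (assumption),
        // this prints C3 BF C3 98, the doubled bytes seen on the wire.
        for (byte b : attrib.getByteValue()) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}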

I was able to test this by special-casing "jpegPhoto" in the JdbcEntrySet code above. However, I don't know the proper way to decide which attributes ought to be binary and which ought to be strings (or numbers or whatever else LDAP allows).

I'm guessing that the LDAP schema info read by MyVD tells what the target type should be. The JDBC ResultSetMetaData would give the DB type. A complete and generic implementation would have a matrix of mappings of various DB types to various LDAP types, but maybe a good heuristic would be just using the DB type to decide binary vs string. The LDAPAttribute class only provides for byte[] and String attribute values anyhow.
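
For illustration, here is a rough sketch of that heuristic; the helper and its placement are hypothetical, and I haven't tried wiring anything like it into JdbcEntrySet:

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;

import com.novell.ldap.LDAPAttribute;

// Hypothetical helper, not part of MyVD: choose byte[] vs String per column
// based on the JDBC type reported by ResultSetMetaData.
public class BinaryAwareColumnReader {

    public static void addColumnValue(LDAPAttribute attrib, ResultSet rs, String dbField)
            throws SQLException {
        ResultSetMetaData md = rs.getMetaData();
        int sqlType = md.getColumnType(rs.findColumn(dbField));

        boolean binary = sqlType == Types.BINARY
                || sqlType == Types.VARBINARY
                || sqlType == Types.LONGVARBINARY
                || sqlType == Types.BLOB;

        if (binary) {
            // getBytes() hands back the raw column bytes with no charset decode.
            byte[] value = rs.getBytes(dbField);
            if (value != null) {
                attrib.addValue(value);
            }
        } else {
            String value = rs.getString(dbField);
            if (value != null) {
                attrib.addValue(value);
            }
        }
    }
}

The existing de-duplication through HashSet<String> would also need a byte[]-friendly replacement, since byte arrays compare by identity rather than content in a HashSet.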

wjcarpenter commented 4 years ago

After a little more poking around ....

Since JdbcInsert seems to read all DB values using getString(), I think it's up to the JDBC driver to decide how to provide that String. In my case, it's MySQL. After some limited looking at the MySQL driver source, I believe it tries to use the collation character encoding for the column. For the binary datatypes, there is no collation character encoding, so it falls back to the JVM default charset. In my case, that's UTF-8. In other words, I think the JDBC driver is taking the contents of my binary column and converting it to a Java String on the assumption that it's UTF-8 encoded. Of course, it's not UTF-8 encoded, so the interpretation is faulty and might even end up with whatever a UTF-8 decoder does when it sees bytes it doesn't understand.

It's the result of that first UTF-8 decode that then gets converted back to a byte[] when the String is handed to the LDAPAttribute class. If my original binary data were actually UTF-8 encoded, I think that decode/encode round trip would probably give back the correct binary data. That wouldn't help, though, since that would still put the attribute value on the wire as UTF-8 encoded.

I haven't yet been able to duplicate in standalone Java the series of decoding and encoding that I theorize is going on. But even if I did, the situation seems a bit fragile. My plan now is to switch to storing the jpegPhoto data in my database as something safe (base64 or hex strings or something) that can be carried around as an ASCII-compatible string. I'll then use an insert to convert that ASCII-compatible string to the binary value that I need for the LDAP clients. I'm pretty sure that will work (Insert.postSearchEntry() seems to behave the way I want it to).
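
The conversion itself is simple. Here is a minimal sketch of what that insert would do per entry, assuming the JLDAP LDAPEntry/LDAPAttribute API and glossing over how it hooks into Insert.postSearchEntry():

import java.util.Base64;

import com.novell.ldap.LDAPAttribute;
import com.novell.ldap.LDAPAttributeSet;
import com.novell.ldap.LDAPEntry;

// Sketch only: rebuild jpegPhoto from its base64 string form. Wiring this into
// a MyVD insert's postSearchEntry() is not shown.
public class JpegPhotoBase64Sketch {

    public static void decodeJpegPhoto(LDAPEntry entry) {
        LDAPAttribute stored = entry.getAttribute("jpegPhoto");
        if (stored == null) {
            return;
        }

        // Build a replacement attribute whose values are the decoded bytes.
        LDAPAttribute decoded = new LDAPAttribute("jpegPhoto");
        for (String base64 : stored.getStringValueArray()) {
            decoded.addValue(Base64.getDecoder().decode(base64));
        }

        // Swap it into the attribute set (assumes getAttributeSet() exposes
        // the entry's live set, which behaves as a java.util.Set).
        LDAPAttributeSet attrs = entry.getAttributeSet();
        attrs.remove(stored);
        attrs.add(decoded);
    }
}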

If all that works out, then the JdbcInsert handling of binary data still looks like a bug to me, but one that will no longer be important to my use case.

mlbiam commented 4 years ago

Thanks for the great details. Need to add a check against the binary attributes list in the jdbc insert and load as a blob when needed

wjcarpenter commented 4 years ago

The technique of storing the JPEGs as base64 and converting the values with an insert worked out well. The new insert is in the collection of inserts mentioned in issue #78.