Fully support surrogates symbol in Strings

vladislav-sidorovich commented 2 years ago

    @Test
    void evilString() throws Exception {
        String messageText = "Hey Aerospike! Let's store the string 🙏.";
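        // The emoji occupies indices 38-39 as a surrogate pair; cutting at 39 keeps only the high surrogate.
        String evilString = messageText.substring(0, 39) + "...";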

        Key key = new Key("TEST", "map-strings", "test-key");

        String binName = "map-test";

        Map<Value,Value> inputMap = new HashMap<Value,Value>();
        inputMap.put(Value.get("text"), Value.get(evilString));
        inputMap.put(Value.get("type"), Value.get("missing data"));

        // Write values to empty map.
        aerospikeClient.operate(new WritePolicy(), key,
                MapOperation.putItems(MapPolicy.Default, binName, inputMap)
        );

        Record record = aerospikeClient.get(new Policy(), key);
        Map<?, ?> storedMap = record.getMap(binName);

        Assert.assertEquals("missing data", storedMap.get("type"));
        Assert.assertEquals(evilString, new String((byte[]) storedMap.get("text")));
    }

The root cause of the issue: https://github.com/aerospike/aerospike-client-java/blob/6fcfb23f7946b078427a197d5b3f828d0ee7fe53/client/src/com/aerospike/client/command/Buffer.java#L163

The effect is here: https://github.com/aerospike/aerospike-client-java/blob/8251d673a6ec573e662541cd6f045241db164467/client/src/com/aerospike/client/util/Packer.java#L403,L410

1. int size = Buffer.estimateSizeUtf8(val) + 1; returns some value X.
2. The value X is packed into the buffer.
3. offset += Buffer.stringToUtf8(val, buffer, offset); returns some value Y.
4. offset advances by Y.
5. X != Y, so the data in the buffer is corrupted because the written regions overlap.
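
The mismatch described in the steps above can be illustrated with a minimal sketch. This is not the client's code: estimateSize and bytesWritten are hypothetical stand-ins, written on the assumption that a size pass and a write pass treat an unpaired surrogate differently. The real Buffer.estimateSizeUtf8 and Buffer.stringToUtf8 may differ in detail.

```java
import java.nio.charset.StandardCharsets;

class SizeMismatchSketch {
    // Hypothetical size pass (NOT the client's implementation): walks UTF-16
    // units and charges 3 bytes for a surrogate that has no partner.
    static int estimateSize(String s) {
        int size = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                size += 1;
            } else if (c < 0x800) {
                size += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                size += 4; // a valid pair encodes one 4-byte code point
                i++;       // skip the low half
            } else {
                size += 3; // this branch also catches unpaired surrogates
            }
        }
        return size;
    }

    // Hypothetical write pass: the JDK encoder emits a single '?' byte
    // for an unpaired surrogate.
    static int bytesWritten(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String wellFormed = "ok \uD83D\uDE4F"; // "ok " followed by U+1F64F
        String malformed = "ok \uD83D";        // dangling high surrogate
        System.out.println(estimateSize(wellFormed) + " vs " + bytesWritten(wellFormed)); // 7 vs 7
        System.out.println(estimateSize(malformed) + " vs " + bytesWritten(malformed));   // 6 vs 4
    }
}
```

When the two passes disagree like this, any length written ahead of the payload no longer matches the bytes that follow it.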

Reference implementation: https://github.com/openjdk/jdk11u-dev/blob/c1411113b396f468963a1deacc3b57ed366e735a/src/java.base/share/classes/java/lang/StringCoding.java#L924-L950, or java.lang.String#encodeUTF8_UTF16 in Amazon Corretto 18

Notes: What are surrogates? https://unicode.org/faq/utf_bom.html#utf16-2
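
For readers unfamiliar with surrogates, here is a small sketch of how the string in the test above ends up malformed. The class name SurrogateBasics is hypothetical, and the emoji is written with \u escapes (U+1F64F is the pair D83D DE4F) so that the example does not depend on the source file encoding.

```java
class SurrogateBasics {
    // Same text as the test above, with the emoji spelled as its UTF-16 pair.
    static final String MESSAGE =
            "Hey Aerospike! Let's store the string \uD83D\uDE4F.";

    // Cuts the message between the two halves of the surrogate pair.
    static String cutInsidePair() {
        return MESSAGE.substring(0, 39);
    }

    public static void main(String[] args) {
        // One code point outside the BMP takes two char values.
        System.out.println(MESSAGE.length());                           // 41
        System.out.println(MESSAGE.codePointCount(0, MESSAGE.length())); // 40

        String cut = cutInsidePair();
        char last = cut.charAt(cut.length() - 1);
        // The last char is a high surrogate with nothing after it.
        System.out.println(Character.isHighSurrogate(last));             // true
    }
}
```

A plain substring call works on char indices, so it can split a pair without any warning.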

BrianNichols commented 2 years ago

This will be fixed in the next client release.

BrianNichols commented 2 years ago

The code sample creates a malformed string. I think the best solution is to detect and throw an exception in estimateSizeUtf8() when the string is malformed.
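
A check of this kind could look roughly like the sketch below. It is an illustration only: the names firstUnpairedSurrogate and requireWellFormed and the choice of IllegalArgumentException are assumptions, not the client's actual API or exception type.

```java
class WellFormedCheck {
    // Returns the index of the first unpaired surrogate, or -1 if the
    // string is well-formed UTF-16.
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low half
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    // Hypothetical guard: rejects malformed input before any encoding happens.
    static void requireWellFormed(String s) {
        int index = firstUnpairedSurrogate(s);
        if (index >= 0) {
            throw new IllegalArgumentException("Unpaired surrogate at index " + index);
        }
    }

    public static void main(String[] args) {
        requireWellFormed("fine \uD83D\uDE4F"); // passes silently
        try {
            requireWellFormed("broken \uD83D...");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Unpaired surrogate at index 7
        }
    }
}
```

Callers that would rather not see an exception could run a check like firstUnpairedSurrogate themselves and repair or reject the string before handing it to the client.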

vladislav-sidorovich commented 2 years ago

From my point of view, an exception will be better than a corrupted document. At the same time, java.lang.String doesn't throw an exception. Also, I can send/receive such a string via REST (http).

So I can send and receive such strings, and I can process them in my service code, but I can't store them in long-term storage (Aerospike). That is a bit confusing, isn't it?

If aerospike-client were able to process such strings, that would be the best option for me.

BrianNichols commented 2 years ago

Java's getBytes(StandardCharsets.UTF_8) modifies malformed strings by writing a "?" in place of the unpaired surrogate when converting to UTF-8. When the UTF-8 bytes are converted back into a string, the result no longer matches the original string. This will cause problems for applications that test these strings for equality. In the interest of safety, the client will throw an exception when malformed strings are encountered in estimateSizeUtf8().
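
The round-trip mismatch is easy to reproduce with the JDK alone. The sketch below uses only standard String and StandardCharsets calls; the class name RoundTripSketch and the helper roundTrip are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class RoundTripSketch {
    // Encodes to UTF-8 and decodes back, as a store-then-read cycle would.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String wellFormed = "abc\uD83D\uDE4F"; // complete surrogate pair
        String malformed = "abc\uD83D";        // dangling high surrogate

        System.out.println(roundTrip(wellFormed).equals(wellFormed)); // true
        System.out.println(roundTrip(malformed).equals(malformed));   // false
        System.out.println(roundTrip(malformed));                     // abc?
    }
}
```

Because the substitution is silent, an application comparing the stored value with the original would see a difference with no indication of where it came from.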

BrianNichols commented 2 years ago

Java client 6.1.3 is released: https://download.aerospike.com/download/client/java/notes.html

aerospike / aerospike-client-java

Fully support surrogates symbol in Strings #222