aerospike / aerospike-client-java

Aerospike Java Client Library

Fully support surrogates symbol in Strings #222

Closed vladislav-sidorovich closed 2 years ago

vladislav-sidorovich commented 2 years ago
    @Test
    void evilString() throws Exception {
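        String messageText = "Hey Aerospike! Let's store the string 🙏.";
        // 🙏 is a surrogate pair at indices 38-39, so cutting at index 39 keeps only
        // the high surrogate and produces a malformed string.
        String evilString = messageText.substring(0, 39) + "...";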

        Key key = new Key("TEST", "map-strings", "test-key");

        String binName = "map-test";

        Map<Value,Value> inputMap = new HashMap<Value,Value>();
        inputMap.put(Value.get("text"), Value.get(evilString));
        inputMap.put(Value.get("type"), Value.get("missing data"));

        // Write values to empty map.
        aerospikeClient.operate(new WritePolicy(), key,
                MapOperation.putItems(MapPolicy.Default, binName, inputMap)
        );

        Record record = aerospikeClient.get(new Policy(), key);
        Map<?, ?> storedMap = record.getMap(binName);

        Assert.assertEquals("missing data", storedMap.get("type"));
        Assert.assertEquals(evilString, new String((byte[]) storedMap.get("text")));
    }

The root cause of the issue: https://github.com/aerospike/aerospike-client-java/blob/6fcfb23f7946b078427a197d5b3f828d0ee7fe53/client/src/com/aerospike/client/command/Buffer.java#L163

The effect is here: https://github.com/aerospike/aerospike-client-java/blob/8251d673a6ec573e662541cd6f045241db164467/client/src/com/aerospike/client/util/Packer.java#L403,L410

  1. int size = Buffer.estimateSizeUtf8(val) + 1; returns some value X
  2. the value X is packed into the buffer
  3. offset += Buffer.stringToUtf8(val, buffer, offset); returns some value Y
  4. offset is advanced by Y
  5. X <> Y, so the data in the buffer is corrupted because the written regions overlap
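To make the X/Y mismatch concrete, here is a minimal sketch using only the JDK. The estimator below is hypothetical and is NOT the client's Buffer.estimateSizeUtf8(); it only assumes an estimator that treats every high surrogate as the start of a valid pair. The writer is stood in for by the JDK's own encoder. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class SizeMismatchSketch {
    // Hypothetical estimator (an assumption, not the client's code): it trusts
    // that every high surrogate is followed by a low surrogate, counts 4 bytes
    // for the "pair" and skips the next char without checking it.
    static int estimateSize(String s) {
        int size = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                size += 1;
            } else if (c < 0x800) {
                size += 2;
            } else if (Character.isHighSurrogate(c)) {
                size += 4;
                i++; // skips whatever follows, even if it is not a low surrogate
            } else {
                size += 3;
            }
        }
        return size;
    }

    // Stand-in for the writer: the JDK encoder, which emits a single
    // replacement byte for an unpaired surrogate.
    static int encodedSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Same string as in the test above; the emoji is written as escapes.
    static String evilString() {
        String messageText = "Hey Aerospike! Let's store the string \uD83D\uDE4F.";
        return messageText.substring(0, 39) + "...";
    }

    public static void main(String[] args) {
        String evil = evilString();
        int x = estimateSize(evil);
        int y = encodedSize(evil);
        System.out.println("X = " + x + ", Y = " + y + ", mismatch = " + (x != y));
    }
}
```

Whenever the two passes disagree on how to treat the unpaired surrogate, the length written in step 2 no longer describes the bytes written in step 3.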

Reference implementation: https://github.com/openjdk/jdk11u-dev/blob/c1411113b396f468963a1deacc3b57ed366e735a/src/java.base/share/classes/java/lang/StringCoding.java#L924-L950 or java.lang.String#encodeUTF8_UTF16 in Amazon Corretto 18

Notes: What are surrogates? https://unicode.org/faq/utf_bom.html#utf16-2

BrianNichols commented 2 years ago

This will be fixed in the next client release.

BrianNichols commented 2 years ago

The code sample creates a malformed string. I think the best solution is to detect and throw an exception in estimateSizeUtf8() when the string is malformed.
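As a rough illustration of what "detect and throw" could look like, here is a sketch of a well-formedness check. It is not the client's actual implementation: the class name, method name, and the choice of IllegalArgumentException are assumptions made for this example.

```java
class SurrogateCheckSketch {
    // Hypothetical helper: throws if the string contains a surrogate that is
    // not part of a valid high/low pair.
    static void requireWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1));
                if (!paired) {
                    throw new IllegalArgumentException(
                            "Unpaired high surrogate at index " + i);
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException(
                        "Unpaired low surrogate at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        requireWellFormed("store the string \uD83D\uDE4F."); // valid pair, passes
        try {
            requireWellFormed("store the string \uD83D..."); // truncated pair
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A check like this runs in a single pass over the string, so it could share the loop that estimates the encoded size.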

vladislav-sidorovich commented 2 years ago

From my point of view, an exception is better than a corrupted document. At the same time, java.lang.String doesn't throw an exception. Also, I can send and receive such a string via REST (HTTP).

So I can send and receive such strings and process them in my service's code, but I can't store them in long-term storage (Aerospike). That is a bit confusing, isn't it?

The best option for me would be for aerospike-client to handle such strings.

BrianNichols commented 2 years ago

Java's getBytes(StandardCharsets.UTF_8) replaces the unpaired surrogate in a malformed string with "?" when converting to UTF-8. When the UTF-8 bytes are converted back into a string, the result no longer matches the original string. This causes problems for applications that test these strings for equality. In the interest of safety, the client will throw an exception when malformed strings are encountered in estimateSizeUtf8().
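The round-trip mismatch can be reproduced with the JDK alone. The sketch below uses a shortened malformed string; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class RoundTripSketch {
    // Encode to UTF-8 and decode back, the way a store-then-read cycle would.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // High surrogate with no low surrogate after it.
        String malformed = "string \uD83D...";
        String restored = roundTrip(malformed);

        System.out.println(restored);                   // string ?...
        System.out.println(restored.equals(malformed)); // false
    }
}
```

A well-formed string, including one with a complete surrogate pair, survives the same round trip unchanged, so only malformed input triggers the mismatch.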

BrianNichols commented 2 years ago

Java client 6.1.3 is released: https://download.aerospike.com/download/client/java/notes.html