Java bindings assume strings are utf-8

stephenh commented 6 years ago

Despite the watchman docs noting that in bser strings may not be utf-8, and are unknown encoding due to the nature of file system apis, currently the java deserializer assumes utf-8:

https://github.com/facebook/watchman/blob/master/java/src/com/facebook/watchman/bser/BserDeserializer.java#L236

This causes exceptions when some rogue process (I haven't been able to find out which yet), writes non-utf-8 file names:

Caused by: java.nio.charset.MalformedInputException: Input length = 1
        at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
        at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:813)
        at com.facebook.watchman.bser.BserDeserializer.deserializeString(BserDeserializer.java:236)
        at com.facebook.watchman.bser.BserDeserializer.deserializeRecursiveWithType(BserDeserializer.java:332)
        at com.facebook.watchman.bser.BserDeserializer.deserializeTemplate(BserDeserializer.java:302)
        at com.facebook.watchman.bser.BserDeserializer.deserializeRecursiveWithType(BserDeserializer.java:338)

I suppose this is expected, but it blows up the entire deserialize, so it's not really easy/possible for me (AFAICT?) to recover from this. I have to stop my process, go find the rogue file, delete it, and then restart the process.

Is there an alternative way for BserDeserializer to handle this that would let me continue processing? Maybe just log an error and return null? I know that is also surprising, but knowing name is nullable is at least a check I can perform.

wez commented 6 years ago

Ah, that's unfortunate. The purist in me would say that that function should really return a byte array, but I recognize that strings are generally more convenient. I'm not much of a java person; is there a reasonable Either<String, Byte[]> type that we could employ here instead?

stephenh commented 6 years ago

No, Java is not hip enough yet to have Either baked into the standard library.

Granted, what bser returns is just a map of string key -> object value, so if deserializeString was changed to return Object, it could return the byte[] after catching the encoding exception.

This would work for the deserializeString that is used for the map value; deserializeString is also used for the map key, but it seems ~somewhat "safe" to just do a cast from the now-Object return type to a string, with the assumption that the keys should also be controlled/utf-8 values.

For clients of the library, this would be their map.get("name") could be either a string or byte[]. So, still a boundary case they'd have to handle, but seems reasonable given the current map/untyped API they use anyway.

mkillianey commented 6 years ago

If you're looking for a temporary workaround, maybe you could avoid the exceptions by changing the CodingErrorAction in the utf8Encoder to IGNORE or REPLACE?

I assume that would cause you to get events where the filename was sometimes-close-but-wrong, so I don't know if that's better or worse than what you're experiencing now.

stephenh commented 6 years ago

@mkillianey yeah, thanks for the suggestion. For what I'm doing specifically, I'm syncing files between machines, so having a slightly-wrong filename would contribute to the oddness, so I'd see that as a ~potentially valid (but not really) file, and try to read it and create it on the remote side.

I think just being honest about "...yeah, this isn't utf-8" and returning either a string or a byte[] would be the best.

I can put together a PR if that sounds good?

wez commented 6 years ago

@stephenh Yes please; I think it's much easier to see and understand the change when presented with real code, so a PR would be great! I am leaning towards the either string|byte approach, and I'm mostly wondering how that ends up looking for the consumers of the library.

facebook / watchman

Java bindings assume strings are utf-8 #643