Open stephenh opened 6 years ago
Ah, that's unfortunate. The purist in me would say that that function should really return a byte array, but I recognize that strings are generally more convenient.
I'm not much of a java person; is there a reasonable Either<String, Byte[]>
type that we could employ here instead?
No, Java is not hip enough yet to have Either
baked into the standard library.
Granted, what bser returns is just a map of string key -> object value, so if deserializeString
was changed to return Object
, it could return the byte[]
after catching the encoding exception.
This would work for the deserializeString
that is used for the map value; deserializeString
is also used for the map key, but it seems ~somewhat "safe" to just do a cast from the now-Object return type to a string, with the assumption that the keys should also be controlled/utf-8 values.
For clients of the library, this would be their map.get("name")
could be either a string or byte[]
. So, still a boundary case they'd have to handle, but seems reasonable given the current map/untyped API they use anyway.
If you're looking for a temporary workaround, maybe you could avoid the exceptions by changing the CodingErrorAction
in the utf8Encoder
to IGNORE
or REPLACE
?
I assume that would cause you to get events where the filename was sometimes-close-but-wrong, so I don't know if that's better or worse than what you're experiencing now.
@mkillianey yeah, thanks for the suggestion. For what I'm doing specifically, I'm syncing files between machines, so having a slightly-wrong filename would contribute to the oddness, so I'd see that as a ~potentially valid (but not really) file, and try to read it and create it on the remote side.
I think just being honest about "...yeah, this isn't utf-8" and returning either a string or a byte[]
would be the best.
I can put together a PR if that sounds good?
@stephenh Yes please; I think it's much easier to see and understand the change when presented with real code, so a PR would be great! I am leaning towards the either string|byte approach, and I'm mostly wondering how that ends up looking for the consumers of the library.
Despite the watchman docs noting that in bser strings may not be utf-8, and are unknown encoding due to the nature of file system apis, currently the java deserializer assumes utf-8:
https://github.com/facebook/watchman/blob/master/java/src/com/facebook/watchman/bser/BserDeserializer.java#L236
This causes exceptions when some rogue process (I haven't been able to find out which yet), writes non-utf-8 file names:
I suppose this is expected, but it blows up the entire
deserialize
, so it's not really easy/possible for me (AFAICT?) to recover from this. I have to stop my process, go find the rogue file, delete it, and then restart the process.Is there an alternative way for
BserDeserializer
to handle this that would let me continue processing? Maybe just log an error and returnnull
? I know that is also surprising, but knowing name is nullable is at least a check I can perform.