Closed lloeki closed 1 year ago
The changes LGTM
We could extract the encoding step into an intermediary variable to make the code more readable.
encoded_val = val.to_s.encode('utf-8', invalid: :replace, undef: :replace)
val = encoded_val[0, max_string_length] if max_string_length
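The suggestion above can be tried in isolation with a minimal sketch like the one below. The method name `sanitize_for_waf` and the keyword default are hypothetical, chosen only for illustration; the two inner statements follow the suggested lines.

```ruby
# Hypothetical helper wrapping the suggested two-step approach.
# The name and the default limit are assumptions made for this sketch.
def sanitize_for_waf(val, max_string_length: 4096)
  # Step 1: convert to UTF-8, replacing bytes that are invalid or have no
  # UTF-8 equivalent with the Unicode replacement character (U+FFFD).
  encoded_val = val.to_s.encode('utf-8', invalid: :replace, undef: :replace)

  # Step 2: truncate by characters, so a multibyte character is never split.
  max_string_length ? encoded_val[0, max_string_length] : encoded_val
end

puts sanitize_for_waf("a\xFFb".b).inspect
puts sanitize_for_waf("\u00E9" * 10, max_string_length: 4).inspect
```

Keeping the encoded string in its own variable makes the order of operations explicit: conversion happens before truncation, so the length limit is counted in UTF-8 characters.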
What does this PR do?
Motivation
The original finding was that libddwaf is able to handle input that contains characters that don't match the original encoding. Since we truncate at 4096 chars as understood by the original string encoding (e.g. ASCII-8BIT), this may result in truncated multibyte characters (e.g. if the string contains UTF-8 characters) being passed to libddwaf as a C string (really more of a byte array).
On a match, libddwaf will then return, in the value and highlight fields, a C string that is understood by Ruby as being UTF-8 and that contains the original byte array. The occurrence of an incomplete character produced a JSON.dump exception.
More generally, the original string may thus theoretically contain characters that:
By converting to UTF-8 we enable:
libddwaf
libddwaf returns UTF-8 strings
If the original string contains characters that don't make sense in the original encoding and thus cannot be converted, we replace them with the standard Unicode \u{FFFD} character, which is meant for that purpose. Keeping the original ones would not make sense, as no reliable semantics could possibly be inferred for libddwaf to make sense of the data.
Additional Notes
value and highlight keys, but it would not solve the semantic understanding of the original string by libddwaf, and would still hit conversion issues when things are serialized to JSON later on.
libddwaf input data should be UTF-8 but I can't recall conversion or replacement specifics.
How to test the change?
Specs have been added to cover these cases.
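As a rough illustration of the cases described under Motivation (this is not the spec code added by the PR), the sketch below contrasts truncating a binary string before conversion with converting it first. The method names, the `SAMPLE` constant and the small limit of 4 are hypothetical, and labelling the truncated bytes as UTF-8 only simulates what is described as coming back from libddwaf.

```ruby
require 'json'

# Two copies of "abc" followed by a 2-byte UTF-8 character, tagged as binary
# (ASCII-8BIT) so that Ruby counts and slices bytes rather than characters.
SAMPLE = ("abc\u00E9" * 2).b.freeze

# Old order (simulated): truncate as understood by the binary encoding,
# then treat the resulting bytes as UTF-8.
def truncate_then_label(bytes, max)
  bytes[0, max].force_encoding(Encoding::UTF_8)
end

# New order: convert to UTF-8 with replacement first, then truncate by characters.
def convert_then_truncate(bytes, max)
  bytes.encode('utf-8', invalid: :replace, undef: :replace)[0, max]
end

# Returns false when the json gem refuses to serialize the string.
def json_dumpable?(str)
  JSON.dump(str)
  true
rescue JSON::GeneratorError
  false
end

bad  = truncate_then_label(SAMPLE, 4)   # ends with half of a multibyte character
good = convert_then_truncate(SAMPLE, 4) # ends with a complete U+FFFD character

puts "truncate first: valid=#{bad.valid_encoding?} json=#{json_dumpable?(bad)}"
puts "convert first:  valid=#{good.valid_encoding?} json=#{json_dumpable?(good)}"
```

Note that when the input is tagged ASCII-8BIT, every byte above 0x7F counts as unconvertible, so each one becomes U+FFFD even if the bytes would have formed valid UTF-8.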