Handle invalid encoding

What does this PR do?

Ensure the string is encoded as UTF-8
Ensure the conversion succeeds by replacing characters that are invalid in the original encoding

Motivation

The original finding was that libddwaf is able to handle input containing characters that don't match the original encoding. Since we truncate at 4096 characters as counted in the original string encoding (e.g. ASCII-8BIT), a multibyte character (e.g. a UTF-8 one) may be cut in the middle, and the truncated bytes are then passed to libddwaf as a C string (really a byte array).

On a match, libddwaf then returns, in the value and highlight fields, a C string that Ruby interprets as UTF-8. That string contains the original byte array, so the occurrence of an incomplete character caused a JSON.dump exception.
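The failure mode described above can be reproduced in plain Ruby. This is only a sketch: the variable names are made up for illustration, and the `force_encoding` step stands in for the C string to Ruby string conversion rather than calling libddwaf.

```ruby
# Bytes are UTF-8, but the string is tagged ASCII-8BIT (binary).
input = ("a" * 4095 + "\u00E9").b   # 4097 bytes; "é" is the two bytes C3 A9

# In ASCII-8BIT one character is one byte, so this keeps 4096 bytes
# and cuts "é" after its first byte.
truncated = input[0, 4096]

# Reinterpret the same bytes as UTF-8, standing in for what happens
# to the C string handed back on a match (an assumption of this sketch).
returned = truncated.dup.force_encoding(Encoding::UTF_8)

puts returned.valid_encoding?       # false: the trailing C3 byte is incomplete
```

A string in this state is what later reached JSON.dump.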

More generally the original string may thus theoretically contain characters that:

are valid in the original encoding but not in UTF-8
are valid or invalid in the original encoding but result in valid but different UTF-8 characters
are not valid in the original encoding but are valid UTF-8, except that they may be cut mid-character, since truncation assumes the mismatched original encoding rather than the encoding the bytes were actually written in, resulting in invalid UTF-8
are actual binary data
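The first two cases can be illustrated with a couple of bytes. ISO-8859-1 is picked here purely as an example of an "original encoding"; it is an assumption of this sketch, not something the change is specific to.

```ruby
# Case 1: valid in the original encoding, invalid as UTF-8.
latin1_e = "\xE9".b.force_encoding(Encoding::ISO_8859_1)   # "é" in ISO-8859-1
valid_as_latin1 = latin1_e.valid_encoding?                                    # true
valid_as_utf8   = latin1_e.dup.force_encoding(Encoding::UTF_8).valid_encoding? # false

# Case 2: the same bytes are valid in both encodings but mean different things.
bytes = "\xC3\xA9".b
read_as_latin1 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
read_as_utf8   = bytes.dup.force_encoding(Encoding::UTF_8)

puts read_as_latin1.length   # 2 characters: U+00C3 and U+00A9
puts read_as_utf8.length     # 1 character: U+00E9
```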

By converting to UTF-8 we enable:

proper truncation at the UTF-8 character level
proper character semantic processing by libddwaf
consistency with the C string to Ruby string conversion, which assumes libddwaf returns UTF-8 strings
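As a rough illustration of the first point, truncating a string that is already UTF-8 counts characters rather than bytes, so a multibyte character is either kept whole or dropped whole. The variable names are illustrative; only the 4096 limit comes from the description above.

```ruby
# Same content as the problematic input, but tagged as UTF-8 this time.
utf8_input = "a" * 4095 + "\u00E9" + "tail"

# String#[] on a UTF-8 string indexes by character, not by byte.
utf8_truncated = utf8_input[0, 4096]

puts utf8_truncated.length           # 4096 characters
puts utf8_truncated.bytesize         # 4097 bytes: "é" was kept whole
puts utf8_truncated.valid_encoding?  # true
```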

When the original string has characters that don't make sense in the original encoding and thus cannot be converted, we replace them with U+FFFD (\u{FFFD}), the standard Unicode replacement character meant for that purpose. Keeping the original bytes would not help, as no reliable semantics can be inferred from them for libddwaf to make sense of the data.
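One way to express this conversion in Ruby is sketched below. The helper name `to_utf8_lossy` is hypothetical and this is not necessarily how the PR implements it; it only relies on the standard `String#encode` and `String#scrub` methods.

```ruby
# Hypothetical helper for illustration, not the gem's actual code.
def to_utf8_lossy(str)
  if str.encoding == Encoding::UTF_8
    # Already tagged UTF-8: only invalid byte sequences need replacing.
    str.scrub("\u{FFFD}")
  else
    # Transcode, replacing bytes that are invalid in the source encoding
    # or have no UTF-8 equivalent.
    str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\u{FFFD}")
  end
end

binary_in = "abc\xE9".b                                         # ASCII-8BIT with a high byte
latin1_in = "\xE9".b.force_encoding(Encoding::ISO_8859_1)       # convertible: "é"
broken_in = "abc\xC3".b.force_encoding(Encoding::UTF_8)         # truncated UTF-8

puts to_utf8_lossy(binary_in) == "abc\u{FFFD}"   # true
puts to_utf8_lossy(latin1_in) == "\u00E9"        # true
puts to_utf8_lossy(broken_in) == "abc\u{FFFD}"   # true
```

The separate `scrub` branch is a defensive choice for strings that already claim to be UTF-8, where there is no source encoding to transcode from.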

Additional Notes

It may be that this could be handled at the libddwaf output level, e.g. by maintaining the original string encoding for the value and highlight keys, but that would not solve the semantic understanding of the original string by libddwaf, and would still hit conversion issues when things are serialized to JSON later on.
I seem to recall that it was specified that libddwaf input data should be UTF-8 but I can't recall conversion or replacement specifics.

How to test the change?

Specs have been added to cover these cases.

DataDog / libddwaf-rb

Handle invalid encoding #33