Closed brianhuffman closed 6 years ago
This change ended up being rather more pervasive than I had planned. Nearly all uses of String
in llvm-pretty are replaced with ByteString
. This patch also requires (mostly minor) changes in several other packages: llvm-pretty-bc-parser, llvm-verifier, crucible-llvm, and saw-script. I have pushed branches named bytestring
in all of their respective repositories.
The changes in the other packages are probably a good thing overall, because they basically mean that unicode text and identifiers will be handled properly throughout the whole system.
We'll want to do a major version increment next time a release is done, would you mind making that change to the cabal file now, so I don't forget?
I thought it might be worthwhile to consider another alternative before merging. Instead of requiring consumers of LLVM ASTs to do UTF-8 decoding, we could instead have the LLVM bitcode parser do UTF-8 decoding as it builds the AST.
I'm not necessarily trying to argue against using ByteString
; I just want to make sure we make a well-informed, deliberate decision here.
(I also started a discussion on our Galois Mattermost SAW channel; I'll post any conclusions from that here before taking any further actions.)
The version number is at 0.7.3 now, so should it go up to 1.0?
0.8 should be fine.
That's a good point -- we could just do the encoding when rendering out the bitcode. It adds a bit of complexity to the bitcode parser, but it would keep the API for llvm-pretty basically the same.
I now have an alternative pull request for the other option: GaloisInc/llvm-pretty-bc-parser#64. This one is nice because the changes are confined to a single package, and no APIs need to change.
I like the second option a bit more, it confines the formatting to pretty-printing, and restores the original string during parsing from the bitcode. Should we close this PR, or were there changes you needed to integrate with the bitcode parser?
Yes, I think we should close this one. It was a reasonable idea to try, but it ended up being kind of a mess.
LLVM does not know anything about UTF8 or any other text encoding; it treats strings as just lists of bytes, and the LLVM string syntax can encode arbitrary sequences of bytes. For this reason, ByteString is a more honest representation of LLVM strings than type String.
Frontends like clang may perform UTF8 encoding on names when producing LLVM files; consumers of LLVM AST should then perform UTF8 decoding on ByteStrings to produce a final String or Text value.