Convert uses of type `String` in LLVM AST to `ByteString`.

GaloisInc / llvm-pretty

An llvm pretty printer inspired by the haskell llvm binding

Other

28 stars 15 forks source link

Convert uses of type `String` in LLVM AST to `ByteString`. #23

Closed brianhuffman closed 6 years ago

brianhuffman commented 6 years ago

LLVM does not know anything about UTF8 or any other text encoding; it treats strings as just lists of bytes, and the LLVM string syntax can encode arbitrary sequences of bytes. For this reason, ByteString is a more honest representation of LLVM strings than type String.

Frontends like clang may perform UTF8 encoding on names when producing LLVM files; consumers of LLVM AST should then perform UTF8 decoding on ByteStrings to produce a final String or Text value.

brianhuffman commented 6 years ago

This change ended up being rather more pervasive than I had planned. Nearly all uses of String in llvm-pretty are replaced with ByteString. This patch also requires (mostly minor) changes in several other packages: llvm-pretty-bc-parser, llvm-verifier, crucible-llvm, and saw-script. I have pushed branches named bytestring in all of their respective repositories.

The changes in the other packages are probably a good thing overall, because they basically mean that unicode text and identifiers will be handled properly throughout the whole system.

elliottt commented 6 years ago

We'll want to do a major version increment next time a release is done, would you mind making that change to the cabal file now, so I don't forget?

brianhuffman commented 6 years ago

I thought it might be worthwhile to consider another alternative before merging. Instead of requiring consumers of LLVM ASTs to do UTF-8 decoding, we could instead have the LLVM bitcode parser do UTF-8 decoding as it builds the AST.

I'm not necessarily trying to argue against using ByteString; I just want to make sure we make a well-informed, deliberate decision here.

(I also started a discussion on our Galois Mattermost SAW channel; I'll post any conclusions from that here before taking any further actions.)

brianhuffman commented 6 years ago

The version number is at 0.7.3 now, so should it go up to 1.0?

elliottt commented 6 years ago

0.8 should be fine.

That's a good point -- we could just do the encoding when rendering out the bitcode. It adds a bit of complexity to the bitcode parser, but it would keep the API for llvm-pretty basically the same.

brianhuffman commented 6 years ago

I now have an alternative pull request for the other option: GaloisInc/llvm-pretty-bc-parser#64. This one is nice because the changes are confined to a single package, and no APIs need to change.

elliottt commented 6 years ago

I like the second option a bit more, it confines the formatting to pretty-printing, and restores the original string during parsing from the bitcode. Should we close this PR, or were there changes you needed to integrate with the bitcode parser?

brianhuffman commented 6 years ago

Yes, I think we should close this one. It was a reasonable idea to try, but it ended up being kind of a mess.