tmcdonell opened 10 years ago
I'd like a version of moduleLLVMAssembly that outputs a ByteString rather than a String (which is a silly datatype...). I could put in a pull request, but this is also a request for comments on what the name should be, whether this should just replace the existing String version, etc etc... Maybe ByteString versions of all of the things, or maybe some kind of class (which it looks like decodeM is using internally?).
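Basically, something of this shape (a sketch only; the name and the error-handling wrapper are exactly what's up for discussion):

    import Data.ByteString (ByteString)

    data Module  -- stand-in for llvm-general's Module type

    -- Hypothetical ByteString analogue of moduleLLVMAssembly; in the real
    -- library it would presumably run in the usual error monad.
    moduleLLVMAssemblyBS :: Module -> IO ByteString
    moduleLLVMAssemblyBS _ = error "sketch only"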
As a specific use case: moduleTargetAssembly does its thing and then converts from a ByteString (in LLVM.General.Internal.Module), which I then immediately convert back to a ByteString and hand to other processes.
My feel-pinion would be that ByteString is a better representation for (a) large chunks of text, and (b) things that are likely to interface with external tools.
Do you have a concrete performance measurement to drive this request?
I do not want to get into the mess that is Haskell string types without a strong motivating example.
Sure, Haskell String is a mess, but isn't that because it's a terrible default for long things? ByteString is already used in the module (and in this function!), so it seems like an obvious target to make everything consistent and fast.
I can work around this for the time being, but will put together an example once I have larger programs running properly.
(Also, I point out this ticket is titled as a request for comments/discussion.)
I don't think ByteString is an obvious target. I don't think there is an obvious target - that's what's so sad about the Haskell string mess. I chose String as the most standard of several evils.
Certainly fully forced long Strings are unfortunate (and I've probably done that in moduleLLVMAssembly), but there's a lot more to the performance side of the story - though I'm no expert.
ByteString is not unicode clean, and so not a good drop-in replacement for String in general, or in the llvm-general interface in particular. Using it would probably be faster. Using it everywhere would be consistent. It would also (as it does everywhere it's used) give up on static typing entirely and throw a bunch of bytes at the user of the API, burdening them and everyone downstream of them with getting the interpretation of those bytes right.

I'm happy to use it internally for what it is - a string of bytes (not characters) - because in doing so I codify my knowledge that the LLVM libraries use the UTF-8 encoding. In point of fact, ByteString might actually be the right option specifically for moduleTargetAssembly, specifically if LLVM deviates from its general use of UTF-8 in that case because of the definitions of one or more of the assembly languages. In that case ByteString, with its "here's some bytes, don't ask me how to interpret them" semantics, would be a better type.
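To make that concrete: the internal conversion is essentially the following (a sketch only; text's decoder is just one way to write it):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Turning LLVM's bytes into characters means committing to an
    -- encoding; internally I rely on LLVM emitting UTF-8. decodeUtf8
    -- throws if that assumption is ever violated.
    bytesToString :: B.ByteString -> String
    bytesToString = T.unpack . TE.decodeUtf8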
I think navigating this space is easier with a guide beacon - something specific to make better. Get me that concrete performance measurement, and I'll have something to work with. Maybe a subtle tweak or two to make sure the Strings you're worried about don't get forced 'til they're about to be consumed anyway, keeping the working set small - or even get optimized away entirely. Maybe another instance of the RawOStream interface so LLVM can write the bytes directly where you need them. Probably something else.
Please re-open this issue if or when you produce a concrete performance case.
Sorry this took so long; my project using llvm-general is still being written, so there were many things to get up and running...
Anyway, here is a benchmark for just the moduleTargetAssembly function, returning either String or ByteString. The test program is this ray tracer, which produces this target assembly (NVPTX):
benchmarking String
collecting 100 samples, 1 iterations each, in estimated 8.195496 s
mean: 75.40520 ms, lb 75.14194 ms, ub 75.72083 ms, ci 0.950
std dev: 1.481153 ms, lb 1.268655 ms, ub 1.778854 ms, ci 0.950
found 3 outliers among 100 samples (3.0%)
3 (3.0%) high mild
variance introduced by outliers: 12.316%
variance is moderately inflated by outliers
benchmarking ByteString
collecting 100 samples, 1 iterations each, in estimated 6.529689 s
mean: 65.79186 ms, lb 65.69811 ms, ub 65.95023 ms, ci 0.950
std dev: 612.7574 us, lb 415.9915 us, ub 905.0382 us, ci 0.950
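The harness is essentially the following (a sketch: genString and genByteString stand in for the two moduleTargetAssembly calls, and "kernel.ptx" is a placeholder input):

    import Criterion.Main
    import qualified Data.ByteString.Char8 as B8

    -- Stand-ins for the two code paths under test; the real benchmark
    -- calls moduleTargetAssembly at the two candidate return types.
    genString :: IO String
    genString = readFile "kernel.ptx"

    genByteString :: IO B8.ByteString
    genByteString = B8.readFile "kernel.ptx"

    main :: IO ()
    main = defaultMain
      [ bench "String"     (nfIO genString)      -- fully forces the String
      , bench "ByteString" (nfIO genByteString)  -- fully forces the ByteString
      ]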
I tested whether the String was being efficiently consumed by the following step, but sadly it was not. Rewrite rules might accomplish this, but that approach tends to be fragile.
For reference, the actual computation time for an 800x600 scene is:
benchmarking ray
collecting 100 samples, 1 iterations each, in estimated 102.2201 s
mean: 11.07879 ms, lb 11.02496 ms, ub 11.15489 ms, ci 0.950
std dev: 324.9567 us, lb 248.8021 us, ub 499.7387 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
1 (1.0%) high severe
variance introduced by outliers: 23.865%
variance is moderately inflated by outliers
So, you can do a lot in 10 ms.
I don't have the option to reopen the ticket, so I leave that to your discretion.
If I understand right, you've pointed me at a program which, through Accelerate, could be passed to several backends. I dug around in the repository, and found a directory for an LLVM backend, but it contained no code.
If you've pointed me at a use of llvm-general, I can't find it. To speed up the data path you're concerned with, I need to see the code for that data path.
You have deduced correctly. I'm currently developing the code as part of an internship. I have permission to release it (hence the currently empty accelerate-llvm repo) but, for various reasons, only towards the end of my internship (around the end of the month).
The particular test case I gave takes the moduleTargetAssembly output and then compiles it into something executable on the GPU, using a different FFI function, loadDataEx. So in this case the use of String takes a hit twice, since we've really gone ByteString -> String -> ByteString. (The input to loadDataEx really is just a bunch of bytes, because it accepts both text and binary inputs.)
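In sketch form the current path looks like this (the primed names are stand-ins for the real functions; error-monad unwrapping elided):

    import qualified Data.ByteString.Char8 as B8

    data TargetMachine  -- stand-in for llvm-general's TargetMachine
    data Module         -- stand-in for llvm-general's Module
    data CUDAModule     -- stand-in for the CUDA module handle

    moduleTargetAssembly' :: TargetMachine -> Module -> IO String
    moduleTargetAssembly' = error "stand-in for llvm-general's version"

    loadData' :: B8.ByteString -> IO CUDAModule
    loadData' = error "stand-in for Foreign.CUDA.Driver.Module.loadData"

    -- The double conversion: LLVM's bytes become a String inside
    -- moduleTargetAssembly, then get packed straight back into a
    -- ByteString for the CUDA loader.
    compileToGPU :: TargetMachine -> Module -> IO CUDAModule
    compileToGPU tm m = do
      ptx <- moduleTargetAssembly' tm m  -- ByteString -> String internally
      loadData' (B8.pack ptx)            -- String -> ByteString again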
I can point you to the actual implementation once it's released.
You could either wait 'til you can point me to the code, or make a stripped-down and/or tweaked example. From my perspective the latter is better: as much as your project looks interesting, it's more efficient for me if I don't have to learn it to understand the example. Can you tell me the signature of loadDataEx? How does the underlying function you're calling through the FFI handle memory ownership? Does the ByteString you pass to it get copied to something else under your control, or consumed as is (or by something opaque)?
The implementation of loadDataEx is here:
https://github.com/tmcdonell/cuda/blob/master/Foreign/CUDA/Driver/Module.chs#L178
It treats the ByteString as a read-only chunk of memory that gets passed to C. It doesn't do anything with it other than pass it over the FFI; it doesn't take ownership of it or modify it in any way.
It's probably not important to know anything about loadDataEx; I just mentioned it because that is my particular use case.
Actually, even simpler is this: https://github.com/tmcdonell/cuda/blob/master/Foreign/CUDA/Driver/Module.chs#L165
It is the same as loadDataEx, except you don't get any options. As you can see, it just takes the bag of bits that is the ByteString and passes it to the foreign function.
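In sketch form, the marshaling in that binding amounts to the following (approximating the c2hs output; the stand-in names are mine):

    import qualified Data.ByteString as B
    import Foreign.C.String (CString)

    data Module  -- stand-in for the CUDA module handle

    cuModuleLoadData' :: CString -> IO Module
    cuModuleLoadData' = error "stand-in for the c2hs-generated FFI call"

    -- useAsCString copies the ByteString exactly once, to append the NUL
    -- terminator the driver API expects, then hands the pointer to C.
    loadData :: B.ByteString -> IO Module
    loadData img = B.useAsCString img cuModuleLoadData'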
Can you change or add to Module.chs? The marshaling code in there copies the ByteString just to add the null termination required by the NVIDIA function, and my code is currently doing another copy to get the data from the stream structure LLVM writes into, into the ByteString.

If you could expose the NVIDIA function as a (Ptr CChar -> IO Module) (skipping the marshaling from the ByteString), I could provide a version of moduleTargetAssembly something like (... -> (Ptr CChar -> IO a) -> IO a) which would add the null terminator at the end of the LLVM stream and then pass the pointer straight to you, skipping two memcpys. I really don't know whether that'll have much additional impact in your case, but if I'm going to tweak the API I'd prefer an approach like that, which doesn't force the extra copies.
It's too bad NVIDIA doesn't expose more in their API, really. Even with my suggested approach there'd be some dumb copying going on in the stream I'm using.
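In outline, what I'm proposing looks like this (all names hypothetical):

    import Foreign.Ptr (Ptr)
    import Foreign.C.Types (CChar)

    data TargetMachine  -- stand-in for llvm-general's TargetMachine
    data Module         -- stand-in for llvm-general's Module
    data CUDAModule     -- stand-in for the CUDA module handle

    -- Hypothetical llvm-general side: write the assembly into LLVM's
    -- stream, append the NUL terminator, and run the continuation over
    -- the buffer in place.
    withModuleTargetAssembly
      :: TargetMachine -> Module -> (Ptr CChar -> IO a) -> IO a
    withModuleTargetAssembly _ _ _ = error "sketch only"

    -- Hypothetical cuda-package side: pass the pointer straight through,
    -- skipping the useAsCString copy.
    loadDataFromPtr :: Ptr CChar -> IO CUDAModule
    loadDataFromPtr _ = error "sketch only"

    -- Composed, the PTX goes from LLVM's buffer to the driver with no
    -- intermediate Haskell-side copies.
    compile :: TargetMachine -> Module -> IO CUDAModule
    compile tm m = withModuleTargetAssembly tm m loadDataFromPtr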
Ah right, I forgot about the extra copy in useAsCString. I can make a version of loadData* that takes a Ptr CChar instead of a ByteString.
I did notice that internally you use a lot of these streaming-type functions, including to build Strings efficiently (which was pretty cool). Maybe that can be exposed cleanly in some way (although not as the default). That could provide a uniform interface for the other functions that currently output String as well, in case others ever need it.
(For example, in a different context I managed to work around the use of moduleLLVMAssembly by using moduleBitcode, because the NVIDIA functions are weird and accept either text or binary as input.)
I have added module loading functions that read directly from a Ptr, e.g.:
https://github.com/tmcdonell/cuda/blob/master/Foreign/CUDA/Driver/Module.chs#L185
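These have roughly the following shape (signature approximate; see the linked Module.chs for the real binding):

    import Foreign.Ptr (Ptr)
    import Data.Word (Word8)

    data Module  -- stand-in for the CUDA module handle

    -- Approximate signature of the new pointer-based loader.
    loadDataFromPtr :: Ptr Word8 -> IO Module
    loadDataFromPtr _ = error "sketch only"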