dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

mlc_llm refactor (no more mlc_chat) #451

rgobbel opened this issue 3 months ago

rgobbel commented 3 months ago

I've been trying to get some version of Mixtral-8x7B-Instruct-v0.1 running on my 64GB AGX Orin box. The first failure was "model type mixtral not supported", even though mixtral appears in the list of models supported by mlc_chat. I've almost gotten it working, and I've tracked the confusion down to a major refactor of the mlc_llm codebase just two weeks ago, in which mlc_chat simply disappeared and was replaced by mlc_llm.

dusty-nv commented 3 months ago

Hi @rgobbel, it does appear in the mlc_chat supported models list in the dustynv/mlc:r36.2.0 container, but it sounds like it's not actually working. I need to update/rebuild the MLC version and probably update the patches for it too. They had been changing over to a new model builder (mlc_llm.build vs mlc_chat).

rgobbel commented 3 months ago

I ran the stages that had been handled by mlc_llm.build by hand, and with a little extra massaging (specifying the chat template on the command line, for example), it's working! It definitely has lower latency than Llama-2-70b, and seems to do at least as well with respect to content.

dusty-nv commented 3 months ago

Oh that's great @rgobbel! What kind of tokens/sec do you get out of it? On Llama-2-70B I get a max of ~5 tokens/sec on AGX Orin 64GB

rgobbel commented 3 months ago

> Oh that's great @rgobbel! What kind of tokens/sec do you get out of it? On Llama-2-70B I get a max of ~5 tokens/sec on AGX Orin 64GB

I haven't actually tried to do that measurement as yet. How do you recommend doing it in the Jetson containers?

dusty-nv commented 3 months ago

@rgobbel - it would be like this: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc#benchmarks

That is basically just a wrapper around MLC's benchmark script. I believe it should work with both mlc_llm.build and mlc_chat based models (and if not, there is mlc_chat bench for the latter).
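
If you want to sanity-check the numbers outside the wrapper, a rough sketch using the mlc_chat Python API directly would be something like this (the model path is a placeholder for wherever your converted model lives):

```python
# Rough sketch: pull prefill/decode rates straight from the mlc_chat Python API.
# Assumes the model has already been converted/compiled under dist/ -- the
# model path below is a placeholder, not the exact artifact name.
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="dist/Mixtral-8x7B-Instruct-v0.1-q4f16_1")

# run one generation so there is something to measure
cm.generate(prompt="Write a short story about a robot.",
            progress_callback=StreamToStdout(callback_interval=2))

# stats() reports the runtime's prefill/decode tokens-per-second
print(cm.stats())
```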

rgobbel commented 3 months ago

Ok, here's what I got:

| model | quantization | input tokens | output tokens | prefill time (s) | prefill rate (tokens/s) | decode time (s) | decode rate (tokens/s) | memory |
|---|---|---|---|---|---|---|---|---|
| Llama-2-7b-chat-hf-q4f16_ft | q4f16_ft | 16 | 128 | 0.33 | 47.87 | 2.69 | 47.63 | 816.75 |
| Llama-2-70b-chat-hf-q4f16_ft | q4f16_ft | 16 | 128 | 3.20 | 5.00 | 25.60 | 5.00 | 899.19 |
| Mixtral-8x7B-Instruct-v0.1-q4f16_1 | q4f16_1 | 16 | 128 | 1.34 | 11.99 | 6.10 | 21.00 | 1042.19 |

Full results including prompts and outputs: llm-benchmarks.zip

Note: because I was hand-coding a few bits that really should be more automated, just to make sure it worked at all, this was compiled without several possibly important optimizations, including CUDA graph execution, flash attention, and separate embedding. The new API makes things a bit confusing, but I'm working on it.
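
For reference, in the old mlc_llm.build flow those optimizations were plain command-line switches; a sketch of what I mean (the flag names are from that older builder and may not map 1:1 onto the refactored CLI):

```python
# Sketch only: how the optimizations mentioned above were exposed as flags in
# the older mlc_llm.build path. Flag names are assumptions from that builder
# and may not match the refactored mlc_llm/mlc_chat tooling.
import subprocess

subprocess.run([
    "python3", "-m", "mlc_llm.build",
    "--model", "Llama-2-7b-chat-hf",   # placeholder model name
    "--quantization", "q4f16_ft",
    "--target", "cuda",
    "--use-cuda-graph",                # CUDA graph execution
    "--use-flash-attn-mqa",            # flash attention
    "--sep-embed",                     # separate embedding
], check=True)
```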

dusty-nv commented 3 months ago

Thanks @rgobbel, that 21 tokens/sec for Mixtral-8x7B looks good and is consistent with what I've heard from other people trying it through MLC. I should add it to the LLM benchmarks on Jetson AI Lab. If you get it running faster, let me know.

I would say to just use my local_llm API, because it wraps up a lot of the model-builder and API stuff in MLC (including transparent support for both mlc_llm.build and mlc_chat); however, I don't have support in it yet for SWA inference (the sliding-window attention that Mistral uses).
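
Roughly, the intended usage is a from_pretrained-style call that handles the MLC build behind the scenes; a sketch (argument names are approximate and may differ from what's in the container):

```python
# Sketch of the local_llm Python API (argument names are approximate and may
# differ slightly from the version shipped in the container).
from local_llm import LocalLM

# downloads, quantizes, and builds the model through MLC on first use
model = LocalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                api='mlc', quantization='q4f16_ft')

print(model.generate("Once upon a time,", streaming=False))
```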

rgobbel commented 3 months ago

Ok, exactly which local_llm image is that (with mixtral support working correctly)?

The default image (dustynv/local_llm:r36.2.0) tries to use mlc_llm.build, which then errors out, partly because mixtral is not in its list of supported model types. As I recall, simply hand-patching it to include mixtral didn't work either. I got a working model by modifying packages/llm/local_llm/models/mlc.py to call mlc_chat convert, mlc_chat get_config, and mlc_chat compile with arguments that worked for each of them, but I didn't see a higher-level function that called all of those correctly, so some flags were not set as I'd have liked.
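
Concretely, the sequence I ran by hand was roughly the following (a sketch: the subcommand names follow the refactored mlc_chat CLI, and the quantization and conversation-template values are placeholders rather than the exact arguments I used):

```python
# Sketch of the three-stage flow run by hand (subcommand names as in the
# refactored mlc_chat CLI; paths, quantization, and the conv-template value
# are placeholders).
import subprocess

MODEL = "Mixtral-8x7B-Instruct-v0.1"        # HF checkpoint directory (placeholder)
OUT = f"dist/{MODEL}-q4f16_1"

def run(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# 1) quantize/convert the weights
run("mlc_chat", "convert_weight", MODEL, "--quantization", "q4f16_1", "-o", OUT)

# 2) generate mlc-chat-config.json (this is where the chat template is set)
run("mlc_chat", "gen_config", MODEL, "--quantization", "q4f16_1",
    "--conv-template", "mistral_default", "-o", OUT)

# 3) compile the model library for the target GPU
run("mlc_chat", "compile", f"{OUT}/mlc-chat-config.json",
    "--device", "cuda", "-o", f"{OUT}/{MODEL}-q4f16_1-cuda.so")
```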

I've tried building tvm and mlc_llm in various ways, but it always seems to run into one roadblock or another. I'm currently wrestling with this (on a bare-metal build of mlc_llm):

/usr/local/lib/python3.10/dist-packages/tvm/3rdparty/cutlass_fpA_intB_gemm/cutlass_kernels/../weightOnlyBatchedGemv/kernel.h(362): error: identifier "__hfma2" is undefined
                      v = __hfma2(*reinterpret_cast<half2*>(weights_f16 + y), *reinterpret_cast<half2*>(in_v + y), v);
                          ^
          detected during:
            instantiation of "void tensorrt_llm::kernels::weight_only_batched_gemv<QType,WeightOnlyFlag,ActOp,Zero,Bias,NPerBlock,Batch,BlockSize>(const uint8_t *, const half *, const half *, const half *, const half *, half *, int, int, int) [with QType=tensorrt_llm::kernels::WeightOnlyQuantType::Int8b, WeightOnlyFlag=tensorrt_llm::kernels::WeightOnlyPerChannel, ActOp=tensorrt_llm::kernels::GeluActivation, Zero=true, Bias=true, NPerBlock=2, Batch=3, BlockSize=256]" at line 436
            instantiation of "void tensorrt_llm::kernels::WeightOnlyBatchedGemvKernelLauncher<QType, WeightOnlyFlag, ActOp, Zero, Bias, NPerBlock, Batch, BlockSize>::run(const tensorrt_llm::kernels::WeightOnlyParams &, cudaStream_t) [with QType=tensorrt_llm::kernels::WeightOnlyQuantType::Int8b, WeightOnlyFlag=tensorrt_llm::kernels::WeightOnlyPerChannel, ActOp=tensorrt_llm::kernels::GeluActivation, Zero=true, Bias=true, NPerBlock=2, Batch=3, BlockSize=256]" at line 24 of /usr/local/lib/python3.10/dist-packages/tvm/3rdparty/cutlass_fpA_intB_gemm/weightOnlyBatchedGemv/weightOnlyBatchedGemvBs3Int8b.cu

For some reason, it has a hard time finding the cuDNN includes and libraries. Anyway, I'd much rather have a Docker image that just works.

dusty-nv commented 3 months ago

@rgobbel either use dustynv/local_llm:r36.2.0 or dustynv/mlc:r36.2.0 (both use MLC commit 607dc5a), but use the MLC commands/libraries directly instead of my local_llm wrappers (as mentioned above, I don't have support for Mistral/Mixtral and SWA inferencing in those yet).

rgobbel commented 3 months ago

I've managed to get both Mistral and Mixtral built, and Mixtral works very well, but for some reason the Mistral models don't work with the web chat agent, even though the "chat" command of mlc_llm/mlc_chat works fine. The Mistral models (but not Mixtral) wind up missing the embed function, and as you mentioned there's no separate embedding support for Mistral.

There's another minor issue with Mistral: there is no way to tell it that there is no sliding window. One feature of Mistral-7B-Instruct-v0.2 is that it drops the sliding window, so this needs to be a parameter that can be passed in.
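
As a sketch of what I mean, something like this against the generated mlc-chat-config.json would cover it (the field name is an assumption and may differ between MLC versions):

```python
# Sketch: disable sliding-window attention by editing the generated config.
# The field name ("sliding_window_size") is an assumption and may differ
# between MLC versions; older configs may call it "sliding_window".
import json

cfg_path = "dist/Mistral-7B-Instruct-v0.2-q4f16_1/mlc-chat-config.json"  # placeholder path

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["sliding_window_size"] = -1   # -1 meaning "no sliding window"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```
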

I'd be happy to submit PRs for any of this if I could manage to get a clean build. The farthest I've gotten with the latest version of mlc_llm runs into a problem in compiling tvm/3rdparty/cutlass_fpA_intB_gemm as part of the mlc_llm build, as mentioned above. If you have any suggestions about this issue I'd love to hear them.