Open cmcmaster1 opened 2 weeks ago
Hi @cmcmaster1, that's expected even if you instantiate the same `TransformersLLM` instance, since the `load` method, which is what actually loads the LLM into memory, is called within each `Task`. That's expected and desired if the LLMs you're trying to load are different, but when the LLM is the same, it's true that replicating the same instance more than once makes no sense.

Could you share more information about your particular case? Why do you need to load the same LLM more than once? Is there any clear benefit to that? Happy to help and look for a solution! 🤗

cc @gabrielmbmb @plaguss
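To illustrate the behaviour described above with a minimal toy sketch (hypothetical names, not distilabel's real classes): each task calls `load()` on the LLM it holds, so even a single shared instance gets loaded once per task.

```python
# Toy sketch of why the model is loaded once per task: each Task calls
# load() on the LLM it was given. Class names here are illustrative.
class ToyLLM:
    def __init__(self, model_name):
        self.model_name = model_name
        self.load_count = 0

    def load(self):
        # In the real TransformersLLM this is where the weights hit memory.
        self.load_count += 1


class ToyTask:
    def __init__(self, llm):
        self.llm = llm

    def run(self):
        self.llm.load()  # called inside the task, once per task


llm = ToyLLM("llama-3-70b-instruct")
tasks = [ToyTask(llm) for _ in range(4)]
for task in tasks:
    task.run()

print(llm.load_count)  # the shared instance is still loaded 4 times
```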
I'm replicating the DEITA pipeline (https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/deita/) from the docs using `TransformersLLM` (Llama 3 70B Instruct) in place of `OpenAILLM`, and therefore using it for several steps in the pipeline.
Hi @cmcmaster1, when using `TransformersLLM` you need to load the model for each task. If you're using the same model for all the `Task`s and want to load it just once, I would recommend serving Llama 3 Instruct with the vLLM server, and then using `OpenAILLM` in the 4 tasks with `base_url` pointing to the vLLM server.
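The shape of that setup, as a sketch (the URL, port, model name, and task names below are illustrative assumptions, and plain dicts stand in for the actual `OpenAILLM` arguments): one server holds the weights, and every task's LLM config points at it.

```python
# Sketch of the suggested setup: one vLLM OpenAI-compatible server, with the
# four pipeline tasks all pointing their OpenAILLM at it via base_url.
# The URL, model name, and task names are illustrative assumptions.

VLLM_BASE_URL = "http://localhost:8000/v1"  # where the vLLM server listens
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

# One config per task; the model itself is loaded only once, by the server.
task_llm_configs = {
    task: {"base_url": VLLM_BASE_URL, "model": MODEL}
    for task in (
        "evol_instruct",
        "evol_quality",
        "complexity_scorer",
        "quality_scorer",
    )
}

# Every task talks to the same server, so there is a single copy in memory.
assert len({cfg["base_url"] for cfg in task_llm_configs.values()}) == 1
print(len(task_llm_configs))
```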
Yeah, that was how I was originally doing it, but vLLM lacks bitsandbytes support and all the other quants are pretty awful. Ah well, it looks like there is a PR adding bitsandbytes support to vLLM, so I'll wait for that / build from that PR.
Oh, the vLLM bitsandbytes support is only single GPU.
I've tried to implement a singleton pattern for the `TransformersLLM`, but it's slow going. I'll try again later.
If I take something like the DEITA pipeline from the docs and replace `OpenAILLM` with `TransformersLLM`, running the pipeline will load my HF transformers model 4 times. Am I doing something wrong here? This seems like highly undesirable behaviour.