Open cmcmaster1 opened 2 weeks ago
Hi @cmcmaster1, that's expected even if you instantiate the same `TransformersLLM` instance, since the `load` method, which is what actually loads the LLM into memory, is called within each `Task`. That's expected and desired if the LLMs you're trying to load are different, but when the LLM is the same, it's true that replicating the same instance more than once makes no sense.

Could you share more information about your particular case? Why do you need to load the same LLM more than once? Is there any clear benefit to that? Happy to help and look for a solution! 🤗

cc @gabrielmbmb @plaguss
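To illustrate the behaviour described above with a minimal toy sketch (hypothetical names, not distilabel's real classes): each task calls `load()` on the LLM it holds, so even a single shared instance gets loaded once per task.

```python
# Toy sketch of why the model is loaded once per task: each Task calls
# load() on the LLM it was given. Class names here are illustrative.
class ToyLLM:
    def __init__(self, model_name):
        self.model_name = model_name
        self.load_count = 0

    def load(self):
        # In the real TransformersLLM this is where the weights hit memory.
        self.load_count += 1


class ToyTask:
    def __init__(self, llm):
        self.llm = llm

    def run(self):
        self.llm.load()  # called inside the task, once per task


llm = ToyLLM("llama-3-70b-instruct")
tasks = [ToyTask(llm) for _ in range(4)]
for task in tasks:
    task.run()

print(llm.load_count)  # the shared instance is still loaded 4 times
```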
I'm replicating the DEITA pipeline (https://distilabel.argilla.io/latest/sections/pipeline_samples/papers/deita/) from the docs using `TransformersLLM` (Llama 3 70B Instruct) in place of `OpenAILLM`, and therefore using it for several steps in the pipeline.
Hi @cmcmaster1, when using `TransformersLLM` you need to load the model for each task. If you're using the same model for all the `Task`s and want to load it just once, I would recommend serving Llama 3 Instruct with the vLLM server, and then using `OpenAILLM` in the 4 tasks with `base_url` pointing to the vLLM server.
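The shape of that setup, as a sketch (the URL, port, model name, and task names below are illustrative assumptions, and plain dicts stand in for the actual `OpenAILLM` arguments): one server holds the weights, and every task's LLM config points at it.

```python
# Sketch of the suggested setup: one vLLM OpenAI-compatible server, with the
# four pipeline tasks all pointing their OpenAILLM at it via base_url.
# The URL, model name, and task names are illustrative assumptions.

VLLM_BASE_URL = "http://localhost:8000/v1"  # where the vLLM server listens
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

# One config per task; the model itself is loaded only once, by the server.
task_llm_configs = {
    task: {"base_url": VLLM_BASE_URL, "model": MODEL}
    for task in (
        "evol_instruct",
        "evol_quality",
        "complexity_scorer",
        "quality_scorer",
    )
}

# Every task talks to the same server, so there is a single copy in memory.
assert len({cfg["base_url"] for cfg in task_llm_configs.values()}) == 1
print(len(task_llm_configs))
```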
Yeah, that was how I was originally doing it, but vLLM lacks bitsandbytes support and all the other quants are pretty awful. Ah well, it looks like there is a PR adding bitsandbytes support to vLLM, so I'll wait for that / build from that PR.
Oh, the vLLM bitsandbytes support is only single GPU.
I've tried to implement a singleton pattern for the `TransformersLLM`, but it's slow going. I'll try again later.
If I take something like the DEITA pipeline from the docs and replace `OpenAILLM` with `TransformersLLM`, running the pipeline will load my HF transformers model 4 times. Am I doing something wrong here? This seems like highly undesirable behaviour.