fdschmidt93 / trident-nllb-llm2vec

Repository for "Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages"
MIT License

Release of Pre-trained models #2

Open ArkadeepAcharya opened 5 months ago

ArkadeepAcharya commented 5 months ago

Request for Release of Pretrained NLLB-LLM2Vec Model

Hello Team,

Could you please release the pretrained NLLB-LLM2Vec models mentioned in your paper on "Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages"? It would greatly benefit the community by facilitating further research.

Thank you for your contributions.

Best regards, Arkadeep Acharya

fdschmidt93 commented 5 months ago

Hi Arkadeep,

Thanks a lot for your interest in our work!

Yes, I very much plan on making the models available. :)

I am currently working on refining Stage 1, such that Stage 2 won't be necessary. My sincere hope is that I can then release a single pre-trained model which can easily be fine-tuned on any downstream task without task distillation for maximum performance.

In any case, I will make the self-supervised adapted model (S1) of the paper available asap. Unfortunately, directly fine-tuning that will only give you good performance if you have sizable training data (like for NLI, Belebele).

Cheers, Fabian

ArkadeepAcharya commented 5 months ago

Thanks Fabian! Looking forward to the model release!

fdschmidt93 commented 4 months ago

As a quick update: sharing the model on the Hugging Face Hub is surprisingly difficult, since it has to be correctly quantized and LoRAfied prior to loading the weights. transformers, peft, and bitsandbytes don't play that nicely together when setting up AutoModel.from_pretrained the conventional way. Unfortunately, none of this is really well documented.
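For reference, a minimal sketch of the kind of manual setup described above (quantize the base model with bitsandbytes, then LoRAfy it with peft before loading the adapted weights); the base model id and LoRA hyperparameters below are illustrative assumptions, not the exact NLLB-LLM2Vec recipe:

import torch
from transformers import AutoModel, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization via bitsandbytes (settings are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize the base model first ...
base = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base model id
    quantization_config=bnb_config,
)

# ... then attach LoRA adapters so that adapted weights can be loaded on top.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_config)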

Between having been sick and working on the more general model, I haven't yet had sufficient time to figure out how best to upload the model so that it is most easily used, i.e.

from transformers import AutoModel
model = AutoModel.from_pretrained("fdschmidt/nllb-llm2vec-v0.1")

I might have to package it more generally as an nn.Module (cf. https://huggingface.co/docs/hub/models-uploading#upload-a-pytorch-model-using-huggingfacehub). I'll be on vacation next week but will try to squeeze it in.

AlphaNumeric99 commented 1 month ago

Hey Fabian!

I know you are probably busy building some exciting stuff but have you had the chance to upload the weights? Even a link to S3 or Google drive is much appreciated.

My primary interest is finetuning it further.

Thanks

fdschmidt93 commented 1 month ago

Hi there,

I'm actually, as I write this, trying to iron out the very last issues ( famous last words :crossed_fingers: ) of an initial release of NLLB-LLM2Vec on Llama 3.1 8B.

That release will support seamless usage via

AutoModel.from_pretrained("fdschmidt93/...")
AutoModelForSequenceClassification.from_pretrained("fdschmidt93/...")
AutoModelForTokenClassification.from_pretrained("fdschmidt93/...")

Unfortunately, wrapping true custom models for the Hugging Face Hub is simply a bad developer experience (undocumented and unusual behavior, plus other issues like https://github.com/huggingface/transformers/pull/33844).

Anyhow, that release should ideally come close to the performance of S1+S2 while only requiring S1+FT (if that's unclear, please refer to the paper).
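To make the S1+FT path concrete, here is a hedged fine-tuning sketch against the planned Auto* interface; the repo id is the one released in the follow-up comment below, while the tokenizer call, num_labels, the toy data, and the hyperparameters are illustrative assumptions rather than the paper's setup:

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "fdschmidt93/NLLB-LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=3, trust_remote_code=True  # e.g. 3 NLI labels
)

# Toy rows standing in for a real downstream dataset such as NLI.
train = Dataset.from_dict(
    {"text": ["A man is eating.", "The cat sleeps."], "label": [0, 1]}
).map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nllb-llm2vec-ft", per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=train,
)
trainer.train()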

fdschmidt93 commented 1 month ago

https://huggingface.co/fdschmidt93/NLLB-LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse

Here is the model, with usage instructions in the README. The model doesn't have that much mileage on it yet, so I'd kindly ask you to quickly report any issues you run into, and I'll fix them ASAP :)
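For quick reference, a sketch of pulling sequence-level embeddings from the released checkpoint; the usage instructions in the model README are authoritative, and the output attribute and mean pooling below are common-case assumptions, not necessarily the prescribed recipe:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "fdschmidt93/NLLB-LLM2Vec-Meta-Llama-31-8B-Instruct-mntp-unsup-simcse"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

texts = ["The weather is lovely today.", "Das Wetter ist heute schön."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim), assumed output attribute

# Mask-aware mean pooling: one embedding per sequence.
mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)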

AlphaNumeric99 commented 1 month ago

@fdschmidt93 Thanks a lot!

ArkadeepAcharya commented 1 month ago

Hi @fdschmidt93, can you please clarify whether this is a Stage 1-trained model or whether the model has gone through both Stage 1 and Stage 2 training?

fdschmidt93 commented 1 month ago

Hi @ArkadeepAcharya

As stated in the README of the model, this version has yet to be fine-tuned for a downstream task. Hence, the model has only been trained through Stage 1. It should nevertheless perform notably better than Stage 1 in the paper, given how it has been trained.

Unfortunately, I won't be releasing Stage 2 models, as I will not be fine-tuning models per task due to lack of time and compute.

There would be an argument for doing something like GritLM (cf. the paper) and then distilling to get a single model for 'all tasks', but I don't have the capacity (GPUs, time) to do that. I invested a lot of time in improving the self-supervised stage as much as possible.

NLLB-LLM2Vec should be used if you need sequence-level embeddings for less-resourced languages that industry-level models (in terms of samples, supervision, etc.) like NVEmbed, GritLM, E5, or BGE don't cover, or in academic settings where you want to be more sure that the task has not leaked (although instruction fine-tuning of Llama itself may have leakage).

I hope this clarifies any questions you might have. Let me know if there's more follow-up you would like to discuss.