fe1ixxu / ALMA

State-of-the-art LLM-based translation models.
MIT License
395 stars 29 forks

Suggestion on foundation model #16

Closed cmp-nct closed 9 months ago

cmp-nct commented 10 months ago

I invested dozens of hours in trying to get the best translation results into German. Of all the models available for translation, ALMA-13B-LoRA is the best.

However, it's beaten by a margin by an old Falcon-40B fine-tune when the language is complicated, for example sardonic/sarcastic style and complex English word combinations. Falcon isn't flawless; it makes grammar mistakes from time to time, and that's the main flaw of all models except ALMA-13B-LoRA.

So I wonder: how much effort would it be to fine-tune Falcon-7B and Falcon-40B in an ALMA-style setup?

Here is one of them: "No people to distract you, just the display case full of shiny ba baub ba of shiny baubles and trinkets taking centre stage."

Alma 13 Lora: "Keine Leute um dich herum, nur ein Regal voller glänzender Bauble-Baubles und Trinkgeld, das im Mittelpunkt steht."

DeepL: "Noch keine Menschen in Sicht, nur ein Schaufenster voller glänzender Glitzer-Dingelchen und Kleinigkeiten, die im Rampenlicht stehen."

falcon-40b-sft-mix-1226 in 3.5 bit quantization (!) "Noch keine Menschen in Sicht, nur ein Schaufenster voller glänzender Glitzer-Dingelchen und Kleinigkeiten, die im Rampenlicht stehen."

Leo 13B: "Keine Menschen, die ablenken, nur ein Schaukasten voller glänzender Armbanduhren und Kronleuchter, die im Mittelpunkt stehen."

Falcon-40B squeezed into a tiny quantization is on the level of the commercial DeepL service on this one. Sadly, the ALMA variants failed: wrong translations of the word creation and of "trinkets", and the overall meaning wasn't really captured.

fe1ixxu commented 9 months ago

I apologize for the delayed response as I am currently attending EMNLP. First, I'd like to express my gratitude for your interest and the time you've dedicated to testing our models.

Regarding the training effort for Falcon-7B, it should be comparable to that of our ALMA-7B-Pretrain model. You can use the script mono_ft.sh, but make sure to replace --model_name_or_path meta-llama/Llama-2-7b-hf with --model_name_or_path tiiuae/falcon-7b. As for the 40B model, the effort required is significantly greater; I haven't personally attempted it, mainly due to CPU/GPU memory limitations, which result in a longer processing time.
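
For reference, here is a minimal Python sketch of what that swap amounts to for stage-1-style monolingual continued pretraining with the Hugging Face Trainer. It is not the repo's mono_ft.sh recipe: the dataset, sequence length, and hyperparameters below are illustrative assumptions, and a real run on a 7B model would additionally need DeepSpeed/FSDP (or similar) to fit the optimizer states.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "tiiuae/falcon-7b"  # the swap: replaces meta-llama/Llama-2-7b-hf
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Placeholder monolingual corpus; ALMA's stage 1 uses its own monolingual data mix.
raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:1%]")
raw = raw.filter(lambda ex: len(ex["text"].strip()) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="falcon-7b-mono-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()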

fe1ixxu commented 9 months ago

I also translated the same English sentence on my side, but I got a different result: "No people to distract you, just the display case full of shiny ba baub ba of shiny baubles and trinkets taking centre stage"

ALMA-13B-LoRA: Keine Menschen, die Sie ablenken, nur der Schaufenster voller glänzender ba baub ba von glänzenden baubeln und Souvenirs, die im Mittelpunkt stehen.

I am not a German speaker, but I put it into Google Translate and I feel it may be a better translation?

My env:
transformers: 4.35.0.dev0
accelerate: 0.24.0

My code:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-13B-Pretrain", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, "haoranxu/ALMA-13B-Pretrain-LoRA")
tokenizer = LlamaTokenizer.from_pretrained("haoranxu/ALMA-13B-Pretrain", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from English to German:\nEnglish: No people to distract you, just the display case full of shiny ba baub ba of shiny baubles and trinkets taking centre stage.\nGerman:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=200, truncation=True).input_ids.cuda()

# Translation
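# Note: num_beams=5 combined with do_sample=True performs beam-sample decoding, so outputs can vary between runs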
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=200, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

cmp-nct commented 9 months ago

Hi,

I used "Übersetzen Sie dies vom Englischen ins Deutsche:" ("Translate this from English into German:") as the prompt, and I had the "Description from a friend." multi-shot prefix. That likely explains the difference.
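
For illustration, a guess at how that German instruction slots into the prompt template from the code above, assuming the same English:/German: field labels (the "Description from a friend." multi-shot prefix is not reproduced here since its exact wording isn't given):

src = "No people to distract you, just the display case full of shiny ba baub ba of shiny baubles and trinkets taking centre stage."
prompt = f"Übersetzen Sie dies vom Englischen ins Deutsche:\nEnglish: {src}\nGerman:"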

However, I have seen very similar results to yours, and it's sadly not proper German. The word "Baubeln" does not exist; at least I have never heard it, and the dictionary didn't find it. The English source is very difficult language, and I was surprised to see Falcon handle it so well. "glänzender Glitzer-Dingelchen" is probably the best possible translation for it (I couldn't have done it myself, the English was too artistic).

I might really try a fine-tune with your script. Do you have any estimate of how many A100 hours (or A40, I guess) it might take for Falcon-7B?

fe1ixxu commented 9 months ago

I would say at most 300 GPU hours for training on 1B tokens. Please check the FAQ https://github.com/fe1ixxu/ALMA#when-should-i-stop-fine-tuning-at-stage-1 and our paper for more details! Thanks!