[IDEA] Include a better way to translate dataset?

ChrystianSchutz commented 10 months ago

I have used the default translation from the step 2, but sadly a lot of those translations at least from English to Polish are gibberish and absolutely terrible. https://huggingface.co/datasets/chrystians/Jestes?row=3

I want to create a thread to start a discussion about possible alternatives, obvious one would be something like AWS translate or DeepL. And to do that we would need to write a script for API integration, I also don't know how costly is it or if there are any better opensource alternatives.

There are currently around 1(0M) 9949085 characters in the oasst1 dataset

mpazdzioch commented 10 months ago

Agree about the translation quality. Can't judge for other languages but PL is not very good. I created LORAs for llama_7b and the new tinyllama_1b but most of the model answers in polish are funny but not really useful.

Here's the oasst1 dataset PL translation but I feel it's not good enough to work with really: https://huggingface.co/datasets/mpazdzioch/oasst1_pl

ErikTromp commented 10 months ago

Yes I fully agree - it seems that all languages that cannot be directly translated for (the majority of) the OASST1 dataset, deteriorate in performance heavily. I think there are 2 main paths to explore:

Train our own OPUS additions for missing pairs
Use a different, potentially even better model, like NLLB, Madlad or other T5-based models

I prefer the latter as it would greatly simply the code and hopefully also speed up runtime but we have to check the vRAM requirements and licenses of other models; anything not Apache 2 is basically a no-go and it'd be nice if we can keep allowing people to run on Colab.

It's time we make something of a roadmap but in all honestly we didn't foresee so many people using our work so we are trying to keep up :)

ChrystianSchutz commented 10 months ago

I saw there is also that there is a new model added for Polish right after I stared training, so I will try to use it again to translate dataset and compare: https://github.com/UnderstandLingBV/LLaMa2lang/blob/main/translate_oasst.py https://github.com/UnderstandLingBV/LLaMa2lang/blame/main/translate_oasst.py#L29 "en-pl": 'gsarti/opus-mt-tc-en-pl',

@mpazdzioch I also see that your dataset is larger, probably I made mistake somewhere in translation.

ErikTromp commented 10 months ago

You might try and be bold and use this branch that allows you to optionally use madlad: https://github.com/UnderstandLingBV/LLaMa2lang/tree/L2L-3_translation_alternatives

I just wrote the code and didn't test it but maybe you get lucky and it just works. It is a little heftier on the vRAM so be sure to check quantization and/or tweaking batch size.

ChrystianSchutz commented 10 months ago

You might try and be bold and use this branch that allows you to optionally use madlad: https://github.com/UnderstandLingBV/LLaMa2lang/tree/L2L-3_translation_alternatives I have tested it It it much faster for some parts (45~50 its/s), but on other parts I have (2~3 it/s) the same as before. Current branch:
Writing out checkpoint #7600 for source language en
9%|████████▏                                                                                  | 7960/88838 [03:10<26:46, 50.34it/s]
Writing out checkpoint #15600 for source language en
18%|███████████████▊                                                                        | 16000/88838 [45:20<6:34:10,  3.08it/s]

**Madlad branch:**

Writing out checkpoint #39200 for source language en 44%|███████████████████████████████████████▊ | 39300/88838 [17:02<12:28, 66.15it/s] Writing out checkpoint #400 for source language es 45%|███████████████████████████████████████▋ | 40100/88838 [21:46<4:11:52, 3.23it/s]


The same happens with the older version. About the vram ironicly the madlad is using around 20 gb vram, less then before, as before it was throttling on my 24gb vram (3090ti) using max 24gb.

ErikTromp commented 10 months ago

Not fully sure I get what you are saying because some markup is lost but I can agree madlad seems to be faster, yet you should be careful interpreting these progress bar reports as they overshoot when you continue from previous checkpoints and are also influence by the average length of your current batch, which sometimes might just be very small by coincidence.

Still, madlad is indeed a lot faster and by the looks of it, also better but it does come at a greater vRAM usage. Your 24gb usage throttling is just CUDA geing greedy - I could do OPUS on a 16GB card just fine but not madlad, not even quantized with 20 as batch_size.

laky commented 7 months ago

What would you suggest is a good set of parameters to use with madlad on a 24gb card?

One thing I noticed when using the translation is that it can run well for a while, well under the max memory, and then crash randomly later on as it runs out of vram. What causes these large fluctuations of the vram usage? Is it due to batching not being even, i.e. it seems to depend on the length of the input text, right? Would it make sense to improve the batching to better optimise the vram usage and prevent crashes?

ErikTromp commented 7 months ago

Yes it takes a long time, about 6-8 hours with OPUS, Madlad is heavier so slower. What you are facing is because by default we do not cap the max length of input sequences while batching them through the GPU. You should try capping a fixed size with the max_length parameter.

laky commented 7 months ago

Yeah, the long runtime I can live with, it's more about randomly running out of memory due to the input size after running fine for a few hours.

Does using max_length result in cutting off whatever doesn't fit into the limit? Or is it then processed in the next batch?

ErikTromp commented 7 months ago

It basically truncates the input sentences, which may seem like a bad idea but happens with any LLM wrt to context windows anyway. So right now the length of the sentence pretty much defines how much vRAM needs to be allocated but if you use a fixed max_length/batch_size combination, you cap the amount of vRAM used at any time, which should prevent OoM errors.

An additional benefit of using max_length is that torch will be able to reserve a fixed block of vRAM for every batch, which allows for batched (translation) inference on the GPU which should slightly improve runtimes too.

AI-Commandos / LLaMa2lang

[IDEA] Include a better way to translate dataset? #22