AI-Commandos / LLaMa2lang

Convenience scripts to finetune (chat-)LLaMa3 and other models for any language
Apache License 2.0

translate oasst error japanese #16

Closed: fznx922 closed this 8 months ago

fznx922 commented 8 months ago

Hey there, not sure if it's a configuration issue on my end, but I'm trying to create a Japanese dataset and it comes up to the end of the run, then starts loading up all my vRAM, goes until it can't fit any more, then dumps and starts again. Not sure if that's normal behavior? Should I just leave it?

Running command: python translate_oasst.py ja ja 500 20

(screenshot of the behavior attached: Screenshot 2024-01-03 104221)

ErikTromp commented 8 months ago

This is to be expected. We first take all messages of one source language (say, English) together and translate those in batches. Once done, that vRAM is freed and we proceed with the next source language (say, Spanish). You will see the translation models being loaded in the console, with brief releases of vRAM in between.
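For intuition, here is a minimal sketch of what that per-language loop looks like (illustrative only, not the repo's actual code; the OPUS model naming convention and the explicit cache clearing are assumptions):

```python
# Illustrative per-source-language batched translation, a sketch only.
# Model names assumed to follow Helsinki-NLP/opus-mt-{src}-{tgt}.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def translate_language(texts, src, tgt, batch_size=20, device="cuda"):
    model_name = f"Helsinki-NLP/opus-mt-{src}-{tgt}"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    results = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True).to(device)
        outputs = model.generate(**batch)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    # Free the vRAM before the next source language's model is loaded;
    # this load/free cycle is the up-and-down you see in the console.
    del model
    torch.cuda.empty_cache()
    return results
```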

As long as you keep getting new checkpoint files, you are good.

ErikTromp commented 8 months ago

Just to point out: the language code that OPUS uses for Japanese seems to be jap instead of jp.
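If you want to verify which code a pair is published under, you can probe the Hub directly (a small sketch; the candidate codes are just the ones discussed in this thread):

```python
# Check which Japanese language codes have a published OPUS-MT model.
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

for code in ("jp", "ja", "jap"):
    repo_id = f"Helsinki-NLP/opus-mt-{code}-en"
    try:
        model_info(repo_id)
        print(f"{repo_id}: exists")
    except RepositoryNotFoundError:
        print(f"{repo_id}: not found")
```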

fznx922 commented 8 months ago

Hey Erik, thanks for the reply! Originally I had used jap as my code but I was running into this error, so I tried jp on the second pass to check, but it was still doing it. Progress through the dataset is fine; the back and forth is on that last step of the generation config. It does the first one, then just goes up and down, not seeming to process. I left it for half an hour and it was still just loading and unloading the RAM in a loop, it seemed. When it was loading during the run it would only push maybe 2GB or so, but that last step is pulling all 12GB plus additional system RAM and not progressing. Hopefully I have explained that well enough; it just doesn't seem like it wants to move forward. Could my batch size be too high for my card?

Also, to add: scanning through the dataset JSON it made, it seems to be blank, not containing anything. Is that default behavior?

thank you for your help :)

ErikTromp commented 8 months ago

Well, it does them one language at a time, and usually there is a large portion of "rare" languages that only have a few records each, so it needs to load and unload models quite frequently, but it should never stall. The empty arrays only result when no translation is possible, not even through English.
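To make that fallback explicit, this is roughly the shape of the logic (a sketch; opus_model_exists and translate_batch are hypothetical helpers standing in for the real checks):

```python
# Sketch: try the direct pair first, then pivot through English;
# only when neither route exists does the result stay empty.
def translate_with_fallback(batch, src, tgt):
    if opus_model_exists(src, tgt):  # hypothetical helper
        return translate_batch(batch, src, tgt)
    if opus_model_exists(src, "en") and opus_model_exists("en", tgt):
        intermediate = translate_batch(batch, src, "en")
        return translate_batch(intermediate, "en", tgt)
    return []  # no route, not even through English -> empty array
```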

I will see if I can run Japanese some time this or next week to debug.

ErikTromp commented 8 months ago

I do see the Japanese models are a bit larger than average - have you tried using a batch size of 10 instead of 20? vRAM usage seems to go directly over 12GB, then stabilize after some GC at exactly 11.7GB, so with 12GB of vRAM you might just be running into a language pair where going through English goes a tad over (in which case you should get a CUDA OoM error instead of the livelock you are describing, though).
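Concretely, keeping the other arguments from your original command and only lowering the batch size:

```
python translate_oasst.py ja ja 500 10
```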

ErikTromp commented 8 months ago

I am running "jap" now; it seems like both "ja" and "jap" are Japanese. I see vRAM going up to 14.6GB but I am not seeing any errors yet, going strong on a 16GB GPU with a batch size of 20. I will let it run and publish the datasets once finished.

fznx922 commented 8 months ago

So I've tried again overnight. When I left it, it was running fine on batch = 10 and kept my vRAM usage at 8GB. Waking up this morning, it's now at a crawl and usage is pinned up again. It seems like it has progressed, but from a certain point it has become really, really slow.

Here's another screenshot. I did note in your original post that this would take a long time to produce; is that the case and I just need to wait for now?

With this run, looking at the JSON files, they actually seem to have data in them now, so I guess the earlier blanks must have been something on my end.

(screenshot attached: Screenshot 2024-01-04 085552)

ErikTromp commented 8 months ago

Yeah, 1-1.5s/it is pretty normal. I noticed when doing jap (almost done, btw) that EN-JAP is significantly faster; I don't know why, but I doubt it improves quality, as it probably has to do with an underfit model.

ErikTromp commented 8 months ago

I added the base translations for you; you should be able to take it from there. PS: I do think the quality is sub-par, see also this issue.

fznx922 commented 8 months ago

Thanks for the reply mate! It's quite interesting. Mainly I'm trying to use this method to pull up a model's coherence for JP > EN translation. I found some oasst2 Japanese datasets while looking on HF; is it possible to use something like these, for example, to feed into it?

https://huggingface.co/datasets/kunishou/oasst2-135k-ja https://huggingface.co/datasets/kunishou/oasst1-89k-ja

I do see some LLMs have previously been trained on the second one.

Thanks for all your efforts though, it has been super fun diving into this on consumer hardware.

ErikTromp commented 8 months ago

You should be able to use those, yes. If you modify the create threads script to read the "text_ja" column instead of "text", you should be good to go already by the looks of it.
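A minimal sketch of that column swap (assuming the Hugging Face datasets library; the exact column the create threads script reads may differ):

```python
# Load the community Japanese OASST dataset and expose the Japanese
# text under the "text" column name the scripts are assumed to read.
from datasets import load_dataset

dataset = load_dataset("kunishou/oasst2-135k-ja", split="train")
if "text" in dataset.column_names:
    dataset = dataset.remove_columns("text")  # drop the original column
dataset = dataset.rename_column("text_ja", "text")
print(dataset[0]["text"])
```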

If not: I tried madlad for translation, which seems to work well for Japanese, but it needs to be loaded quantized to work on a 16GB card/Colab.
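For reference, a sketch of loading MADLAD-400 quantized (assumes bitsandbytes 8-bit loading; the <2xx> target-language prefix is the convention the MADLAD models use):

```python
# Load google/madlad400-3b-mt in 8-bit so it fits on a ~16GB card.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/madlad400-3b-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# The target language is selected with a <2xx> prefix on the input.
inputs = tokenizer("<2ja> Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```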

fznx922 commented 8 months ago

How were you able to use the madlad dataset for it? Did you use this: allenai/MADLAD-400? Or am I overcomplicating it? I did download the new code, just not sure where to point it currently. Thank you :)

ErikTromp commented 8 months ago

It uses google/madlad400-3b-mt; be sure to use --use_madlad and optionally --madlad_quant.
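Assuming the same positional arguments as the command earlier in this thread, that would look like:

```
python translate_oasst.py ja ja 500 10 --use_madlad --madlad_quant
```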