Closed fznx922 closed 8 months ago
This is to be expected. We first take all messages of one source language (say, English) together and translate those in batches. Once done, that vRAM is freed and we proceed with the next source language (say, Spanish). You will see the translation models being loaded in console and brief releases of vRAM.
As long as you keep getting new checkpoint files, you are good.
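The per-language batching described above can be sketched roughly like this. This is an illustrative sketch, not the repo's actual code: `load_model` and `translate_batch` stand in for whatever model-loading and inference calls the script really uses.

```python
from collections import defaultdict

def translate_by_source_language(messages, load_model, translate_batch, batch_size=20):
    """Group messages by source language, then translate each group in
    batches so only one model is resident in vRAM at a time."""
    by_lang = defaultdict(list)
    for msg in messages:
        by_lang[msg["lang"]].append(msg["text"])

    results = {}
    for lang, texts in by_lang.items():
        model = load_model(lang)  # model loaded only for this source language
        translated = []
        for i in range(0, len(texts), batch_size):
            translated.extend(translate_batch(model, texts[i:i + batch_size]))
        results[lang] = translated
        del model  # release vRAM before moving on to the next language
    return results
```

The brief vRAM dips you see in the console correspond to the `del model` step between languages.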
Just to point out: the language code that OPUS uses for Japanese seems to be "jap" instead of "jp".
Hey Erik, thanks for the reply! Originally I had used "jap" as my code but I was running into this error, so I tried "jp" on the second pass to check, and it was still doing it. Progressing through the dataset is fine; the back and forth happens on that last step of generation. It does the first one, then just goes up and down without seeming to process. I left it for half an hour and it was still loading and unloading the RAM in a loop, it seemed. While loading during the run it would only use maybe 2GB or so, but that last step pulls all 12GB plus additional RAM and doesn't progress. Hopefully I have explained that well enough, as it just doesn't seem to want to move forward. Could my batch size be too high for my card?
Also to add: scanning through the dataset JSON it had made, it seems to be blank, not containing anything. Is that the default behavior?
thank you for your help :)
Well, it does them one language at a time, and usually there is a large portion of "rare" languages that only have a few records each, so it needs to load and unload models quite frequently, but it should never stall. The empty arrays only result when no translation is possible, not even through English.
I will see if I can run Japanese some time this or next week to debug.
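The "not even through English" fallback mentioned above could look roughly like this. A hedged sketch, not the script's actual logic: `pair_available` and `translate` are hypothetical stand-ins for the real OPUS model lookup and inference.

```python
def translate_with_pivot(text, src, tgt, pair_available, translate):
    """Try a direct OPUS pair first; otherwise pivot through English.
    Returns [] when neither route exists, which would explain the
    empty arrays seen in the output JSON."""
    if pair_available(src, tgt):
        return [translate(text, src, tgt)]
    if pair_available(src, "en") and pair_available("en", tgt):
        return [translate(translate(text, src, "en"), "en", tgt)]
    return []
```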
I do see the Japanese models are a bit larger than average - have you tried using a batch size of 10 instead of 20? vRAM usage seems to go just over 12GB before stabilizing, after some GC, at exactly 11.7GB. So with 12GB of vRAM you might just be hitting a language pair where going through English goes a tad over (in which case you should get a CUDA OoM error rather than the livelock you are describing, though).
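One generic way to handle a batch that is slightly too large for vRAM is to catch the out-of-memory error and retry with a smaller batch, instead of stalling. This is a hypothetical helper, not part of the script; the real code would catch `torch.cuda.OutOfMemoryError` rather than the `MemoryError` used here as a placeholder.

```python
def translate_with_backoff(texts, translate_batch, batch_size, oom_error=MemoryError):
    """Translate `texts` in chunks; on an out-of-memory error,
    halve the batch size and retry from the start."""
    while batch_size >= 1:
        try:
            out = []
            for i in range(0, len(texts), batch_size):
                out.extend(translate_batch(texts[i:i + batch_size]))
            return out
        except oom_error:
            batch_size //= 2  # e.g. 20 -> 10 -> 5 ...
    raise RuntimeError("even batch size 1 does not fit in vRAM")
```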
I am running "jap" now, seems like both "ja" and "jap" are Japanese. I see vRAM going up to 14.6GB but I am not seeing any errors yet, going strong on a 16GB GPU with 20 batch size. I will let it run and publish the datasets once finished.
So I've tried again overnight. When I left it, it was running fine on batch = 10 and kept my vRAM usage at 8GB; waking up this morning, it's now at a crawl and usage has spiked again. It seems like it has progressed, but from a certain point it has become really, really slow.
Here's another screenshot. I did note in your original post that this would take a long time to produce; is that the case and I just need to wait for now?
With this run, looking at the JSON files, there actually seems to be data in there now, so I guess it must have been something on my end.
Yeah, 1-1.5s/it is pretty normal. I noticed when doing "jap" (almost done btw) that EN-JAP is significantly faster. I don't know why, but I doubt it improves performance as it probably has to do with an underfit model.
I added the base translations for you, so you should be able to take it from there. PS: I do think the quality is sub-par, see also this issue.
Thanks for the reply, mate! It's quite interesting. Mainly I'm trying to use this method to pull up a model's coherence for JP > EN translation. While looking on HF I found some oasst2 Japanese datasets; is it possible to use something like these, for example, to feed into it?
https://huggingface.co/datasets/kunishou/oasst2-135k-ja https://huggingface.co/datasets/kunishou/oasst1-89k-ja
I do see some LLMs have previously been trained on the second one.
Thanks for all your efforts though, it has been super fun diving into this on consumer hardware.
You should be able to use those, yes. If you modify the create-threads script to read the "text_ja" column instead of "text", you should be good to go already by the looks of it.
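The column swap could be done roughly like this. A sketch under assumptions: `remap_text_column` is a hypothetical helper, and the exact column names in the kunishou datasets should be verified before use.

```python
def remap_text_column(rows, column="text_ja"):
    """Return copies of `rows` with the translated column exposed as
    "text", the column the create-threads script expects to read."""
    return [dict(row, text=row[column]) for row in rows]

if __name__ == "__main__":
    # Loading the actual dataset would look roughly like this
    # (requires the `datasets` package):
    from datasets import load_dataset
    rows = load_dataset("kunishou/oasst2-135k-ja", split="train")
    threads_input = remap_text_column(rows, column="text_ja")
```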
If not: I tried madlad for translation, seems to work well for Japanese but it needs to be loaded quantized to work on a 16GB card/Colab.
How were you able to use the madlad dataset for it? Did you use allenai/MADLAD-400, or am I overcomplicating it? I did download the new code set, just not sure where to point it currently. Thank you :)
It uses google/madlad400-3b-mt, be sure to use --use_madlad and optionally --madlad_quant
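Quantized loading of that model could look roughly like the sketch below. This is an assumption about what --use_madlad with --madlad_quant might do, not the script's actual implementation; it requires the transformers, accelerate and bitsandbytes packages.

```python
def madlad_input(text, target_lang):
    """MADLAD-400 expects the target language as a "<2xx>" token
    prefixed to the source text (the source language is detected)."""
    return f"<2{target_lang}> {text}"

if __name__ == "__main__":
    # Load google/madlad400-3b-mt 8-bit quantized so it fits a 16GB card.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

    name = "google/madlad400-3b-mt"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        name,
        device_map="auto",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )

    inputs = tokenizer(madlad_input("Good morning", "ja"), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```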
Hey there, not sure if it's a configuration issue on my end, but when trying to create a Japanese dataset it comes up to the end of the run and starts loading up all my vRAM, goes until it can't fit any more, then dumps and starts again. Not sure if that's normal behavior? Should I just leave it?
Running command: python translate_oasst.py ja ja 500 20
Screenshot of the behavior attached.