
Training PRO extension for oobabooga WebUI - recent dev version
MIT License

Training_PRO

This is an expanded and reworked Training tab - the latest development version, maintained by FP


Repo home:

https://github.com/FartyPants/Training_PRO

In general the repo is WAY ahead (could be a few months) of the Training PRO extension included in text WebUI. The idea is to keep the extension supplied with the WebUI well tested and stable, while the repo version adds many experimental features that may change shape in the coming weeks.

Installation:

Since a stable version of Training PRO is included in WebUI, to avoid issues with WebUI updates, put this repo in a Training_PRO_wip folder and use Training_PRO_wip in Session instead of the supplied Training PRO that comes with WebUI.

Clone the repo into your extensions folder as Training_PRO_wip:

cd text-generation-webui\extensions\
git clone https://github.com/FartyPants/Training_PRO Training_PRO_wip

Now use Training_PRO_wip in Session instead of the supplied Training PRO.


Blog post

How to convert JSON to JSONL to be used with jinja embedded templates https://ko-fi.com/post/Convert-JSON-to-JSONL-in-Training-PRO-W7W2100SMQ

News

July 2024

February 2024


Features/Changes from Main Training in WebUI

Notes:

This uses its own chunking code for raw text, based on sentence splitting. This avoids weird cuts in the chunks: each chunk should now start with a sentence and end on a sentence. It works hand in hand with Hard Cut. A proper use is to structure your text into logical blocks (ideas) separated by three \n, then use three \n in Hard Cut. This way each chunk will contain only one flow of ideas and not derail in the thoughts. And the overlapping code will create overlapped blocks on a sentence basis too, but not across a hard cut, thus not across different ideas either. Does it make any sense? No? Hmmmm...
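For intuition, here is a minimal sketch of the chunking idea, assuming sentences split on end punctuation (this is not the extension's actual code; it measures characters instead of tokens and omits the overlap logic for brevity):

import re

def chunk_text(text, max_len, hard_cut="\n\n\n"):
    # Split on the hard cut first, so a chunk never mixes separate idea blocks.
    chunks = []
    for part in text.split(hard_cut):
        sentences = re.split(r"(?<=[.!?])\s+", part.strip())
        current = ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > max_len:
                chunks.append(current)     # close the chunk at a sentence end
                current = sentence         # start the next chunk on a sentence
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append(current)
    return chunks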

Custom schedulers

A bunch of custom (combination) schedulers are added to the LR schedule. These are based on my own experiments.

FP_low_epoch_annealing

Uses constant LR (with warmup) for 1 epoch only. The rest of the epochs is cosine annealing. So with 10 epochs, 1 will be constant and 9 will be a nose dive down. However, a typical usage would be 2 epochs (hence "low epoch" in the name): the first is constant, the second is annealing. Simple. I use it 90% of the time.
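A minimal sketch of this curve in PyTorch, written as a LambdaLR multiplier (an illustration of the described shape, not the extension's actual implementation):

import math
from torch.optim.lr_scheduler import LambdaLR

def fp_low_epoch_annealing(optimizer, warmup_steps, steps_per_epoch, total_steps):
    # Linear warmup, then constant LR for the first epoch,
    # then cosine annealing down to zero for the remaining steps.
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        if step < steps_per_epoch:
            return 1.0
        progress = (step - steps_per_epoch) / max(1, total_steps - steps_per_epoch)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)

FP_half_time_annealing below is the same shape, with the constant phase ending at total_steps // 2 instead of at the end of the first epoch.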

FP_half_time_annealing

Like the low epoch variant, but the total number of steps is divided by 2. The first half is constant, the second half is annealing. So with 10 epochs, 5 will be constant and 5 will be a cosine nose dive.

FP_raise_fall_creative

This is a sine raise till half of the total steps, then a cosine fall for the rest. (Or you may think of the curve as a sine in its entirety.) The most learning is done in the hump, in the middle. The warmup entry has no effect, since the sine is automatically a warmup. The idea is to start very mildly so as not to overfit on the first blocks of the dataset. It seems to broaden the scope of the model, making it less strict for a tight dataset.
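Since the whole curve is one sine hump, a sketch (again illustrative, not the extension's code) is tiny:

import math
from torch.optim.lr_scheduler import LambdaLR

def fp_raise_fall_creative(optimizer, total_steps):
    # One half-sine over the whole run: rises to the peak at the midpoint,
    # falls back to zero at the end, and acts as its own warmup.
    def lr_lambda(step):
        return math.sin(math.pi * step / max(1, total_steps))
    return LambdaLR(optimizer, lr_lambda)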

Targets

A normal LoRA targets q and v, and that's what you should use. You can use (q k v o) or (q k v), which gives you a lot more trainable parameters. The benefit is that you can keep the rank lower and still attain the same coherency as q v with a high rank. Guanaco, for example, was trained with QLoRA on q k v o, and they swear by it.
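In PEFT terms, the two choices map to something like this (a sketch; the q_proj/k_proj/v_proj/o_proj module names assume a LLaMA-style architecture, and the ranks are just illustrative):

from peft import LoraConfig

# Classic q v targets, compensating with a higher rank...
qv_config = LoraConfig(r=64, lora_alpha=128,
                       target_modules=["q_proj", "v_proj"])

# ...versus q k v o with a lower rank but more trainable modules.
qkvo_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])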

DEMENTOR LEARNING (experimental) Deep Memorization Enforcement Through Overlapping and Repetition

This is an experimental chunking method to train long-form text in a low number of epochs (basically 1) with sliding repetition. The depth of learning directly depends on the cutoff_length. Increasing the cutoff length will also increase the number of blocks created from long-form text (which is contrary to normal training). It is based on my own wild experiments.

Getting rid of batch size and micro batch size

Keeping consistency with everyone else.

Listen, there is only ONE batch size - the True batch size (previously called micro-batch size in WebUI) - this is how many blocks are processed at once (during a single step). It eats GPU, but it really helps with the quality of training (in fact the ideal batch size would be the same as the number of blocks - which is unrealistic) - so the idea is to cram in as much True Batch Size as you can before your GPU blows with OOM. On 24GB this is about 10 for a 13b model (loaded in 4-bit).

So no micro batch size - it is now called True Batch Size, because that's what it is.

The other thing is Gradient Accumulation - this is an emulation of the above batch size - a virtual batch size, if you will. If your GPU can't handle the real batch size, then you may fake it using Gradient Accumulation. This will accumulate the gradients over as many steps as defined here and only then update the weights, without an increase in GPU memory. Gradient accumulation is like a virtual batch size multiplier without the GPU penalty.

If your batch size is 4 and your gradient accumulation is 2, then it sort of behaves as if we have a batch size of 8. Sort of, because batch size 4 with GA 2 is NOT the same as batch size 2 with GA 4. (It produces different weights - hence it's not an equivalent.) The idea is that if you don't have the GPU, using GA to extend the batch size is the next best thing (good enough), since you have no other choice.
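Mechanically, the trick looks like this generic PyTorch training loop (a sketch of the concept, not the extension's code; it assumes model, loader, and optimizer are already set up):

accum_steps = 2                                # Gradient Accumulation
for i, batch in enumerate(loader):             # loader yields True-Batch-Size batches
    loss = model(**batch).loss / accum_steps   # scale so the summed grads average out
    loss.backward()                            # gradients accumulate across calls
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one weight update per accum_steps batches
        optimizer.zero_grad()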

If all you can afford is a batch size of 1, then increasing GA will likely make the learning better in some range of GA (it's not always the case that more is better).

However - GA is not some golden goose. As said, it isn't the same as batch size. In fact GA may worsen your learning as well.

I would suggest a series of experiments where you put the batch size as high as possible without OOM, set GA to 1, then repeat the training while increasing the GA (2, 4...), and see how the model changes. It's likely that it will follow some sort of curve where GA seems to help before it makes things worse. Some people believe that if you can squeeze in a batch size of 6, then you should not bother with GA at all... YMMV.

High Batch Size vs High GA would also likely produce different results in terms of learning words vs style. How? Hmmmm... good question.

One optical "benefit" of GA is that the loss will fluctuate less (because of all the gradient accumulation, which works as a form of noise smoothing as well).

Eliminating bad blocks

If you use JSON and a block is longer than the Maximum context length (Cutoff), by default it will be trimmed to the Maximum context length (Cutoff). That's the default behavior. While this may work in some cases where the end of the block is not much more important than the beginning, in other cases it may create a really bad situation. Imagine you are training a text labeling system on something like this:

USER: Determine the type of text: ... some long text...

ASSISTANT: Poetry

In such a case, trimming the block will probably cut off the entire answer, making the block useless. Not only that, it also skews the whole functionality, where the model may learn that entering a long text means no answer will be given.

The Eliminate cutoff blocks option will simply not use such a block at all. No block is much preferable to a bad block.

This option applies only to JSON datasets, for obvious reasons.

Also watch the terminal window to see how many blocks were eliminated - if it is too many, then you need to increase the Maximum context length (Cutoff).
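The trim-versus-eliminate choice boils down to something like this (an illustrative sketch with made-up names, not the extension's code; the tokenizer is assumed to be a Hugging Face tokenizer):

def build_blocks(samples, tokenizer, cutoff_len, eliminate_cutoff_blocks):
    blocks = []
    dropped = 0
    for text in samples:
        ids = tokenizer(text)["input_ids"]
        if len(ids) > cutoff_len:
            if eliminate_cutoff_blocks:
                dropped += 1              # skip the whole block
                continue
            ids = ids[:cutoff_len]        # default: trim (may cut off the answer)
        blocks.append(ids)
    print(f"Eliminated {dropped} blocks longer than {cutoff_len} tokens")
    return blocks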

Group Samples by Length

Makes the training more efficient, as it will group blocks of similar size into one batch. The effect can be observed on the loss graph, because the loss will oscillate much more. This is due to the fact that different block lengths have different loss, and when they are grouped the effect is amplified.
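Conceptually the grouping is just a sort before batching (a sketch of the idea, not the extension's code):

def group_by_length(blocks, batch_size):
    # Sort blocks by token count so each batch holds similar-length blocks,
    # minimizing the padding wasted when short blocks land in long batches.
    ordered = sorted(blocks, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]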


JSONL datasets

These are the new type of datasets that have roles defined. They expect a jinja template. The format is:

[
  {
    "messages": [
      {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
      {"role": "user", "content": "What's the capital of France?"},
      {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}
    ]
  },

The format will be chosen automatically from the chat template embedded in the tokenizer metadata. If no format is specified (legacy), then the jinja instruction template specified in WebUI will be used.
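For reference, this is how one "messages" entry gets rendered through a chat template embedded in the tokenizer metadata, using the transformers API (a sketch; the model id is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your/model")  # placeholder model id
messages = [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
]
# Renders the conversation with the tokenizer's embedded jinja chat template.
text = tok.apply_chat_template(messages, tokenize=False)
print(text)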

Hybrid training

Did you ever wonder what would happen if you trained partially on raw text and also on an instruct dataset? Now you can, with hybrid training. Simply select both files - a dataset (and format) and a raw text file. And go! What this will give you - IDK. Experiment. But in general it can stylize instruct responses with the writing style of the raw text file. (Of course the correct ratio matters!) Or do some other damage. Now go and test it.