bdzyubak / torch-control

A top-level repo for evaluating natively available models
MIT License

Evaluate - Fine tuning the entire LLM network vs default classifier head vs bigger head #31

Closed bdzyubak closed 4 months ago

bdzyubak commented 4 months ago

The typical approach is to freeze all layers except for a small default head, e.g. 768 channels for DistilBERT. For this issue, test how training the full network (assuming it fits in memory) compares to training the default head and to a larger custom head.

Metrics: validation accuracy, training accuracy, training time.
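
For reference, a minimal sketch of the freeze/unfreeze toggle being compared, assuming the Hugging Face `DistilBertForSequenceClassification` class; the `set_trainable` helper is illustrative, not the repo's actual code:

```python
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def set_trainable(model, full_network: bool) -> None:
    """Freeze the DistilBERT body for head-only training, or unfreeze it for a full fine-tune."""
    for param in model.distilbert.parameters():
        param.requires_grad = full_network
    # The default head (pre_classifier 768->768, classifier 768->num_labels) always stays trainable.
    for param in model.pre_classifier.parameters():
        param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

set_trainable(model, full_network=False)   # head-only baseline
# set_trainable(model, full_network=True)  # full-network fine-tune for comparison
```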

bdzyubak commented 4 months ago

Training the full network takes 46 minutes and yields 0.88 validation accuracy on binarized positive/negative sentiments. Script: projects\NaturalLanguageProcessing\MovieReviewAnalysis\fine_tune_on_kaggle_movie_sentiment.py. Commit: f0367f3c76cf38c790af43c25abc059d00557ba9

train_acc: 98.6%, val_acc: 86.1%, time: 46.7 min


bdzyubak commented 4 months ago

Training a larger classifier head (768x768x768x1) takes about the same amount of time (other processes were running in parallel, so it is doubtful that it is actually slower than training the full network) and yields inferior accuracy compared to the full network.

train_acc: 83.7%, val_acc: 83.4%, time: 50.7 min
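
For illustration, a minimal sketch of how a 768x768x768x1 head could be swapped in for the default one; the `BiggerHead` module, the single-logit output, and the head-swap mechanism are assumptions about the implementation, not a copy of the repo's code:

```python
import torch.nn as nn
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

class BiggerHead(nn.Module):
    """Three stacked linear layers (768 -> 768 -> 768 -> 1) with ReLU and no dropout."""
    def __init__(self, hidden: int = 768, out_features: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            # Single logit for the binarized sentiment target (assumed); this implies a
            # custom training loop with BCEWithLogitsLoss rather than the built-in HF loss.
            nn.Linear(hidden, out_features),
        )

    def forward(self, x):
        return self.net(x)

# Replace the default two-layer head; the transformer body stays frozen.
model.pre_classifier = nn.Identity()
model.classifier = BiggerHead()
for param in model.distilbert.parameters():
    param.requires_grad = False
```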


bdzyubak commented 4 months ago

Default head (768x768, ~76k parameters): train_acc: 85.1%, val_acc: 84.6%, time: 50.2 min


bdzyubak commented 4 months ago

Adding 10 additional layers does not improve accuracy. Variability between reruns appears to be about 2%.

bdzyubak commented 4 months ago

Overall, training with the original fine-tune head provides the same accuracy as increasing the size of the head by a factor of 10x to ~3M parameters. Training the entire network with 60M parameters provides the highest train accuracy and the highest validation accuracy, with checkpointing successfully handling the overfitting.

Conclusion: train the full network with checkpointing on val accuracy.
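
A minimal sketch of checkpointing on validation accuracy; `train_one_epoch` and `evaluate` are hypothetical placeholders for the actual training and validation steps:

```python
import torch

def fit(model, train_loader, val_loader, optimizer, num_epochs, ckpt_path="best_model.pt"):
    """Train for num_epochs and keep only the weights from the epoch with the best val accuracy."""
    best_val_acc = 0.0
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
        val_acc = evaluate(model, val_loader)            # hypothetical helper
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), ckpt_path)
    # Restore the best-epoch weights rather than the (overfit) final ones.
    model.load_state_dict(torch.load(ckpt_path))
    return best_val_acc
```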

Note: it is surprising that fine-tuning heads with 300k and 6M parameters in multiple linear layers with activation and 0 dropout results in the same 0.83-0.85 accuracy, while training the full network does improve train accuracy to 0.98. This appears to indicate that some weights in the early layers are suboptimal and need to be updated for best performance; keeping them frozen at their pretrained values loses input information to the point where it is not recoverable by the deeper head layers.