AyanSinhaMahapatra closed this issue 3 years ago
Preliminary Results on Sentence Classification with BERT/ERNIE (Pre-Trained Weights Fine-Tuned on ScanCode Rule Texts)
Separating False Positives from License Tags
After 4 epochs of fine-tuning with learning rate 2e-5 (6-7 secs per epoch on an RTX 2060)
Results - Detects code fragments with very high accuracy and confidence; some inaccuracies remain on non-code false positives.
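For reference, a minimal sketch of what this binary fine-tuning step looks like with the ernie wrapper linked below. The `SentenceClassifier`/`fine_tune` calls follow the labteral/ernie README; the data-prep helper and the label convention (0 = license tag, 1 = false positive) are illustrative assumptions, not the actual training script.

```python
# Sketch: fine-tune BERT to separate false positives from license tags.
# ernie calls per the labteral/ernie README; the label convention
# (0 = license tag, 1 = false positive) is an illustrative assumption.

def label_examples(tag_texts, false_positive_texts):
    """Pair each rule text with a binary label: 0 = tag, 1 = false positive."""
    return [(t, 0) for t in tag_texts] + [(t, 1) for t in false_positive_texts]


def fine_tune_false_positive_classifier(examples, epochs=4, lr=2e-5):
    """Fine-tune a BERT sentence classifier on (text, label) pairs."""
    import pandas as pd
    from ernie import SentenceClassifier, Models  # https://github.com/labteral/ernie

    classifier = SentenceClassifier(
        model_name=Models.BertBaseUncased, max_length=128, labels_no=2)
    classifier.load_dataset(pd.DataFrame(examples), validation_split=0.2)
    classifier.fine_tune(epochs=epochs, learning_rate=lr,
                         training_batch_size=32, validation_batch_size=64)
    return classifier
```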
Determining License Class (Text/Notice/Tag/Reference) of Generated Rule Texts
After 12 epochs of fine-tuning with learning rate 2e-5 (60 secs per epoch on an RTX 2060)
Results - Good results given the nuance of license cases, though the inaccuracies need more review. Assuming the model generalizes well beyond the training set, this would also help with https://github.com/nexB/scancode-toolkit/issues/2162 by detecting faulty tags in .yml files.
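A small sketch of the four-way license-class side of this. The class names come from the issue (Text/Notice/Tag/Reference); the integer encoding and the argmax decoding of the classifier's probability vector are assumptions about how the labels would be wired up, not the actual code.

```python
# Sketch for the four-way license-class task. Class names come from the
# issue; the integer encoding and argmax decoding are assumptions.

LICENSE_CLASSES = ("text", "notice", "tag", "reference")


def encode_class(name):
    """Map a license class name to the integer label fed to the classifier."""
    return LICENSE_CLASSES.index(name.lower())


def decode_prediction(probabilities):
    """Map the classifier's probability vector back to a class name."""
    best = max(range(len(LICENSE_CLASSES)), key=lambda i: probabilities[i])
    return LICENSE_CLASSES[best]
```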
Provided the models can be made to generalize better (they currently overfit the training cases slightly), the remaining tasks are:
@pombredanne @majurg Update: I will create a PR with the code today.
Ernie - https://github.com/labteral/ernie
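For completeness, a hedged sketch of how a fine-tuned model could be used at scan time. The `SentenceClassifier(model_path=...)` loading and `predict_one` calls follow the labteral/ernie README; the model path, the label order (index 1 = false positive), and the 0.5 threshold are assumptions for illustration.

```python
# Sketch: using a fine-tuned ernie classifier at scan time. Load/predict
# calls per the labteral/ernie README; the model path, label order
# (index 1 = false positive), and the 0.5 threshold are assumptions.

def is_false_positive(probabilities, threshold=0.5):
    """Flag a detection as a false positive from the classifier output."""
    return probabilities[1] >= threshold


def classify_detections(texts, model_path="./fp-classifier"):
    """Load a previously dumped classifier and flag likely false positives."""
    from ernie import SentenceClassifier  # https://github.com/labteral/ernie

    classifier = SentenceClassifier(model_path=model_path)
    return [(t, is_false_positive(classifier.predict_one(t))) for t in texts]
```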
Use Cases -
Steps -