AyanSinhaMahapatra closed this issue 3 years ago
Preliminary Results on Sentence Classification with BERT/ERNIE (Pre-Trained Weights Fine-Tuned on ScanCode Rule Texts)
Separating False Positives from License Tags
After 4 epochs of fine-tuning with learning rate 2e-5 (6-7 secs per epoch on an RTX 2060)
Results - Detects code fragments with very high accuracy and confidence; some inaccuracies remain on non-code false positives.
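For reference, a minimal sketch of what this binary fine-tuning step looks like with the ernie wrapper linked below. The `SentenceClassifier`/`fine_tune` calls follow the labteral/ernie README; the data-prep helper and the label convention (0 = license tag, 1 = false positive) are illustrative assumptions, not the actual training script.

```python
# Sketch: fine-tune BERT to separate false positives from license tags.
# ernie calls per the labteral/ernie README; the label convention
# (0 = license tag, 1 = false positive) is an illustrative assumption.

def label_examples(tag_texts, false_positive_texts):
    """Pair each rule text with a binary label: 0 = tag, 1 = false positive."""
    return [(t, 0) for t in tag_texts] + [(t, 1) for t in false_positive_texts]


def fine_tune_false_positive_classifier(examples, epochs=4, lr=2e-5):
    """Fine-tune a BERT sentence classifier on (text, label) pairs."""
    import pandas as pd
    from ernie import SentenceClassifier, Models  # https://github.com/labteral/ernie

    classifier = SentenceClassifier(
        model_name=Models.BertBaseUncased, max_length=128, labels_no=2)
    classifier.load_dataset(pd.DataFrame(examples), validation_split=0.2)
    classifier.fine_tune(epochs=epochs, learning_rate=lr,
                         training_batch_size=32, validation_batch_size=64)
    return classifier
```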
Determining License Class (Text/Notice/Tag/Reference) of Generated Rule Texts
After 12 epochs of fine-tuning with learning rate 2e-5 (60 secs per epoch on an RTX 2060)
Results - Good results given the nuance of license cases, though the inaccuracies need more review. Assuming the model generalizes well beyond the training set, this would also help with https://github.com/nexB/scancode-toolkit/issues/2162 by detecting faulty tags in .yml files.
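A small sketch of the four-way license-class side of this. The class names come from the issue (Text/Notice/Tag/Reference); the integer encoding and the argmax decoding of the classifier's probability vector are assumptions about how the labels would be wired up, not the actual code.

```python
# Sketch for the four-way license-class task. Class names come from the
# issue; the integer encoding and argmax decoding are assumptions.

LICENSE_CLASSES = ("text", "notice", "tag", "reference")


def encode_class(name):
    """Map a license class name to the integer label fed to the classifier."""
    return LICENSE_CLASSES.index(name.lower())


def decode_prediction(probabilities):
    """Map the classifier's probability vector back to a class name."""
    best = max(range(len(LICENSE_CLASSES)), key=lambda i: probabilities[i])
    return LICENSE_CLASSES[best]
```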
Provided the models can be made to generalize better (they currently overfit the training cases slightly), the remaining tasks are:
@pombredanne @majurg Update: I will create a PR with the code today.
Ernie - https://github.com/labteral/ernie
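For completeness, a hedged sketch of how a fine-tuned model could be used at scan time. The `SentenceClassifier(model_path=...)` loading and `predict_one` calls follow the labteral/ernie README; the model path, the label order (index 1 = false positive), and the 0.5 threshold are assumptions for illustration.

```python
# Sketch: using a fine-tuned ernie classifier at scan time. Load/predict
# calls per the labteral/ernie README; the model path, label order
# (index 1 = false positive), and the 0.5 threshold are assumptions.

def is_false_positive(probabilities, threshold=0.5):
    """Flag a detection as a false positive from the classifier output."""
    return probabilities[1] >= threshold


def classify_detections(texts, model_path="./fp-classifier"):
    """Load a previously dumped classifier and flag likely false positives."""
    from ernie import SentenceClassifier  # https://github.com/labteral/ernie

    classifier = SentenceClassifier(model_path=model_path)
    return [(t, is_false_positive(classifier.predict_one(t))) for t in texts]
```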
Use Cases -
Steps -