aboutcode-org / scancode-analyzer

scancode-results-analyzer

Sentence Classification using BERT/ERNIE #10

Closed AyanSinhaMahapatra closed 3 years ago

AyanSinhaMahapatra commented 4 years ago

ERNIE - https://github.com/labteral/ernie

Use Cases -

  1. Determining License Class (Text/Notice/Tag/Reference) of Generated Rule Texts
  2. Separating False Positives from License Tags.

Steps -

  1. Fine-Tune BERT Models on License Texts
  2. Train Models on ScanCode Rules (for Determining License Class)
  3. Train Models on ScanCode Rules (for Separating False Positives from License Tags)
  4. Model Validation and Comparing Hyperparameters
  5. Reviewing Results
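The data preparation behind the steps above can be sketched as follows. This is a minimal illustration, not the project's actual training code: the sample texts and labels are hypothetical, and only the 90%/10% train/validation split mirrors the splits reported in the results below.

```python
import random

def train_validation_split(samples, validation_fraction=0.1, seed=42):
    """Shuffle (text, label) pairs and hold out a validation slice,
    mirroring the 90%/10% split used in the fine-tuning runs."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rule texts with binary labels:
# 1 = real license tag, 0 = false positive (e.g. a code fragment).
samples = [
    ("SPDX-License-Identifier: MIT", 1),
    ("License: Apache-2.0", 1),
    ("import license_checker", 0),
    ("licensed = True", 0),
] * 25  # pad out to 100 samples for a meaningful split

train, validation = train_validation_split(samples)
print(len(train), len(validation))  # 90 10
```

The held-out validation slice is what the "accuracy on the validation data (10%)" figures below are measured against.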
AyanSinhaMahapatra commented 4 years ago

Preliminary Results on sentence classification with BERT/ERNIE (pre-trained weights fine-tuned on ScanCode rule texts)

  1. Separating False Positives from License Tags.

    • Model - BertBaseUncased (Weights 0.5 GB)
    • Sentence Length - 8
    • Labels - 2 (False Positive/License Tag)

    After 4 Epochs of Fine-Tuning with learning rate 2e-5 (6-7 secs each on an RTX 2060)

    • accuracy on the training data (90%): 0.9948
    • accuracy on the validation data (10%): 0.9479

Results - The model detects code fragments with very high accuracy and confidence; it is less accurate on non-code false positives.
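The "Sentence Length - 8" setting above caps each input at 8 tokens, which comfortably fits short license tags. A crude stand-in for that truncation (the real model uses BERT's WordPiece tokenizer, not whitespace splitting) looks like this:

```python
def truncate_tokens(text, max_length=8):
    """Split on whitespace and keep at most max_length tokens,
    approximating the fixed sentence length used for fine-tuning.
    (BERT's actual WordPiece tokenizer splits words further.)"""
    return text.split()[:max_length]

print(truncate_tokens("SPDX-License-Identifier: MIT"))
# ['SPDX-License-Identifier:', 'MIT']

# Longer code fragments get cut to the first 8 tokens:
print(truncate_tokens("some long code fragment that keeps going past eight tokens"))
```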

  2. Determining License Class (Text/Notice/Tag/Reference) of Generated Rule Texts

    • Model - BertBaseUncased (Weights 0.5 GB)
    • Sentence Length - 16
    • Labels - 4 (License Text/Notice/Tag/Reference)

    After 12 Epochs of Fine-Tuning with learning rate 2e-5 (60 secs each on an RTX 2060)

    • accuracy on the training data (90%): 0.9604
    • accuracy on the validation data (10%): 0.8398

Results - Good results given the nuance of license classification, though the inaccuracies need further review. Assuming the model generalizes well beyond the training set, this could also help with https://github.com/nexB/scancode-toolkit/issues/2162 by detecting faulty tags in .yml files.
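For the four-way task, each rule needs a single class label derived from its type. A sketch of that mapping, assuming boolean rule-type flags named `is_license_text`/`notice`/`tag`/`reference` (check the actual ScanCode rule schema before relying on these names):

```python
# Class labels for the four-way classification task.
LABELS = ["license-text", "license-notice", "license-tag", "license-reference"]

def rule_label(rule):
    """Map a rule's (assumed) boolean type flags to an integer label.
    Exactly one flag is expected to be set per rule."""
    flags = [
        rule.get("is_license_text", False),
        rule.get("is_license_notice", False),
        rule.get("is_license_tag", False),
        rule.get("is_license_reference", False),
    ]
    if sum(flags) != 1:
        raise ValueError("expected exactly one rule-type flag to be set")
    return flags.index(True)

label = rule_label({"is_license_notice": True})
print(LABELS[label])  # license-notice
```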

The model is slightly overfitting the training data, so the next tasks focus on improving generalization.

AyanSinhaMahapatra commented 4 years ago

@pombredanne @majurg Update. I will create a PR with the Code today.