Converts sentence classifier benchmark data into a training data checkpoint

The goal is to read the sentence classifier benchmark data in evaluate_model.py. To do this, I would like to reuse the code already written in the [sentence classifier main file]. That script, like many other parts of PyMoliere, relies on the "checkpoint format" (a directory containing a __done__ file and one pickle per partition). In particular, it relies on three checkpoints: training_data, validation_data, and testing_data.
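To make the checkpoint format concrete, here is a minimal sketch of writing and reading such a directory. It is not PyMoliere's implementation: the function names and the dict-based record are hypothetical, and the real checkpoints hold sentence_classifier.util.TrainingData objects. Only the layout (a __done__ marker plus part-N.pkl files) is taken from this PR.

```python
import pickle
import tempfile
from pathlib import Path

DONE_FILE = "__done__"  # marker name taken from the checkpoint layout


def write_mock_checkpoint(ckpt_dir, partitions):
    """Write one pickle per partition, then the __done__ marker (hypothetical helper)."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    for idx, records in enumerate(partitions):
        with open(ckpt_dir / f"part-{idx}.pkl", "wb") as f:
            pickle.dump(records, f)
    # Assumption: the marker is written last so readers never see a partial checkpoint.
    (ckpt_dir / DONE_FILE).touch()


def read_mock_checkpoint(ckpt_dir):
    """Return the list of partitions, refusing directories without a __done__ marker."""
    ckpt_dir = Path(ckpt_dir)
    if not (ckpt_dir / DONE_FILE).is_file():
        raise FileNotFoundError(f"{ckpt_dir} has no {DONE_FILE} marker")
    # Sort numerically so that part-10 comes after part-2.
    parts = sorted(ckpt_dir.glob("part-*.pkl"),
                   key=lambda p: int(p.stem.split("-")[1]))
    partitions = []
    for part in parts:
        with open(part, "rb") as f:
            partitions.append(pickle.load(f))
    return partitions


# Round trip with a single partition; the record schema is illustrative only.
example = [[{"text": "An example sentence.", "label": "abstract:methods"}]]
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "training_data"
    write_mock_checkpoint(ckpt, example)
    files = sorted(p.name for p in ckpt.iterdir())
    restored = read_mock_checkpoint(ckpt)
print(files)
```

With one partition this produces the same two-file layout as the directories listed under "Files Produced".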

The benchmark provides several subdirectories, each containing train.txt, dev.txt, and test.txt. The conversion script therefore creates each needed checkpoint from the corresponding file.
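The file-to-checkpoint correspondence can be sketched as a small lookup. The train.txt and dev.txt targets match the "Converting ..." lines in the log; mapping test.txt to testing_data is an assumption based on the produced directory names, and plan_conversion is a hypothetical helper, not a function from convert_to_classifier_input.py.

```python
from pathlib import Path

# Benchmark split file -> checkpoint directory name.
SPLIT_TO_CHECKPOINT = {
    "train.txt": "training_data",
    "dev.txt": "validation_data",
    "test.txt": "testing_data",  # assumed from the produced directory listing
}


def plan_conversion(in_data_dir, out_data_dir):
    """Pair each split file with the checkpoint directory it should become."""
    in_dir, out_dir = Path(in_data_dir), Path(out_data_dir)
    return [(in_dir / split, out_dir / ckpt)
            for split, ckpt in SPLIT_TO_CHECKPOINT.items()]


plan = plan_conversion("raw_data/PubMed_20k_RCT", "data")
for src, dst in plan:
    print(f"Converting {src.as_posix()} to {dst.as_posix()}")
```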

Testing: Converting the data

Command:

./convert_to_classifier_input.py \
  --bert_data_dir ../../data/scibert_scivocab_uncased \
  --in_data_dir raw_data/PubMed_20k_RCT \
  --out_data_dir data

Output:

Prepping embedding
Registered embedding_util:device
Registered embedding_util:tok,model
setup
Converting raw_data/PubMed_20k_RCT/train.txt to data/training_data
Converting to records.
Embedding
Initializing embedding_util:device
Initializing embedding_util:tok,model
5627it [34:24,  2.73it/s]                                                                            
Converting to sentence_classifier.util.TrainingData
Saving as mock ckpt
Converting raw_data/PubMed_20k_RCT/dev.txt to data/validation_data
... continues for validation and testing

Files Produced:

./testing_data
./testing_data/__done__
./testing_data/part-0.pkl
./validation_data
./validation_data/__done__
./validation_data/part-0.pkl
./all_data
./all_data/__done__
./training_data
./training_data/__done__
./training_data/part-0.pkl

Testing: Using the data

Command:

python3 -m pymoliere.ml.sentence_classifier \
  configs/sentence_classifier.conf \
  --custom_data_dir benchmarks/pubmed_sentence_classifier/data/

Output:

Running pymoliere sentence_classifier with the following parameters:
shared_scratch: "/scratch4/jsybran/pymoliere_scratch"
custom_data_dir: "benchmarks/pubmed_sentence_classifier/data/"

Running on local machine!
Checkpoint: all_data
Prepping model
Model -> Device
Loading Model
Loading test data from benchmarks/pubmed_sentence_classifier/data
100%|██████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
Evaluation
Generating Predictions
942it [00:00, 1103.05it/s]                                                                           
Accuracy: 0.8242575078812012
abstract:background
  - Precision: 0.71220
  - Recall:    0.78183
  - F1:        0.74539
  - Support:   3621
  - Mistakes:
    - abstract:objective 354
    - abstract:methods 220
    - abstract:conclusions 175
    - abstract:results 41
abstract:conclusions
  - Precision: 0.77010
  - Recall:    0.97025
  - F1:        0.85866
  - Support:   4571
  - Mistakes:
    - abstract:results 107
    - abstract:background 29
    - abstract:methods 0
    - abstract:objective 0
abstract:methods
  - Precision: 0.90360
  - Recall:    0.82591
  - F1:        0.86301
  - Support:   9897
  - Mistakes:
    - abstract:results 1418
    - abstract:conclusions 152
    - abstract:background 134
    - abstract:objective 19
abstract:objective
  - Precision: 0.76059
  - Recall:    0.50793
  - F1:        0.60910
  - Support:   2333
  - Mistakes:
    - abstract:background 975
    - abstract:methods 155
    - abstract:results 17
    - abstract:conclusions 1
abstract:results
  - Precision: 0.83842
  - Recall:    0.84567
  - F1:        0.84203
  - Support:   9713
  - Mistakes:
    - abstract:conclusions 996
    - abstract:methods 497
    - abstract:background 6
    - abstract:objective 0
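As a sanity check on the report above, the sketch below recomputes recall, F1, and overall accuracy from the printed numbers. It assumes that each "Mistakes" list counts examples of that class which were predicted as another label; that reading is inferred from the arithmetic, not from the evaluation code.

```python
# (precision, recall, f1, support, mistake counts) copied from the report above.
reported = {
    "abstract:background":  (0.71220, 0.78183, 0.74539, 3621, [354, 220, 175, 41]),
    "abstract:conclusions": (0.77010, 0.97025, 0.85866, 4571, [107, 29, 0, 0]),
    "abstract:methods":     (0.90360, 0.82591, 0.86301, 9897, [1418, 152, 134, 19]),
    "abstract:objective":   (0.76059, 0.50793, 0.60910, 2333, [975, 155, 17, 1]),
    "abstract:results":     (0.83842, 0.84567, 0.84203, 9713, [996, 497, 6, 0]),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_correct = 0
total_support = 0
max_recall_error = 0.0
max_f1_error = 0.0
for label, (precision, recall, f1, support, mistakes) in reported.items():
    # Assumed: correct predictions = support minus examples sent to other labels.
    correct = support - sum(mistakes)
    total_correct += correct
    total_support += support
    max_recall_error = max(max_recall_error, abs(correct / support - recall))
    max_f1_error = max(max_f1_error, abs(f1_score(precision, recall) - f1))

accuracy = total_correct / total_support
print(f"Recomputed accuracy: {accuracy:.6f}")
```

Under that assumption the recomputed values agree with the reported ones to within rounding, which suggests the per-class counts and the summary accuracy are internally consistent.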
