The goal is to read the sentence classifier benchmark data in evaluate_model.py. To do this, I would like to reuse the code already written in the [sentence classifier main file]. That script, like many other parts of PyMoliere, relies on the "checkpoint format" (a directory containing a __done__ file and one pickle per partition). In particular, it relies on three checkpoints: training_data, validation_data, and test_data.
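To make the checkpoint format concrete, here is a minimal sketch of writing and reading such a directory. Only the __done__ marker and the one-pickle-per-partition layout come from the description above; the helper names (`write_mock_ckpt`, `read_mock_ckpt`) and the `part-00000.pkl` file naming are assumptions for illustration and may not match what PyMoliere actually does.

```python
import pickle
from pathlib import Path

DONE_FILE = "__done__"  # marker file name, as described in the text


def write_mock_ckpt(ckpt_dir, partitions):
    """Write one pickle per partition, then the __done__ marker."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    for idx, records in enumerate(partitions):
        # Hypothetical naming scheme; zero-padded so sorting keeps order.
        with open(ckpt_dir / f"part-{idx:05d}.pkl", "wb") as f:
            pickle.dump(records, f)
    # Write the marker last so a partially written directory is never
    # mistaken for a finished checkpoint.
    (ckpt_dir / DONE_FILE).touch()


def read_mock_ckpt(ckpt_dir):
    """Return the list of partitions, refusing unfinished checkpoints."""
    ckpt_dir = Path(ckpt_dir)
    if not (ckpt_dir / DONE_FILE).is_file():
        raise FileNotFoundError(f"{ckpt_dir} has no {DONE_FILE} marker")
    partitions = []
    for path in sorted(ckpt_dir.glob("part-*.pkl")):
        with open(path, "rb") as f:
            partitions.append(pickle.load(f))
    return partitions
```

Writing the marker only after every partition is on disk lets a reader tell a complete checkpoint from one that was interrupted partway through.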
The benchmark provides various subdirectories, each containing train.txt, dev.txt, and test.txt. Therefore, this conversion script creates the needed checkpoints from each corresponding file.
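The file-to-checkpoint correspondence can be sketched as a simple lookup. The train.txt and dev.txt pairings match the conversion log in the testing section; the test.txt to test_data pairing is inferred from the checkpoint names above. The function name `plan_conversions` is hypothetical and is not taken from convert_to_classifier_input.py.

```python
from pathlib import Path

# Benchmark file name -> checkpoint directory name.
SPLIT_TO_CKPT = {
    "train.txt": "training_data",
    "dev.txt": "validation_data",
    "test.txt": "test_data",  # inferred pairing, see the note above
}


def plan_conversions(in_data_dir, out_data_dir):
    """Return (source file, checkpoint directory) pairs for one benchmark."""
    in_dir, out_dir = Path(in_data_dir), Path(out_data_dir)
    return [
        (in_dir / split_file, out_dir / ckpt_name)
        for split_file, ckpt_name in SPLIT_TO_CKPT.items()
    ]


if __name__ == "__main__":
    for src, dst in plan_conversions("raw_data/PubMed_20k_RCT", "data"):
        print(f"Converting {src} to {dst}")
```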
Testing: Converting the data
Command:
./convert_to_classifier_input.py \
--bert_data_dir ../../data/scibert_scivocab_uncased \
--in_data_dir raw_data/PubMed_20k_RCT \
--out_data_dir data
Output:
Prepping embedding
Registered embedding_util:device
Registered embedding_util:tok,model
setup
Converting raw_data/PubMed_20k_RCT/train.txt to data/training_data
Converting to records.
Embedding
Initializing embedding_util:device
Initializing embedding_util:tok,model
5627it [34:24, 2.73it/s]
Converting to sentence_classifier.util.TrainingData
Saving as mock ckpt
Converting raw_data/PubMed_20k_RCT/dev.txt to data/validation_data
... continues for validation and testing
Files Produced:
Testing: Using the data
Command:
Output: