Document Splitting
Data Link : https://www.icloud.com/iclouddrive/081jNFZK6tVBJYKq8V3D1J2rw#Data
Model and Submission link: https://www.icloud.com/iclouddrive/02brnREc6_CVpVWzjO1_fg_oQ#Submission
Final Thesis report: https://www.overleaf.com/5948359792rndwxymmpbjx
bash-3.2$ cd DocumentSplittingChallenge/ bash-3.2$ du -hs * 8.5G corpus1 7.3G corpus2 bash-3.2$ ls */*/* corpus1/TrainTestSet/Testset.zip corpus2/TrainTestSet/Testset.zip corpus1/TrainTestSet/Trainset: Doclengths_of_the_individual_docs_TRAIN.json data VerzoekEnBesluit ocred_text.csv.gz corpus2/TrainTestSet/Trainset: Doclengths_of_the_individual_docs_TRAIN.json ocred_text.csv.gz data
The folder contains 2 traintest sets, corpus1 and corpus2. Both have the same setup:
json
ocred_text.csv.gz
the ocred text from each page in each pdf in the corpus Run python script mlp_classification_train.py with variable parser arguments:
--project [project name]: where the model and logging are stored, default location ./models/mlp_model/[project name]
--model [model name]: choosing model, vgg_lstm (VGG-LSTM) or bert (VGG-BERT)
--weighted [weighted dir]: load pretrained weighted, if you don't want to use pretrained weighted, ignore this argument
--device [select device]: cuda or cpu
--epochs [num epochs]: setting number of training epochs
--batch_size [num batch]: setting batch size
--l_r [learning rate]: setting learning rate
--loss [loss function]: focal or bce
--optim [optimizer]: sam, adam or sgd
--augment [bool]: True or False, augment training data, only class 1
For example, train data with VGG-BERT model, without pretrained weighted, running on GPU, set epochs is 350, batch size 64, learning rate 0.001, focal loss, sam optimizer and augment training data, saving location ./models/mlp_model/train_bert:
python mpl_classification_train.py --model bert --device cuda --epochs 350 --batch_size 64 --l_r 0.001 --loss focal --optim sam --augment True --project train_bert
to change to VGG-LSTM, replace
--model bert
by
--model vgg_lstm
to change to VGG only, replace
--model bert
by
--model vgg_only
to change to LSTM only, replace
--model bert
by
--model lstm_only
to change to BERT only, replace
--model bert
by
--model bert_only
Run python script mlp_classification_train.py with variable parser arguments:
--project [project name]: where the confusion matrix are stored, default location ./models/mlp_model/[project name]
--model [model name]: choosing model, vgg_lstm (VGG-LSTM) or bert (VGG-BERT)
--weighted [weighted dir]: load pretrained weighted
--device [select device]: cuda or cpu
--batch_size [num batch]: test batch size
--mode [mode run]: test or train, default is train
For example, test VGG-BERT model with best vailidation loss model, running on CPU, test batch size 256, weighted location
./models/mlp_model/train_bert/best_val_loss.pt
saving location
./models/mlp_model/test_bert
python mlp_classificaiton_train.py --mode test --project test_bert --device cpu --batch_size 256 --weighted ./models/mlp_model/train_test/best_val_loss.pt