BU-Spark / ml-herbarium

Herbaria ML

Build classifier for typed vs handwritten text #108

Open funkyvoong opened 1 year ago

trgardos commented 1 year ago

Collecting some candidates for handwritten datasets:

  1. https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
  2. https://cvl.tuwien.ac.at/research/cvl-databases/an-off-line-database-for-writer-retrieval-writer-identification-and-word-spotting/

Another option is to synthesize handwritten data. See ScrabbleGAN: https://github.com/amzn/convolutional-handwriting-gan

funkyvoong commented 1 year ago

Herbarium of the Future paper: https://www.cell.com/trends/ecology-evolution/fulltext/S0169-5347(22)00295-6

Detecting Handwritten and Printed Text from Doctor's Notes: https://www.proquest.com/docview/2505259735?pq-origsite=gscholar&fromopenview=true

kabilanmohanraj commented 1 year ago

Updates (27th June 2023):

Datasets in use:

Unused for now:

  1. COCO-Text (usable images need to be strictly filtered first)
  2. CVL-HW (lines of text need to be segmented using CRAFT; working on this now)
  3. CVIT-HW (the text looks too regular, almost like printed text, and does not closely represent our data)

Work done:

  1. I went through readings on preprocessing steps for OCR. Image aspect ratio plays a major role, so I plotted its distribution across our data and, based on that, wrote dataset-specific resizing and cropping transforms for each of the datasets we are currently using.
  2. Populated the test set with more images (working to increase the number of samples further).
  3. Testing out a DenseNet model as an alternative to VGG16, because VGG16 overfits very quickly.
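
The aspect-ratio-aware resizing in item 1 can be sketched as a letterbox computation: scale the image to fit the model's square input without distorting text strokes, then pad the shorter side. This is a minimal sketch with an assumed 224-pixel input size, not the project's actual transforms.

```python
TARGET = 224  # assumed square model input size

def letterbox_size(width, height, target=TARGET):
    """Return (resized_w, resized_h, pad_x, pad_y) that fit an image into
    a target x target canvas while preserving its aspect ratio, so text
    strokes are not stretched by a naive square resize."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x = (target - new_w) // 2  # horizontal padding, split evenly
    pad_y = (target - new_h) // 2  # vertical padding, split evenly
    return new_w, new_h, pad_x, pad_y
```

A wide label crop of 448x224, for example, is scaled to 224x112 and padded top and bottom rather than stretched vertically.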

Model Performance: Our pipeline accuracy currently plateaus at about 80%. I have identified the problematic images the model misclassifies, and I am adding more suitable images to counter this.

I think this is a promising direction: the COCO-Text dataset was not included this time (last week it was part of the training set), yet the model's performance barely dropped, which suggests the per-dataset preprocessing is effective.

Current Tasks:

  1. Add samples to test dataset
  2. Evaluate DenseNet model performance
  3. Preprocess CVL-HW dataset
  4. [Prof. Langdon] Add more samples with varied fonts to Typed text (Text+Font -> PDF -> Images)
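
The font-variation task in item 4 amounts to enumerating text/font/size/style combinations and rendering each one through the Text+Font -> PDF -> Images pipeline. A stdlib sketch of the job matrix (the font list, sizes, styles, and file-naming scheme are placeholder assumptions, and the actual rendering step is omitted):

```python
from itertools import product

# Assumed rendering parameters -- not the project's actual font list.
FONTS = ["Liberation Serif", "Liberation Sans", "Courier New"]
SIZES = [10, 12, 14, 18]
STYLES = ["regular", "bold", "italic"]

def render_jobs(texts):
    """Yield one (text, font, size, style, out_name) rendering job per
    combination, so each label string is rendered in every font variant."""
    for i, (text, font, size, style) in enumerate(
            product(texts, FONTS, SIZES, STYLES)):
        out_name = f"typed_{i:05d}.jpg"  # hypothetical naming scheme
        yield text, font, size, style, out_name
```

Each job would then be rendered to PDF and rasterized, giving 36 typed variants per label string under these assumed parameters.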

kabilanmohanraj commented 1 year ago

Updates (30th June 2023):

  1. Added synthetic font data with different font sizes and styles (LibreOffice UNO API -> ODT file -> PDF file -> JPG image -> CRAFT -> individual images). For sample data, please refer here [files] [images]. Adding the new data increased the accuracy score.
  2. Modified the preprocessing pipeline: added erosion (a morphological operation).
  3. DenseNet121 model accuracy is over 88% (F1 score > 0.9). Unfroze more layers for fine-tuning; currently tuning how many layers to unfreeze.
  4. Discussion with Freddie:
     a. Focus more on the data.
     b. Metrics to classify plant sample images as handwritten or typed (looking into this).
     c. Pointers on DocAI models (Hugging Face, model distillation) (looking into this as well).
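
The erosion step added in item 2 shrinks foreground regions, thinning text strokes in a binarized image. A pure-Python sketch with a 3x3 square kernel to illustrate the effect (in practice this would be `cv2.erode` or an equivalent library call):

```python
def erode(img):
    """Binary erosion with a 3x3 square kernel. img is a 2D list of 0/1;
    a pixel survives only if its entire 3x3 neighbourhood is 1, so thin
    strokes get thinner and single-pixel noise disappears."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(img[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out
```

Eroding a solid 5x5 block of ones, for instance, leaves only its inner 3x3 core.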

kabilanmohanraj commented 1 year ago

Updates (3rd July 2023):

  1. Added more test images; over 100 were handpicked.
  2. Both DenseNet121 and a custom VGG16-style model (trained from scratch) reach about 88% accuracy, with DenseNet slightly lower.
  3. Tried out some Document AI models hosted on Hugging Face; they do not identify labels from our samples well. Looking to annotate some samples so such models can be fine-tuned.
  4. Working on a post-processing pipeline that classifies plant-sample sheets based on the average confidence score of each segment's classification. The script is almost done.
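
The post-processing idea in item 4 can be sketched as follows: each segmented text crop gets a (label, confidence) prediction, and the whole specimen sheet takes the label with the highest mean confidence. Function and label names here are illustrative assumptions, not the project's actual script.

```python
from collections import defaultdict

def classify_sheet(predictions):
    """predictions: list of (label, confidence) pairs, one per segmented
    text crop on the sheet. Returns the label whose crops have the
    highest average confidence."""
    totals, counts = defaultdict(float), defaultdict(int)
    for label, conf in predictions:
        totals[label] += conf
        counts[label] += 1
    means = {label: totals[label] / counts[label] for label in totals}
    return max(means, key=means.get)
```

A sheet with two high-confidence "handwritten" crops and one weaker "typed" crop would be filed as handwritten under this rule.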

kabilanmohanraj commented 1 year ago

Updates (24th July 2023):

  1. Classification task:
     1.1 Implemented a Transformer classifier exhibiting a classification accuracy of about 96%.
     1.2 Attempted an AdaBoost-type training scheme with a simple neural network; the accuracy was only about 64%.
     1.3 Updated the post-processing (plant sample classification) step with the Transformer-based pipeline from 1.1. The classified images are in their respective folders.
     1.4 Did readings on Transformers, the attention mechanism, multi-head attention, TrOCR, and the Hugging Face implementation of the TrOCR pipeline (17th-24th July).
     1.5 Code commenting.
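
The scaled dot-product attention at the core of the Transformer readings in 1.4 can be written in a few lines of NumPy. This is a single head with no learned projections, meant to illustrate the mechanism only, not the project's actual classifier:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    Each output row is a similarity-weighted average of the rows of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the softmax rows sum to 1, each query position returns a convex combination of the value vectors; when queries and keys match sharply, the output approaches the corresponding value row.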