This repository contains a software pipeline to process herbarium specimens. There are currently three tasks this project aims to accomplish:

1. Transcription (primary): locate and transcribe the handwritten specimen labels with a transformer-based OCR model.
2. Vision (secondary): classify plant specimens using taxon labels as ground truth, to supplement the OCR model with a priori knowledge.
3. Phenology (tertiary): identify the phenology of a specimen and possibly predict its biological life cycle stage.
The run.ipynb file demonstrates how to run each of the three tasks. Additionally, each of the aforementioned tasks has supporting files and documentation in its respective folder: the primary task is in the transcription folder (README), the secondary in the vision folder (README), and the tertiary in the phenology folder (README).
The changing climate increases stressors that weaken plant resilience, disrupting forest structure and ecosystem services. Rising temperatures drive more frequent droughts, wildfires, and invasive pest outbreaks, resulting in the loss of plant species. These losses have numerous detrimental effects, including lowered productivity, the spread of invasive plants, vulnerability to pests, and altered ecosystem structure. This project aims to aid climate scientists in capturing patterns in plant life amid a changing climate.
Herbarium specimens are pressed plant samples stored on paper. The specimen labels are handwritten and date back to the early 1900s. Each label records the curator's name, their institution, the species and genus, and the date the specimen was collected. Because the labels are handwritten, they are not readily accessible from an analytical standpoint: at present, the data cannot be analyzed to study the impact of climate on plant life.
Digitized samples are an invaluable source of information for climate change scientists, providing key insights into biodiversity change over the last century. Digitized specimens will facilitate easier dissemination of information and give more people access to the data. If successful, the project would enable users from various domains in environmental science to further studies pertaining to climate change and its effects on flora and even fauna.
The Harvard University Herbaria aims to digitize the valuable handwritten information on herbarium specimens, which contain crucial insights into biodiversity changes in the Anthropocene era.
The main challenge is to develop a transformer-based optical character recognition (OCR) model using deep learning techniques to accurately locate and extract the specimen labels on the samples and preserve them digitally. The secondary task involves building a plant classifier, using taxon labels as ground truth, to supplement the OCR model as a source of a priori knowledge. The tertiary goal involves identifying the phenology of the plant specimen under consideration [will be updated after discussing with the client] and possibly predicting the biological life cycle stage of the plant. The successful completion of these objectives will showcase the importance of herbaria in storing and disseminating data for various biological research areas. The ultimate goal is to revive and digitize this valuable information to promote its accessibility to the public and future generations.
./corpus
The corpus folder contains the code to generate the corpus file. The corpus file is a .pkl file of all possible pairs of genus and species. This file is used in transcription to match the extracted text to the corpus.
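As a rough illustration of how such a corpus can be used downstream, the sketch below loads a pickled list of (genus, species) pairs and fuzzy-matches a noisy OCR string against it with Python's standard difflib. The file name corpus.pkl and the list-of-tuples layout are assumptions for illustration, not necessarily the project's actual format.

```python
import pickle
from difflib import get_close_matches

# Load the corpus of (genus, species) pairs.
# NOTE: "corpus.pkl" and a list-of-tuples layout are assumptions.
with open("corpus.pkl", "rb") as f:
    pairs = pickle.load(f)

# Flatten each pair into a single "Genus species" string for matching.
names = [f"{genus} {species}" for genus, species in pairs]

def match_label(ocr_text: str, cutoff: float = 0.6) -> list[str]:
    """Return the closest corpus entries to a noisy OCR transcription."""
    return get_close_matches(ocr_text, names, n=3, cutoff=cutoff)

# Example: a misread handwritten label.
print(match_label("Quercus albaa"))  # e.g. ["Quercus alba", ...]
```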
./CRAFT
The CRAFT folder contains the code to run the CRAFT model. The CRAFT model is used to detect text in the images and place bounding boxes around it.
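For reference, here is a minimal detection sketch assuming the third-party craft-text-detector pip package rather than this repo's own wrapper; the Craft class, its arguments, and the image path are assumptions about that package's API, and the code here may differ from how this folder invokes the model.

```python
# pip install craft-text-detector  (third-party package; an assumption,
# not necessarily what this repo's CRAFT folder uses)
from craft_text_detector import Craft

craft = Craft(output_dir="craft_output", crop_type="box", cuda=False)

# Returns detected text regions; "boxes" are coordinates around each
# piece of text, usable for cropping label regions for OCR.
result = craft.detect_text("specimen_image.jpg")  # hypothetical path
print(result["boxes"])

craft.unload_craftnet_model()  # free model weights when done
```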
./EDA
The EDA folder contains an exploratory data analysis of the dataset. The EDA.ipynb file contains the code to generate the EDA. The EDA_Notebook_Spring_2023.ipynb file contains the latest output of the EDA.
./scraping
The scraping folder contains the code to scrape the GBIF database to download images for testing and training purposes. The scraping workflow also generates a mock corpus file and ground truth file.
To run transcription, open the dataset folder in the scraping folder and run datasetscraping.py.
Usage: 'python3 datasetscraping.py <dataset_type> [OPTIONAL ARGS]' where dataset_type is either 'dwca' or 'csv'.
Optional arguments:
-o, --output_path: Specify the output path for the images. Default is './output/'.
-p, --percent_to_scrape: Specify the percentage of the dataset to scrape. Default is 0.00015 (~1198 occurrences).
-u, --dataset_url: Specify the dataset URL to download a new dataset.
-c, --num_cores: Specify the number of cores to use. Default is 50.
-k, --keep: Keep current csv dataset, and do not unzip new dataset.
-h, --help: Print this help message.
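Example (paths and values illustrative): 'python3 datasetscraping.py dwca -o ./my_output/ -p 0.0005 -c 8' scrapes 0.05% of a Darwin Core Archive dataset into ./my_output/ using 8 cores.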
./trocr
The trocr folder contains the code to run the TrOCR model. The TrOCR model is used to transcribe the text from the images.
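As a point of reference, below is a minimal transcription sketch using the Hugging Face transformers implementation of TrOCR. The microsoft/trocr-base-handwritten checkpoint and the crop path are assumptions for illustration; the code in this folder may load or fine-tune its model differently.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Pretrained handwritten-text checkpoint; an assumption for illustration,
# not necessarily the checkpoint this repo uses.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# A label crop, e.g. one of the CRAFT bounding-box regions (hypothetical path).
image = Image.open("label_crop.jpg").convert("RGB")

# Encode the image, generate token IDs, and decode them to text.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```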