specimens2illustrations

About

This repository contains scripts to process botanical monographs published in the open-access journal PhytoKeys. The data in these monographs can be used to build multi-modal datasets, such as:

- herbarium specimen images plus the textual descriptions of the species represented
- herbarium specimen images plus the scientific illustrations created using the specimen as a reference

Process

graph LR
    subgraph s [" "]
        subgraph s0 ["Article download"]
            doi2xml["<b>make xml</b> For each DOI, <br/>get XML format data<br/><b>Input:</b> doi:10.3897/phytokeys.22.4041 <br/><b>Output:</b> downloads/10.3897/phytokeys.22.4041.xml"]
        end

        subgraph " "
            xmlproc["<b>make txt</b> For each XML file, <br/>extract relevant data and <br/>write to text delimited file<br/><b>Input:</b> downloads/10.3897/phytokeys.22.4041.xml<br/><b>Output:</b> data/10.3897/phytokeys.22.4041.txt"]
            doi2xml--"Initial text processing and image download"-->xmlproc
        end

        subgraph "" 
            txtproc["<b>make captions</b> For each txt file, <br/>parse caption into components and <br/>write to text delimited file<br/><b>Input:</b> data/10.3897/phytokeys.22.4041.txt<br/><b>Output:</b> data/10.3897/phytokeys.22.4041-captions.txt"]
            seg["<b>make segment</b> For each caption set,<br/>read illustration image and segment"]
            xmlproc--"Caption text processing and image segmentation"-->txtproc-->seg
        end
    end
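
The file naming shown in the diagram can be sketched in Python. This is only an illustration of the path convention, not code from this repository: the function name `pipeline_paths` is hypothetical, and stripping a leading `doi:` prefix is an assumption based on the example input above.

```python
from pathlib import PurePosixPath


def pipeline_paths(doi: str) -> dict[str, PurePosixPath]:
    """Map a DOI to the files that each make target reads or writes.

    Hypothetical helper: it mirrors the Input/Output paths in the diagram.
    """
    # Assumption: a leading "doi:" prefix is dropped before building paths.
    bare = doi.removeprefix("doi:")
    return {
        "xml": PurePosixPath("downloads") / f"{bare}.xml",           # make xml
        "txt": PurePosixPath("data") / f"{bare}.txt",                # make txt
        "captions": PurePosixPath("data") / f"{bare}-captions.txt",  # make captions
    }


for stage, path in pipeline_paths("doi:10.3897/phytokeys.22.4041").items():
    print(f"{stage}: {path}")
```

Because the DOI contains a slash, each output lands in a subdirectory named after the DOI prefix (here `10.3897`).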

How to use

On remote infrastructure (continuous integration with GitHub Actions)

This is the simplest way to run the software and to get an understanding of what it is doing, as you don't need to install any software locally.

Navigate to the Actions tab of the repository (see screenshot)
Click Makefile CI in the left-hand sidebar, then open the Run workflow dropdown and click the green Run workflow button (see screenshot)
Wait a moment and a new workflow run will appear at the top of the list (see screenshot)
Click on the workflow run to see the steps being executed (see screenshot)
Click on the build step to see the output of the actual commands (see screenshot)
When the build has completed, you can access its products (an artifact named data) in the artifacts list at the bottom of the screen (see screenshot)

Locally (on your own machine)

Prerequisites

You will need a git client to get the code from GitHub. To run the software you will need Python installed on your local machine, along with the build tool make, which manages the dependencies between targets. See the Useful links section below for resources on using and understanding make.

How to run

Open a command line shell
Clone the github repository into a directory on your local machine: git clone git@github.com:KewBridge/specimens2illustrations.git
Change into the new directory: cd specimens2illustrations
Create a virtual environment: python -m venv env
Activate the virtual environment: source env/Scripts/activate (Git Bash on Windows; on Linux or macOS use source env/bin/activate)
Install python dependencies: pip install -r requirements.txt
Use make to generate the target (the processed text files): make all
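
The location of the activation script depends on the platform: `venv` places it under `Scripts` on Windows and under `bin` elsewhere. The sketch below shows one way to work out the right path; the helper name `activate_script` is hypothetical and is not part of this repository.

```python
import os
from pathlib import Path


def activate_script(env_dir: str = "env", os_name: str = os.name) -> Path:
    """Return the path of the venv activation script for a platform.

    Hypothetical helper: os_name follows the values of os.name
    ("nt" on Windows, "posix" on Linux and macOS).
    """
    subdir = "Scripts" if os_name == "nt" else "bin"
    return Path(env_dir) / subdir / "activate"


# Print the command to run in a bash-compatible shell on this machine.
print(f"source {activate_script().as_posix()}")
```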

Note: because the Makefile defines dependencies between targets, make first runs the commands that download the XML format data for each DOI in the supplied list. (A DOI, or Digital Object Identifier, is a resolvable persistent identifier for a bibliographic work.) The DOIs are defined as a variable on the first line of the Makefile. The XML format data is then processed by xml2illustrationdata.py to generate the processed text files. See the comments in the Makefile for more details.
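
To illustrate the general idea behind dependency-driven builds, here is a minimal Python sketch of the timestamp check that make applies to each target: a target is rebuilt when it is missing or when any prerequisite is newer than it. The function `needs_rebuild` is hypothetical and simplifies make's behaviour; it is not taken from this repository's Makefile.

```python
import os


def needs_rebuild(target: str, prerequisites: list[str]) -> bool:
    """Return True if a make-style rule would rebuild `target`."""
    if not os.path.exists(target):
        return True  # a missing target is always built
    target_mtime = os.path.getmtime(target)
    for prerequisite in prerequisites:
        # A missing prerequisite has to be built first, which then
        # makes the target stale, so it also counts as a rebuild.
        if not os.path.exists(prerequisite):
            return True
        if os.path.getmtime(prerequisite) > target_mtime:
            return True  # prerequisite changed after the target was made
    return False


# Example: would the text file be regenerated from its XML download?
print(needs_rebuild(
    "data/10.3897/phytokeys.22.4041.txt",
    ["downloads/10.3897/phytokeys.22.4041.xml"],
))
```

This is why running make a second time without changing any inputs normally does no work: every target is already newer than its prerequisites.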

Useful links

Make
- Introduction to reproducibility with make
- Manual:
  - File name functions: https://www.gnu.org/software/make/manual/html_node/File-Name-Functions.html
  - Automatic variables: https://www.gnu.org/software/make/manual/html_node/Automatic-Variables.html
  - Special targets (eg .PRECIOUS): https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
Discussions about building multi-modal datasets:
- Specimens and illustrations - discussion #1
- Specimens and textual descriptions (TBC)

Problems / questions

Please raise an issue in the issue tracker for this repository.

Contributing

Contributions are welcome. Please first submit an issue describing the problem to be fixed or the new functionality proposed, and reference that issue in the commit message for traceability.

Contacts

Nicky Nicolson (n.nicolson@kew.org)
