KewBridge / specimens2illustrations

Parse figure caption so that we have the illustration parts associated with the specimen used as reference #12

Open nickynicolson opened 12 months ago

nickynicolson commented 12 months ago

Illustration figure captions are provided in the form:

In the XML version of the article, the alphabetic sub-illustration identifier (e.g. A) is shown in bold and the specimen reference (e.g. Saunders 8220) is shown in italics. A sample XML snippet is below:

<caption>
<p>
<italic>
<tp:taxon-name>Solanum agnoston</tp:taxon-name>
</italic>
S.Knapp. (
<bold>A, B</bold>
drawn from
<italic>Holm-Nielsen et al. 5115</italic>
<bold>C</bold>
drawn from
<italic>Jaramillo et al. 8832</italic>
). Illustration by Bobbi Angell.
</p>
</caption>

We should modify xml2illustrationdata.py to parse the captions so that we know:

  1. Which specimen was used as reference for each sub-illustration
  2. How many sub-illustrations we expect to find in each illustration image (helpful for checking #11)
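
As a starting point for discussion, here is a minimal sketch of how the two items above could be derived from a caption like the sample. It is not the existing code in xml2illustrationdata.py: the function name `parse_caption` is hypothetical, and the `xmlns:tp` declaration is added to the snippet only so that it parses on its own (in the full article the prefix would be declared on an ancestor element; the URI used here is an assumption). The idea is to collect bold labels until an italic specimen reference appears, then assign that specimen to all pending labels.

```python
import re
import xml.etree.ElementTree as ET

# Sample caption from this issue; the xmlns:tp declaration is added here
# (assumed URI) so the snippet is parseable in isolation.
SAMPLE = """<caption xmlns:tp="http://www.plazi.org/taxpub">
<p>
<italic>
<tp:taxon-name>Solanum agnoston</tp:taxon-name>
</italic>
S.Knapp. (
<bold>A, B</bold>
drawn from
<italic>Holm-Nielsen et al. 5115</italic>
<bold>C</bold>
drawn from
<italic>Jaramillo et al. 8832</italic>
). Illustration by Bobbi Angell.
</p>
</caption>"""


def parse_caption(caption_xml):
    """Map each bold sub-illustration label to the italic specimen that follows it."""
    root = ET.fromstring(caption_xml)
    para = root.find("p")
    mapping = {}
    pending = []  # labels seen since the last specimen reference
    for child in para:
        if child.tag == "bold":
            # "A, B" -> ["A", "B"]
            parts = re.split(r"[,\s]+", (child.text or "").strip())
            pending.extend(p for p in parts if p)
        elif child.tag == "italic" and len(child) == 0 and pending:
            # An italic with no child elements is treated as a specimen
            # reference; the taxon name italic wraps a tp:taxon-name child.
            specimen = (child.text or "").strip()
            for label in pending:
                mapping[label] = specimen
            pending = []
    return mapping


specimens = parse_caption(SAMPLE)
expected_count = len(specimens)  # number of sub-illustrations to expect
```

For the sample above this yields three labels, with A and B sharing one specimen. A caption with no bold elements gives an empty mapping, so the expected count would need a separate fallback.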

ErenKarabey commented 12 months ago

<fig id="F35" position="float" orientation="portrait">
<label>Figure 35.</label>
<caption>
<p>
<italic>
<tp:taxon-name>Solanum dulcamara</tp:taxon-name>
</italic>
L. (All drawn from live plants in Battleboro, Vermont, USA). Illustration by Bobbi Angell.
</p>
</caption>
<graphic xlink:href="PhytoKeys-022-001-g035.jpg" position="float" orientation="portrait" xlink:type="simple" id="oo_10291.jpg">
<uri content-type="original_file">https://binary.pensoft.net/fig/10291</uri>
</graphic>
</fig>

This figure and one other in the first paper (Knapp S (2013)) have no bold letters identifying the individual drawings, even though the image contains more than one plant illustration.
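
One possible way to flag such captions is sketched below. The function name `describe_caption` and the returned keys are hypothetical, the `xmlns:tp` declaration (with an assumed URI) is added only so the snippet parses in isolation, and the "(All drawn from ...)" pattern is taken from this single example, so it may not cover the other unlabelled figure.

```python
import re
import xml.etree.ElementTree as ET

# Caption of Figure 35 from this comment; xmlns:tp added (assumed URI)
# so that the snippet can be parsed on its own.
UNLABELLED = """<caption xmlns:tp="http://www.plazi.org/taxpub">
<p>
<italic>
<tp:taxon-name>Solanum dulcamara</tp:taxon-name>
</italic>
L. (All drawn from live plants in Battleboro, Vermont, USA). Illustration by Bobbi Angell.
</p>
</caption>"""


def describe_caption(caption_xml):
    """Report whether a caption has bold labels and, if not, any shared source text."""
    root = ET.fromstring(caption_xml)
    labelled = root.find(".//bold") is not None
    shared_source = None
    if not labelled:
        # Flatten the caption to plain text and collapse whitespace.
        text = " ".join("".join(root.itertext()).split())
        match = re.search(r"\(All drawn from ([^)]*)\)", text)
        if match:
            shared_source = match.group(1)
    return {"labelled": labelled, "shared_source": shared_source}


info = describe_caption(UNLABELLED)
```

Captions flagged as unlabelled could then be excluded from the sub-illustration count check, or reviewed by hand.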

ErenKarabey commented 12 months ago

Further inconsistency:

The 35th element returned by xml2illustrationdata.py (Figure 86 of the same paper):

<caption>
<p>
<italic>
<tp:taxon-name>Solanum seaforthianum</tp:taxon-name>
</italic>
Andrews. (
<bold>A</bold>
drawn from
<italic>Baker 10374</italic>
<bold>B</bold>
H drawn from
<italic>Thompson 947</italic>
<bold>C</bold>
drawn from
<italic>Hatschbach 60388</italic>
<bold>D</bold>
drawn from
<italic>Renderos 517</italic>
; E-g drawn from
<italic>McVaugh 20220</italic>
). Illustration by Bobbi Angell.
</p>
</caption>

In 'E-g' the 'g' is not capitalized and the range is not bold. 'B' and 'H' have no comma separating them, and 'H' is not bold.
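
A tolerant parser would probably need to normalize labels like these. The sketch below is one assumption-laden way to do it: `expand_labels` and `labels_in_tail` are hypothetical helpers, ranges are assumed to run over single letters, and unbolded labels are assumed to sit directly before the words "drawn from" in the text that follows a bold element.

```python
import re


def expand_labels(text):
    """Expand 'A, B' or 'E-g' into a list of upper-case single-letter labels."""
    labels = []
    for part in re.split(r"[,;\s]+", text.strip()):
        if not part:
            continue
        # A range such as 'E-g' (hyphen or en dash), case-insensitive.
        span = re.fullmatch(r"([A-Za-z])[-\u2013]([A-Za-z])", part)
        if span:
            start, end = (ord(c.upper()) for c in span.groups())
            labels.extend(chr(code) for code in range(start, end + 1))
        elif re.fullmatch(r"[A-Za-z]", part):
            labels.append(part.upper())
    return labels


# A single letter or letter range immediately before "drawn from".
UNBOLD_LABEL = re.compile(r"\b([A-Za-z](?:[-\u2013][A-Za-z])?)\s+drawn from")


def labels_in_tail(tail):
    """Find labels that were left unbolded in the text after an element."""
    match = UNBOLD_LABEL.search(tail or "")
    return expand_labels(match.group(1)) if match else []
```

With these helpers, the stray 'H' after the bold 'B' and the unbolded 'E-g' range could both be recovered from the `tail` text of the preceding element before the next italic specimen is assigned.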

ErenKarabey commented 12 months ago

In the article (S. Knapp 2016) - the second article - Figure 17:

<caption>
<p>
<italic>
<tp:taxon-name>
<tp:taxon-name-part taxon-name-part-type="genus" reg="Solanum">Solanum</tp:taxon-name-part>
<tp:taxon-name-part taxon-name-part-type="species" reg="madagascariense">madagascariense</tp:taxon-name-part>
</tp:taxon-name>
</italic>
Dunal.
<bold>A</bold>
Flowering branch
<bold>B</bold>
Berry showing thin pericarp through which seeds are visible (based on: A,
<italic>Bosser 16852</italic>
<bold>B</bold>
<italic>Homelle s.n.</italic>
). Adapted from
<xref ref-type="bibr" rid="B25">D’Arcy and Rakotozafy (1994)</xref>
with permission of
<named-content xlink:type="simple" content-type="institution" xlink:href="http://grbio.org/institution/mus%C3%A9um-1" id="NCID0EBKGK">Muséum</named-content>
National d’Histoire Naturelle.
</p>
</caption>

The label 'A' is not bold in the passage "Berry showing thin pericarp through which seeds are visible (based on: A,".