WFO-ID-pilots / text2matrix


dwca2csv.py: Broken HTML tags remain in species descriptions #12

Closed yjkiwilee closed 2 months ago

yjkiwilee commented 3 months ago

For example, in line 5 of begonia-desc.txt:

Monoecious herb. Stem 15–40 cm tall, unbranched, glabrous. Leaves typically 3–5; blade slightly succulent, never coriaceous, subsymmetric, narrowly reniform, reniform or reniform-orbicular, 5–11 × 5–11 cm, margin shortly triangular–lobed, dentate, upper surface usually matt green above with greyish white veins but occasionally concolorous matt green throughout, sparsely to moderately pubescent, lower surface concolorous green or with green veins and reddish tinged intervenal lamina, sparsely to moderately pubescent along veins, and intervenal regions glabrous. Inflorescences solitary to many; peduncle 10–45 cm long. Male flowers: tepals usually pink, rarely white, outer pair usually elliptic to broadly elliptic, occasionally orbicular, ovateelliptic or obovate, 1.5–2.3 × 1.5–2.5 cm, inner pair obovate to spatulate, 2–2.8 × 1.6–1.9 cm. u>Female flowers/u>: tepals same colour as males, outer two elliptic to elliptic-obovate, inner three obovate to spatulate, subequal, 1.3–1.5 × 0.7–0.9 cm; ovary unequally 3-winged with one wing longer than the other two; 3-locular, styles 3.

It's not straightforward to count instances of this issue since < and > are also used as normal less-than or greater-than signs.

nickynicolson commented 3 months ago

This is encoded as u&gt;Female flowers/u&gt;, so you could add these HTML entities to your tag-stripping code.
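As a rough illustration of that suggestion, here is a minimal sketch of a tag stripper that accepts both literal angle brackets and their entity-encoded forms. It is not the actual code in dwca2csv.py: the function name strip_tags, the KNOWN_TAGS whitelist, and the assumption that the raw data still carries the opening bracket (as < or &lt;) are all hypothetical. Restricting the pattern to known tag names is one way to leave ordinary less-than and greater-than signs in measurements alone.

```python
import re

# Hypothetical whitelist of inline tags expected in the descriptions;
# adjust to whatever actually occurs in the data.
KNOWN_TAGS = ("u", "i", "b", "em", "strong", "sub", "sup", "br", "p")

# Match an opening or closing tag whose brackets are either literal
# (< and >) or entity-encoded (&lt; and &gt;). Only whitelisted tag
# names match, so comparisons such as "< 5 cm" are left untouched.
TAG_RE = re.compile(
    r"(?:<|&lt;)\s*/?\s*(?:" + "|".join(KNOWN_TAGS) + r")\s*/?\s*(?:>|&gt;)",
    re.IGNORECASE,
)


def strip_tags(text: str) -> str:
    """Remove whitelisted HTML tags, whether literal or entity-encoded."""
    return TAG_RE.sub("", text)


def count_tags(text: str) -> int:
    """Count whitelisted tags, which sidesteps plain < and > signs."""
    return len(TAG_RE.findall(text))
```

A call such as strip_tags("&lt;u&gt;Female flowers&lt;/u&gt;: tepals") would return the text without the underline tags, while a phrase like "leaves &lt; 5 cm" passes through unchanged. A whitelist can still misfire on text such as "a <b> c" if that is ever meant as a comparison, so the tag list is worth checking against the real descriptions.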