WFO-ID-pilots / text2matrix


dwca2csv.py: Whitespace inserted in middle of words #11

Closed yjkiwilee closed 2 weeks ago

yjkiwilee commented 1 month ago

When running dwca2csv.py, spaces are inserted in the middle of some words. For example, line 775 of begonia-desc.txt reads:

Subshrub, ca. 3 m high, monoecious, pubescent, with both dendritic greyish trichomes, 0.1–0.4 mm long, and microscopic glandular trichomes. Stem erect, fl eshy, pubescent; internodes 1–3.5 cm long. Stipules 2.5–3 × 0.7–1.5 cm, lanceolate, apex apiculate, margin entire, pubescent, carinate, appressed, caducous. Leaves: petiole 6.3–11.6 cm long, cylindrical, pubescent; blade 13–18 × 19–28 cm, transversally elliptic, deeply lobed (lobes approximately half the length of their main vein), 6 or 7 lobes, asymmetric, basifi xed; base cordate; lobes with acute apex; margin serrulate; pubescent on both surfaces, more densely so on abaxial surface, discolorous, adaxial surface green, abaxial surface green-cinereous; venation actinodromous, 6 or 7 veins at base, slightly thickened. Infl orescence: dichasial cyme 32–39 cm long, ca. 180 flowers; peduncle 23.5– 27 cm long, cinereous; fi rst order bracts 4–6 × 1.5–2.5 mm, lanceolate, apex acuminate, margin entire, caducous. Staminate fl owers: pedicel 1–1.4 cm long, pilose; tepals 4, white, the outer pair larger 6–7.2 × 3–4 mm, ovate to elliptic, apex acute to obtuse, margin entire, concave, glabrescent on abaxial surface, the inner pair 5–6.2 × 1.8–2.3 mm, oblong to oblanceolate, apex obtuse to rounded, margin entire, concave, glabrous; androecium actinomorphic, stamens 32–48, fi laments 0.2–0.9 mm long, free, anthers 1–1.3 mm long, rimose, connective prolonged. Pistillate flowers [not seen]: bracteoles 2, opposite, borne on pedicel, just below ovary, caducous [scars seen on the pedicel from capsules]; styles 3, 1.6–2 mm long, bifi d, branches spirally-arranged, stigmatic papillae covering branches, stigmatic surface papillose, yellow [obtained from capsules]; ovary 5–6.7 mm long, trilocular, placentation axile, placenta entire [observed from capsules]. 
Capsules 6–7.5 × 11–14.6 mm [including wings], three-winged, glabrescent, brown when mature, dehiscing at the basal portion; wings unequal, larger one 5–7 × 6–7 mm, apex obtuse to rounded, smaller ones 5.8–7 × 0.6–1.6 mm. Seeds ca. 0.3 mm long, oblong.

These spaces then remain in the final JSON file produced by desc2matrix.py.

yjkiwilee commented 1 month ago

These spaces might have been introduced by dwca2csv.py (#6) when it removes HTML tags. Alternatively, I could try including a line in the prompt for desc2matrix.py to remove these errors.

nickynicolson commented 1 month ago

It looks like these are erroneous spaces in the data. If you open the begonia-desc.txt file in a text editor that supports regular expressions, you can search for the pattern basifi[^x]. The term that is split is "basifixed", so the regular expression looks for the first part of the split term, "basifi", followed by any character that is not "x".
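
The same search can also be run from Python instead of a text editor. This is a minimal sketch, not code from the repository: the function name find_split_lines and the sample lines are made up for illustration, and only the pattern basifi[^x] comes from the comment above.

```python
import re

# Pattern from the comment above: "basifi" followed by any character other than "x".
SPLIT_TERM = re.compile(r"basifi[^x]")


def find_split_lines(lines):
    """Return (line_number, matched_text) pairs for every match of the pattern.

    find_split_lines is a hypothetical helper, not part of dwca2csv.py.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in SPLIT_TERM.finditer(line):
            hits.append((number, match.group(0)))
    return hits


if __name__ == "__main__":
    # Sample lines for illustration; the first is adapted from the description quoted above.
    sample = [
        "6 or 7 lobes, asymmetric, basifi xed; base cordate",
        "6 or 7 lobes, asymmetric, basifixed; base cordate",
    ]
    for number, text in find_split_lines(sample):
        print(number, repr(text))
```

To check the real file, the lines of begonia-desc.txt could be passed in place of the sample list, for example by iterating over the open file (the file's encoding would need to be confirmed first). Because a line read from a file keeps its trailing newline, a line that ends in "basifi" would also be reported.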