ContentMine / phylotree

A repository for ami-phylotree development

OTU labels are now entirely numerical #49

Closed by rossmounce 9 years ago

rossmounce commented 9 years ago

Looking through the output from my latest code test using the 50-image test set: when NeXML is output (which it isn't always), the OTU labels, where present, are entirely numerical. Are we using a digits-only Tesseract whitelist dictionary? It seems like it to me. Sample NeXML output from one file (ijs.0.000653-0-000.pbm.nexml.xml) is below:

Perhaps I need to edit a config file in my usr/local ... tessdata to mirror what you have on your machine, Peter?

Full output for all 50 files is on github here: https://github.com/rossmounce/pluto-ONS/tree/master/testing/50-images

<?xml version="1.0" encoding="UTF-8"?>
<nexml xmlns="http://www.nexml.org/2009" xmlns:nex="http://www.nexml.org/2009" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <otus label="RootTaxaBlock">
  <otu id="otu1"/>
  <otu id="otu2">560171118 171777113 11511111 124647 03162681</otu>
  <otu id="otu3">336171113 131136110113 08111 147301 0114221451</otu>
  <otu id="otu4">310171115 50111363111611 0500462681</otu>
  <otu id="otu5">530171118 81081011111738 138114 4851 0064361</otu>
  <otu id="otu6"> 1380171113 1101167113 11116 218371 0415425121</otu>
  <otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>
  <otu id="otu8">3520111115 11911113 081111 13201 1130211851</otu>
  <otu id="otu9">860171115 9615111111 1.11116 218801 0115513291</otu>
  <otu id="otu10">3670171113 5111111115 11000 17691 01606461</otu>
  <otu id="otu11">3610171113 1602111917313 1011 132851 9113730181</otu>
 </otus>
 <trees>
  <tree id="T1">
   <node id="NT1.1" label="NT1.1" x="0.0" y="457.0" otu="otu1" root="true"/>
   <node id="NT1.2" label="NT1.2" x="244.0" y="375.0"/>
   <node id="NT1.3" x="308.0" y="259.0" label="67"/>
   <node id="NT1.4" x="411.0" y="186.0" label="82"/>
   <node id="NT1.5" x="467.0" y="138.0" label="65"/>
   <node id="NT1.6" label="NT1.6" x="480.0" y="93.0"/>
   <node id="NT1.7" x="493.0" y="330.0" label="99"/>
   <node id="NT1.8" x="553.0" y="375.0" label="85"/>
   <node id="NT1.9" x="641.0" y="413.0" label="99"/>
   <node id="NT1.10" label="NT1.10" x="688.0" y="132.0" otu="otu2"/>
   <node id="NT1.11" label="NT1.11" x="716.0" y="336.0" otu="otu3"/>
   <node id="NT1.12" label="NT1.12" x="721.0" y="439.0" otu="otu4"/>
   <node id="NT1.13" x="725.0" y="55.0" label="100"/>
   <node id="NT1.14" label="NT1.14" x="727.0" y="491.0" otu="otu5"/>
   <node id="NT1.15" label="NT1.15" x="746.0" y="29.0" otu="otu6"/>
   <node id="NT1.16" label="NT1.16" x="756.0" y="81.0" otu="otu7"/>
   <node id="NT1.17" label="NT1.17" x="763.0" y="183.0" otu="otu8"/>
   <node id="NT1.18" label="NT1.18" x="835.0" y="285.0" otu="otu9"/>
   <node id="NT1.19" label="NT1.19" x="847.0" y="234.0" otu="otu10"/>
   <node id="NT1.20" label="NT1.20" x="858.0" y="388.0" otu="otu11"/>
   <edge source="NT1.15" target="NT1.13"/>
   <edge source="NT1.16" target="NT1.13"/>
   <edge source="NT1.10" target="NT1.6"/>
   <edge source="NT1.17" target="NT1.5"/>
   <edge source="NT1.19" target="NT1.4"/>
   <edge source="NT1.1" target="NT1.2"/>
   <edge source="NT1.11" target="NT1.8"/>
   <edge source="NT1.12" target="NT1.9"/>
   <edge source="NT1.18" target="NT1.7"/>
   <edge source="NT1.20" target="NT1.9"/>
   <edge source="NT1.14" target="NT1.2"/>
   <edge source="NT1.13" target="NT1.6"/>
   <edge source="NT1.6" target="NT1.5"/>
   <edge source="NT1.5" target="NT1.4"/>
   <edge source="NT1.4" target="NT1.3"/>
   <edge source="NT1.3" target="NT1.2"/>
   <edge source="NT1.3" target="NT1.7"/>
   <edge source="NT1.7" target="NT1.8"/>
   <edge source="NT1.8" target="NT1.9"/>
  </tree>
 </trees>
</nexml>
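
For checking all 50 outputs rather than eyeballing each file, a minimal sketch along these lines could flag OTUs whose label contains only digits, dots and whitespace. It assumes, as in the sample above, that the label is the text content of each `otu` element (falling back to a `label` attribute). The function name `numeric_only_otus` and the `otuX` entry in the sample are hypothetical and used only for illustration.

```python
import re
import xml.etree.ElementTree as ET

NEXML_NS = {"nex": "http://www.nexml.org/2009"}
# A label is suspicious if it holds nothing but digits, dots and whitespace.
NUMERIC_ONLY = re.compile(r"^[\d.\s]+$")


def numeric_only_otus(nexml_text):
    """Return the ids of OTUs whose label looks entirely numerical."""
    root = ET.fromstring(nexml_text)
    suspicious = []
    for otu in root.iterfind(".//nex:otu", NEXML_NS):
        # Assumption: label is the element text, else a "label" attribute.
        label = (otu.text or otu.get("label") or "").strip()
        if label and NUMERIC_ONLY.match(label):
            suspicious.append(otu.get("id"))
    return suspicious


# Trimmed sample; "otuX" is a made-up entry showing a label that passes.
SAMPLE = """<nexml xmlns="http://www.nexml.org/2009">
 <otus label="RootTaxaBlock">
  <otu id="otu1"/>
  <otu id="otu2">560171118 171777113 11511111 124647 03162681</otu>
  <otu id="otu7"> 36611118 1111611 1.1146 218341 0115425091</otu>
  <otu id="otuX">Bacillus firmus IAM 12464T (D16268)</otu>
 </otus>
</nexml>"""

print(numeric_only_otus(SAMPLE))
```

Empty OTUs such as `otu1` are skipped, so only labels that are present and numerical are reported.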

petermr commented 9 years ago

This looks like a Tesseract problem. Can you run Tesseract directly on the image and see what it gives?
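
One way to do that check from a script is sketched below. It relies on the standard `tesseract <image> stdout [configfile]` command-line form; the image name in the usage comment is an assumption derived from the NeXML file name, and passing `phylo` as the config name is likewise an assumption about how the config is installed. The helper names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, config=None):
    """Build the argument list for a direct Tesseract run."""
    # Using "stdout" as the output base prints the recognised text
    # instead of writing it to a .txt file.
    cmd = ["tesseract", image_path, "stdout"]
    if config:
        # Optional config file name, looked up by Tesseract in tessdata.
        cmd.append(config)
    return cmd


def run_tesseract(image_path, config=None):
    """Run Tesseract and return the recognised text (needs tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, config),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage (assumed file and config names):
#   plain = run_tesseract("ijs.0.000653-0-000.pbm")
#   with_config = run_tesseract("ijs.0.000653-0-000.pbm", config="phylo")
```

Comparing the output with and without the config should show whether the config, rather than the image, is what turns the labels into digits.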

rossmounce commented 9 years ago

tesseract CLI output on that image:

82

100

— Bacillus nova/is LMG 21837T (AJ542512)

65

— Bacillus vireti LMG 21834T (AJ542509)

Bacillus firmus IAM 12464T (D16268)

67

99

85

99

Bacillus flexus DSM 1320T (ABOZ1185)
Bacillus subti/is NCDO 1769T (X60646)
Bacillus gelatini LMG 21880T (AJ551329)
Bacillus barbaricus DSM 14730T (AJ422145)
Bacillus macauensis JCM 13285T (AY373018)
Bacillus solisalsi YC1T (EU046268)
Bacillus a/ca/ophilus DSM 485T (X76436)

Paenibacillus polymyxa NCDO 1774T (X60632)

rossmounce commented 9 years ago

Solved. My tessdata phylo config file was different. See issue #41
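
For anyone hitting the same symptom, a rough sketch for inspecting a Tesseract config file is below. It looks for the `tessedit_char_whitelist` variable and reports whether the whitelist excludes all letters. The actual contents of the phylo config are not shown in this thread, so both example configs are hypothetical, and the helper names are made up for illustration.

```python
def read_whitelist(config_text):
    """Return the tessedit_char_whitelist value from config text, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # "<variable> <value>"
        if parts[0] == "tessedit_char_whitelist":
            return parts[1].strip() if len(parts) > 1 else ""
    return None


def is_digits_only(whitelist):
    """True if a whitelist is set and contains no alphabetic characters."""
    return bool(whitelist) and not any(ch.isalpha() for ch in whitelist)


# Hypothetical config contents, for illustration only.
DIGITS_CONFIG = "tessedit_char_whitelist 0123456789-.\n"
MIXED_CONFIG = "# letters allowed\ntessedit_char_whitelist ABCabc0123456789()\n"

print(is_digits_only(read_whitelist(DIGITS_CONFIG)))
print(is_digits_only(read_whitelist(MIXED_CONFIG)))
```

A whitelist without letters would explain species names coming back as strings of digits.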