rossmounce closed this issue 9 years ago
This looks like a Tesseract problem. Can you run Tesseract directly on the image and see what it gives?
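For anyone repeating this check from a script, here is a minimal sketch of calling the Tesseract CLI directly on one image. It assumes the `tesseract` executable is on the PATH. The function names and the image filename are hypothetical, and passing `phylo` as a trailing config name is an assumption about how the config mentioned in this thread would be applied.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, config: Optional[str] = None) -> List[str]:
    """Build a Tesseract CLI call that prints the recognised text to stdout."""
    # Using "stdout" as the output base makes Tesseract print the text
    # instead of writing it to a .txt file.
    command = ["tesseract", image_path, "stdout"]
    if config:
        # Assumption: a trailing config name is resolved against the local
        # tessdata directory, so its contents can differ between machines.
        command.append(config)
    return command


def run_tesseract(image_path: str, config: Optional[str] = None) -> str:
    """Run Tesseract on one image and return its text output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, config),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Running it once with and once without the config name, for example `run_tesseract("tree.png")` and `run_tesseract("tree.png", "phylo")` with a hypothetical filename, should show whether the config file is what changes the output.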
tesseract CLI output on that image:
82
100
— Bacillus nova/is LMG 21837T (AJ542512)
65
— Bacillus vireti LMG 21834T (AJ542509)
Bacillus firmus IAM 12464T (D16268)
67
99
85
99
Bacillus flexus DSM 1320T (ABOZ1185)
Bacillus subti/is NCDO 1769T (X60646)
Bacillus gelatini LMG 21880T (AJ551329)
Bacillus barbaricus DSM 14730T (AJ422145)
Bacillus macauensis JCM 13285T (AY373018)
Bacillus solisalsi YC1T (EU046268)
Bacillus a/ca/ophilus DSM 485T (X76436)
Paenibacillus polymyxa NCDO 1774T (X60632)
Solved. My tessdata phylo config file was different. See issue #41.
Looking through the output from my latest code test on the 50-image test set: when NeXML is output (which it isn't always), the OTU labels, if present, are entirely numerical. Are we using a digits-only Tesseract character whitelist? It seems so to me. Sample NeXML output from one file (ijs.0.000653-0-000.pbm.nexml.xml) below:
Perhaps I need to edit a config file in my usr/local ... tessdata to mirror what you have on your machine, Peter?
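One way to compare config files between machines is to check what each one sets for `tessedit_char_whitelist`, the Tesseract variable that restricts which characters can be recognised. The sketch below assumes the usual config layout of one `variable value` pair per line. The helper names are hypothetical, and the config text has to be read from whichever file your install actually uses.

```python
import string
from typing import Optional


def read_char_whitelist(config_text: str) -> Optional[str]:
    """Return the value of tessedit_char_whitelist, or None if it is not set."""
    whitelist = None
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        parts = stripped.split(None, 1)
        if parts[0] == "tessedit_char_whitelist":
            # If the variable appears more than once, the last setting is kept.
            whitelist = parts[1] if len(parts) > 1 else ""
    return whitelist


def is_digits_only(whitelist: Optional[str]) -> bool:
    """True when a whitelist is set and contains nothing but 0-9."""
    return bool(whitelist) and all(ch in string.digits for ch in whitelist)
```

If `is_digits_only(read_char_whitelist(text))` is true on one machine and false on the other, that difference alone could explain labels coming out purely numerical.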
Full output for all 50 files is on GitHub here: https://github.com/rossmounce/pluto-ONS/tree/master/testing/50-images
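To check all 50 outputs without reading each one by eye, a small script can pull the OTU labels out of each NeXML file and flag the files where every label is numeric. This is a rough sketch: it assumes labels are stored in the `label` attribute of `otu` elements, it matches elements by local name so that the namespace prefix does not matter, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def otu_labels(nexml_text: str) -> List[str]:
    """Collect the label attribute of every otu element in a NeXML document."""
    root = ET.fromstring(nexml_text)
    labels = []
    for element in root.iter():
        # Strip any "{namespace}" prefix and compare the local name only.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "otu":
            labels.append(element.get("label", ""))
    return labels


def all_labels_numeric(labels: List[str]) -> bool:
    """True when there is at least one label and every label is all digits."""
    return bool(labels) and all(label.strip().isdigit() for label in labels)
```

Applying `all_labels_numeric(otu_labels(text))` to the contents of each `.nexml.xml` file would give a quick count of how many files in the test set are affected.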