Open petermr opened 9 years ago
ok, I will need to create a regular expression to apply across all *.nwk files that will print (per file) the number of: A) empty tip labels B) non-empty tip labels that do not conform to the GenBank ID standard e.g. "A 123"
On Fri, Aug 7, 2015 at 10:07 AM, Ross Mounce notifications@github.com wrote:
ok, I will need to create a regular expression to apply across all *.nwk files that will print the number of: A) empty tip labels
This isn't necessary. It can be done trivially from the NEXML using xpath:
//nexml:otu[normalize-space(.)='']
will do it. This is something that can be put into ami-phylo.
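A minimal sketch of how that XPath count could be run from Java with the standard javax.xml.xpath API. The class name EmptyOtuCounter is hypothetical, and two things are assumed rather than taken from ami-phylo: that the NeXML namespace URI is http://www.nexml.org/2009, and that the tip label is stored as the text content of each otu element (which is what the XPath above tests).

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ami-phylo.
class EmptyOtuCounter {
    // Assumption: the namespace URI used by the NeXML files.
    static final String NEXML_NS = "http://www.nexml.org/2009";

    /** Counts otu elements whose text content is empty or whitespace only. */
    static int countEmptyOtus(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for the nexml: prefix to resolve
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "nexml".equals(prefix) ? NEXML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            // count(...) wraps the expression from the comment above.
            Double n = (Double) xpath.evaluate(
                    "count(//nexml:otu[normalize-space(.)=''])",
                    doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }
}
```

Calling countEmptyOtus once per file and printing the result next to the file name would give the per-file count asked for in (A).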
B) non-empty tip labels that do not conform to the GenBank ID standard e.g. "A 123"
There is already a tool for running regexes in ami-phylo (MatchSpecies):
    public void matchSpecies(HOCRReader hocrReader) {
        if (speciesPattern != null) {
            List<HtmlSpan> lines = hocrReader.getNonEmptyLines();
            for (HtmlSpan line : lines) {
                List<String> matchList = HOCRReader.matchPattern(line, speciesPattern);
                LOG.trace((matchList.size() == 0 ? "?? "+HOCRReader.getSpacedValue(line).toString() : matchList));
            }
        }
    }
I will adapt this to work on otu, rather than direct HOCR output
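As a rough illustration of what the otu-based version might check, here is a self-contained sketch that classifies a list of tip labels as empty or non-conforming. TipLabelChecker and its method are hypothetical names, not ami-phylo code. The accession pattern is an assumption: it accepts only the classic nucleotide formats (one letter plus five digits, or two letters plus six digits, with an optional ".version" suffix) and would need widening for other accession types.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ami-phylo.
class TipLabelChecker {
    // Assumed pattern: 1 letter + 5 digits or 2 letters + 6 digits,
    // optionally followed by a version such as ".1".
    static final Pattern ACCESSION =
            Pattern.compile("(?:[A-Z]\\d{5}|[A-Z]{2}\\d{6})(?:\\.\\d+)?");

    /** Returns {emptyCount, nonConformingCount} for the given tip labels. */
    static int[] countProblems(List<String> labels) {
        int empty = 0;
        int nonConforming = 0;
        for (String label : labels) {
            String s = (label == null) ? "" : label.trim();
            if (s.isEmpty()) {
                empty++;                       // case A: empty tip label
            } else if (!ACCESSION.matcher(s).matches()) {
                nonConforming++;               // case B: does not look like an accession
            }
        }
        return new int[] { empty, nonConforming };
    }
}
```

With this pattern a label such as "A 123" is counted under (B), because matches() requires the whole label to fit the pattern.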
Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069
I have calculated that the median number of missing/empty tip labels per file is 4. The mean is 6.8. Minimum: 0 (297 files have zero empty tip labels). Maximum: 199 (a rather corrupt-looking file: ijs.0.001149-0-003).
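For reference, a small sketch of how such summary figures could be derived once the per-file empty-label counts are available. LabelStats is a hypothetical name, and the thread does not say how the numbers above were actually computed.

```java
import java.util.Arrays;

// Hypothetical helper for summarising per-file empty-label counts.
class LabelStats {
    /** Median of the counts; the average of the two middle values if the length is even. */
    static double median(int[] counts) {
        int[] sorted = counts.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Arithmetic mean of the counts. */
    static double mean(int[] counts) {
        long sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return (double) sum / counts.length;
    }
}
```

A single outlier like the 199-label file pulls the mean well above the median, which is consistent with the gap between 6.8 and 4 reported above.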
Ross Mounce: almost all Newick files have one or more missing tip labels, so today I'm just going to plough through adding in the missing labels manually. After this is done I will do a GenBank lookup to go from GB accession number -> GB taxon ID. (Just to say, this is all version controlled on GitHub, so no change will go undocumented...)
ISSUE: describe this problem precisely and attempt to formulate primary causes