ContentMine / phylotree

A repository for ami-phylotree development

Missing tip labels #6

Open petermr opened 9 years ago

petermr commented 9 years ago

Ross Mounce: almost all Newick files have one or more missing tip labels, so today I'm just going to plough through adding in the missing labels manually. After this is done I will do a GenBank lookup to go from GB accession number -> GB taxon ID. (Just to say, this is all version-controlled on GitHub, so no change will go undocumented.)

ISSUE: describe this problem precisely and attempt to formulate primary causes

rossmounce commented 9 years ago

ok, I will need to create a regular expression to apply across all *.nwk files that will print (per file) the number of: A) empty tip labels B) non-empty tip labels that do not conform to the GenBank ID standard e.g. "A 123"
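A minimal sketch of the two counts described above, for a single Newick string. Everything here is an assumption for illustration: the class and method names are hypothetical (not part of ami-phylo), the tip regex assumes unquoted labels, and the accession pattern is a simplified guess at an accession-like shape (one or two capital letters plus five or six digits, optional version suffix), not the GenBank ID standard the thread has in mind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TipLabelAudit {
    // A tip label sits between '(' or ',' and the next ',', ')' or ':'.
    // Internal-node labels follow ')' and are therefore never captured.
    // ASSUMPTION: labels are unquoted and contain no structural characters.
    private static final Pattern TIP = Pattern.compile("[(,]([^(),:;]*)(?=[,):])");

    // ASSUMPTION: simplified accession-like shape, e.g. U12345 or AB123456.1.
    private static final Pattern ACCESSION = Pattern.compile("[A-Z]{1,2}\\d{5,6}(\\.\\d+)?");

    // Returns every tip label in order of appearance, trimmed; empty tips come back as "".
    static List<String> tipLabels(String newick) {
        List<String> labels = new ArrayList<>();
        Matcher m = TIP.matcher(newick);
        while (m.find()) {
            labels.add(m.group(1).trim());
        }
        return labels;
    }

    // Count A: tips with no label at all.
    static int countEmpty(String newick) {
        int n = 0;
        for (String label : tipLabels(newick)) {
            if (label.isEmpty()) n++;
        }
        return n;
    }

    // Count B: non-empty tips that do not match the assumed accession shape.
    static int countNonConforming(String newick) {
        int n = 0;
        for (String label : tipLabels(newick)) {
            if (!label.isEmpty() && !ACCESSION.matcher(label).matches()) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String newick = "((AB123456,),(A 123,U12345:0.1));"; // toy tree, not real data
        System.out.println("empty=" + countEmpty(newick)
                + " nonConforming=" + countNonConforming(newick));
    }
}
```

Running this per file would only need each *.nwk file read into a string first; that loop is left out to keep the sketch short.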

petermr commented 9 years ago

On Fri, Aug 7, 2015 at 10:07 AM, Ross Mounce notifications@github.com wrote:

ok, I will need to create a regular expression to apply across all *.nwk files that will print the number of: A) empty tip labels

This isn't necessary. It can be done trivially from the NEXML using xpath:

//nexml:otu[normalize-space(.)='']

will do it. This is something that can be put into ami-phylo.
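As a rough illustration of how that XPath could be evaluated from Java, here is a sketch using only the JDK's javax.xml classes. The class name, the method name and the sample document are hypothetical, and the namespace URI bound to the nexml prefix (http://www.nexml.org/2009) is an assumption about the files in question. The predicate tests the element's text content, exactly as written above; if the labels live in a label attribute instead, the predicate would have to test that attribute.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EmptyOtuCounter {
    // ASSUMPTION: namespace URI used by the NEXML files.
    static final String NEXML_NS = "http://www.nexml.org/2009";

    // Counts otu elements whose text content is empty or whitespace only.
    static int countEmptyOtus(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the "nexml" prefix used in the expression to the namespace URI.
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "nexml".equals(prefix) ? NEXML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList hits = (NodeList) xpath.evaluate(
                    "//nexml:otu[normalize-space(.)='']", doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document: one labelled otu, one empty, one whitespace-only.
        String xml = "<nexml:nexml xmlns:nexml='" + NEXML_NS + "'><nexml:otus>"
                + "<nexml:otu id='t1'>Homo sapiens</nexml:otu>"
                + "<nexml:otu id='t2'/>"
                + "<nexml:otu id='t3'>  </nexml:otu>"
                + "</nexml:otus></nexml:nexml>";
        System.out.println(countEmptyOtus(xml));
    }
}
```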

B) non-empty tip labels that do not conform to the GenBank ID standard e.g. "A 123"

There is already a tool for running regexes in ami-phylo (MatchSpecies):

public void matchSpecies(HOCRReader hocrReader) {
    if (speciesPattern != null) {
        List<HtmlSpan> lines = hocrReader.getNonEmptyLines();
        for (HtmlSpan line : lines) {
            List<String> matchList = HOCRReader.matchPattern(line, speciesPattern);
            // lines with no match are logged with a "?? " prefix
            LOG.trace((matchList.size() == 0 ? "?? "+HOCRReader.getSpacedValue(line).toString() : matchList));
        }
    }
}

I will adapt this to work on otu, rather than direct HOCR output
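One way the adaptation might look, sketched on plain label strings rather than on ami-phylo types. OtuLabelMatcher, matchPattern and report are hypothetical names, and the binomial pattern in main is only a stand-in for speciesPattern, whose real definition is not shown in this thread. The report keeps the convention of the method above: labels with no match are prefixed with "?? ".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OtuLabelMatcher {
    // Collects every substring of the label that the pattern finds.
    static List<String> matchPattern(String label, Pattern pattern) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(label);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // One report line per otu label: the list of matches, or "?? " + label if none.
    static List<String> report(List<String> otuLabels, Pattern pattern) {
        List<String> lines = new ArrayList<>();
        for (String label : otuLabels) {
            List<String> matchList = matchPattern(label, pattern);
            lines.add(matchList.isEmpty() ? "?? " + label : matchList.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // ASSUMPTION: a crude "Genus species" pattern standing in for speciesPattern.
        Pattern binomial = Pattern.compile("[A-Z][a-z]+ [a-z]+");
        List<String> labels = Arrays.asList("Homo sapiens AB123456", "A 123"); // toy labels
        for (String line : report(labels, binomial)) {
            System.out.println(line);
        }
    }
}
```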


Peter Murray-Rust Reader in Molecular Informatics Unilever Centre, Dep. Of Chemistry University of Cambridge CB2 1EW, UK +44-1223-763069

rossmounce commented 9 years ago

I have calculated that the median number of missing/empty tip labels per file is 4. The mean is 6.8.

Minimum: 0 (297 files have zero empty tip labels)

Maximum: 199 (a rather corrupt-looking file: ijs.0.001149-0-003)
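For reference, summary figures like these can be derived from a list of per-file counts along the following lines. This is only a sketch: TipCountSummary is a hypothetical name, and the counts in main are toy values, not the actual per-file data.

```java
import java.util.Arrays;

class TipCountSummary {
    // Median of the counts; for an even number of files, the mean of the two middle values.
    static double median(int[] counts) {
        if (counts.length == 0) {
            throw new IllegalArgumentException("no counts given");
        }
        int[] sorted = counts.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Arithmetic mean; NaN for an empty array.
    static double mean(int[] counts) {
        return Arrays.stream(counts).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        int[] emptyTipsPerFile = {0, 0, 2, 5, 13}; // toy values for illustration only
        System.out.println("median=" + median(emptyTipsPerFile));
        System.out.println("mean=" + mean(emptyTipsPerFile));
        System.out.println("min=" + Arrays.stream(emptyTipsPerFile).min().getAsInt());
        System.out.println("max=" + Arrays.stream(emptyTipsPerFile).max().getAsInt());
    }
}
```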