ContentMine / phylotree

A repository for ami-phylotree development

preprocessing of newick for STK #19

Open petermr opened 9 years ago

petermr commented 9 years ago

Ross Mounce wrote:

##Pre-processing steps

#Put all Newick trees into one file, one per line
for i in *.nwk ; do cat "$i" >> testree.tre ; sed -i -e '$a\' testree.tre ; done
#Remove Newick strings that are 70 characters or less. This selects for trees containing four labelled taxa or more
awk 'length($0)>70' testree.tre > filteredtrees.tre
#Flatten from unicode to ascii text
iconv -f utf-8 -t ascii//translit filteredtrees.tre -o asciitrees.tre
# grep -P '[^\x00-\x7f]' filteredtrees.tre to see the unicode chars
#Remove hyphens
sed -i 's/-//g' asciitrees.tre
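# stk/p4 does not like taxa with labels solely composed of numbers
# (these patterns match labels of exactly six or eight digits and replace the leading digit with C)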
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
# stk/p4 does not like unmatched ' symbols
sed -i 's/'\''//g' asciitrees.tre
# stk/p4 does not like unmatched " symbols
sed -i 's/[\"]//g' asciitrees.tre
# stk/p4 does not like / symbols
# stk/p4 does not like taxa beginning with . symbol
# stk/p4 does not like taxa beginning with , symbol

I'll comment with suggested actions

petermr commented 9 years ago
#Remove Newick strings that are 70 characters or less. This selects for trees containing four labelled taxa or more
awk 'length($0)>70' testree.tre > filteredtrees.tre

This can be done precisely with XPath:

count(//otu[not(normalize-space(.)='')])

It's very important to record this in an audit trail as it alters the number of trees used.
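
A minimal Python sketch of the same count, with an audit trail of the rejected trees. It assumes each tree is available as XML with plain, un-namespaced `otu` elements whose text is the label; the function names and the `min_taxa` parameter are hypothetical, not part of ami-phylo. The whitespace test only roughly mirrors `normalize-space`.

```python
import xml.etree.ElementTree as ET


def count_labelled_otus(xml_text):
    """Count <otu> elements whose text content is non-empty after trimming."""
    root = ET.fromstring(xml_text)
    return sum(1 for otu in root.iter("otu") if "".join(otu.itertext()).strip())


def filter_trees(trees, min_taxa=4):
    """Split {name: xml} into kept trees and an audit list of rejected ones."""
    kept, audit = {}, []
    for name, xml_text in trees.items():
        n = count_labelled_otus(xml_text)
        if n >= min_taxa:
            kept[name] = xml_text
        else:
            # Record why the tree was dropped so the tree count can be traced.
            audit.append((name, n, f"fewer than {min_taxa} labelled taxa"))
    return kept, audit
```

Unlike a length cut-off at 70 characters, this counts labels directly, so a tree with a few very long labels is not kept by accident.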

petermr commented 9 years ago
#Flatten from unicode to ascii text
iconv -f utf-8 -t ascii//translit filteredtrees.tre -o asciitrees.tre
# grep -P '[^\x00-\x7f]' filteredtrees.tre to see the unicode chars

If trees require Unicode flattening, this could/should be an option in AMI. It could be made the default.
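
As a rough illustration of what such an option might do, here is a sketch using Python's standard `unicodedata` module. It is not equivalent to `iconv //translit`: characters with no decomposition (smart quotes, for example) are dropped rather than transliterated, so the offending characters are listed first for the audit trail. The function names are hypothetical.

```python
import unicodedata


def non_ascii_chars(text):
    """Return the sorted set of non-ASCII characters in text (cf. the grep -P check)."""
    return sorted({ch for ch in text if ord(ch) > 0x7F})


def flatten_to_ascii(text):
    """Decompose accented characters, then drop anything still outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Calling `non_ascii_chars` before `flatten_to_ascii` shows what will be changed or lost in a given label.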

petermr commented 9 years ago
#Remove hyphens
sed -i 's/-//g' asciitrees.tre

Note that some strain names contain a "-", so removing every hyphen alters those labels.

# stk/p4 does not like taxa with labels solely composed of numbers
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre
sed -i 's/\([(,]\)[0-9]\([0-9][0-9][0-9][0-9][0-9][0-9][0-9]:\)/\1C\2/g' asciitrees.tre

Is this an STK restriction or a Newick restriction? We can, if necessary, have dialect flags on ami-phylo output of Newick.
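
A dialect flag of this kind could be sketched as below. This is an assumption about how such a flag might look, not ami-phylo's actual interface; `numeric_safe_label`, `prefix` and `enabled` are hypothetical names. Note that the quoted sed expressions replace the leading digit with `C` and only match labels of exactly six or eight digits, whereas this sketch prefixes the label and keeps every digit.

```python
import re


def numeric_safe_label(label, prefix="C", enabled=True):
    """Prefix labels made solely of ASCII digits when the dialect flag is on."""
    if enabled and re.fullmatch(r"[0-9]+", label):
        return prefix + label
    return label
```

Because it works on individual labels rather than on the serialised Newick string, it cannot accidentally match branch lengths or support values.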

# stk/p4 does not like unmatched ' symbols
sed -i 's/'\''//g' asciitrees.tre

I am removing the "smart quotes" so no ' or smart quote should be emitted from ami.

# stk/p4 does not like unmatched " symbols
sed -i 's/[\"]//g' asciitrees.tre

These should be forbidden symbols.

# stk/p4 does not like / symbols
# stk/p4 does not like taxa beginning with . symbol
# stk/p4 does not like taxa beginning with , symbol

The approach will be to require ami-phylo only to output "valid" Newick. I am about to post my validator/correctors.
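
For illustration only, a label-level check covering the restrictions listed in this thread might look like the sketch below. It is not the validator/corrector referred to above; the rule set is taken from the quoted comments, and `label_problems` and `FORBIDDEN_CHARS` are hypothetical names.

```python
import re

# Characters the thread says stk/p4 rejects, plus the "smart quote" variants.
FORBIDDEN_CHARS = set("'\"/\u2018\u2019\u201c\u201d")


def label_problems(label):
    """Return a list of human-readable problems found in one tip label."""
    problems = []
    bad = sorted(set(label) & FORBIDDEN_CHARS)
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    if label[:1] in (".", ","):
        problems.append(f"starts with '{label[0]}'")
    if re.fullmatch(r"[0-9]+", label):
        problems.append("solely numeric")
    return problems
```

Reporting problems per label, instead of silently rewriting the file, keeps the decision about how to correct each one visible.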

petermr commented 9 years ago

Looking at your concatenated tree, there are clearly a lot of garbles, mainly "O" for "0" in tip labels. Can you produce a schematic workflow, or a list of all the potential errors, for ami-phylo that will allow us to trap them at appropriate stages? (sed-ing the Newick file is bound to lead to problems in the medium, if not the short, term.) It's analogous to parsing bad HTML with regexes (see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 ).
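
One such trap, applied to extracted tip labels rather than to the Newick file, could be sketched like this. The heuristic is an assumption: it flags a capital "O" that touches a digit and only suggests a fix, because genuine names (a serotype such as O157, for example) would be false positives.

```python
import re

# Hypothetical heuristic: a capital "O" next to a digit is a likely OCR slip for "0".
SUSPECT_O = re.compile(r"(?<=[0-9])O|O(?=[0-9])")


def suspect_zero_garbles(labels):
    """Return (label, suggested_fix) pairs for labels with an 'O' adjacent to a digit."""
    return [
        (label, SUSPECT_O.sub("0", label))
        for label in labels
        if SUSPECT_O.search(label)
    ]
```

The suggestions are meant for review or logging, not for automatic substitution.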

petermr commented 9 years ago

Are branch-lengths with real numbers (i.e. with decimal points) errors in Newick, or simply a variation between programs? Many programs seem to accept them.

petermr commented 9 years ago

http://www.nexml.org/nexml/ has a "validator" (actually a validating converter, I think). It's probably a good community effort at validation.

Do the R packages accept nexml? That could be the simplest way to make sure we are working with valid material.

petermr commented 9 years ago

Ross Mounce: From 274 source trees it created an MRP matrix with dimensions ntax = 3734, nchar = 3798. Assuming ~15 taxa per input tree, there are perhaps ~4110 taxon labels across all 274 trees, so a reduction to just 3734 suggests that there is very little overlap (mostly because GenBank IDs != species, and OCR errors create different taxa). The more trees, and the more taxon label standardisation/consolidation, the better it will be. I will work on a script to calculate taxon occurrence/overlap over all trees in a set.

Singleton taxa that appear in only one tree just bloat the analysis. We may well want to remove these unicorns.

PMR: This suggests we should be comparing taxa (not IDs) as soon as possible. It may be valuable to sort taxa alphabetically and look for near misses.
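
A taxon occurrence count of the kind described could be sketched as follows. This is not the script mentioned above; it assumes the tip labels of each tree have already been extracted into a mapping from tree name to labels, and the function names are hypothetical.

```python
from collections import Counter


def taxon_occurrence(trees):
    """Count, for each taxon label, the number of trees it appears in.

    trees: mapping of tree name -> iterable of tip labels.
    """
    counts = Counter()
    for labels in trees.values():
        counts.update(set(labels))  # set(): count a label once per tree
    return counts


def singletons(counts):
    """Return the labels that occur in exactly one tree, sorted."""
    return sorted(taxon for taxon, n in counts.items() if n == 1)
```

For scale, the estimate in the comment above is 274 × 15 = 4110 labels against 3734 distinct taxa in the matrix, a difference of only 376.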
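
The sort-and-compare idea can be sketched with the standard `difflib` module. The similarity threshold of 0.9 is an arbitrary assumption, and `near_misses` is a hypothetical name.

```python
from difflib import SequenceMatcher


def near_misses(labels, threshold=0.9):
    """Sort distinct labels and report adjacent pairs that are nearly identical."""
    ordered = sorted(set(labels))
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs
```

Comparing only alphabetical neighbours is cheap but misses garbles in the first character, since those labels sort far apart; an all-pairs comparison would catch them at a higher cost.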