Closed flass closed 3 years ago
@flass did you use dendropy or rapidnj for the tree? The latter will be much faster for a large dataset
it was using the default tool - I assumed it to be RapidNJ (was pretty fast to compute indeed).
The default is to use dendropy, and the changes are probably down to the dendropy's treatment of underscores, which is complex: https://dendropy.org/primer/taxa.html. I would add --rapidnj rapidnj
for faster tree building that I think will also preserve names (assuming rapidnj is on your path).
We should still take a look at the denropy behaviour - have you got a test set of ~10 sequences, with odd names, we could use @flass? @johnlees I think we need to add preserve_underscores=True
to dendropy.PhylogeneticDistanceMatrix.from_csv()
, I can do this as part of the changes to trees.py
today.
Ah do we not use that? Yes, we should add the flag. This will be addressed in #148 then (I think this is all from 'old school' phylogenetic formats being more inflexible with names and their lengths)
@nickjcroucher see the list of names below:
GCA_000387605.1_BJG-01_genomic
RKI-ZBS2-CH129_TACAGC_L002.contigs_spades
GCA_000154005.2_Vibrio_cholerae_623-39_V1_genomic
GCA_006803105.1_ASM680310v1_genomic
220011_4_C4_L003_R1.contigs_velvet
GCA_007623975.1_ASM762397v1_genomic
22776_8#168.contigs_spades
220875_4_C7_L002_R1.contigs_velvet
220871_8_C3_L003_R1.contigs_velvet
SRR7962186.contigs_spades
GCA_003716435.1_ASM371643v1_genomic
GCA_007624145.1_ASM762414v1_genomic
221749_4_C3_L003_R1.contigs_velvet
Fixed in 66ca2d2564d8c34753b261531fb73b14bc7a7784
Hi John, I've got a yet another minor bug to report here:
Versions I am using PopPUNK v2.3.0 with pp-sketchlib 1.6.2, as provided by a conda environment built with
Command used and output returned
Describe the bug When uploading viz output files to microreact, I realised there was something wrong as some genome had their name edited in the NJ tree Newick file.
the name of the genome is
RKI-ZBS2-CH129_TACAGC_L002.contigs_spades
but appears in950Vc_core_NJ.nwk
as:'RKI-ZBS2-CH129 TACAGC L002'
so with underscores replaced by spaces.The other output files (
.csv
and.dot
) have the correct spelling, even though in some it's edited as well to drop the.contigs_spades
suffix: in the950Vc_perplexity20.0_accessory_tsne.dot
file:in the
950Vc_grapetree_clusters.csv
and950Vc_microreact_clusters.csv
files:in the
950Vc_clusters.csv
file:So because of the difference between
950Vc_core_NJ.nwk
and950Vc_microreact_clusters.csv
it leads to a bug when when uploading to Microreact.Weirdly there are many other genomes that have underscores in their name but none other have been replaced. is this due to some specificity of that name that is wrongly parsed when getting rid of the name tail? (it's true it's got dashes and underscores and dots)
In this case it's only one name to correct so it's easy to deal with but I've had it before where many names more were missing/edited, so properly preventing me to enjoy the Microreact viz.
Best,
Florent