genome name changed in output NJ_tree

flass commented 3 years ago

Hi John, I've got a yet another minor bug to report here:

Versions I am using PopPUNK v2.3.0 with pp-sketchlib 1.6.2, as provided by a conda environment built with

conda create -n poppunk230 -c defaults -c conda-forge -c bioconda poppunk==2.3.0 pp-sketchlib==1.6.2 graph-tool==2.37

Command used and output returned

poppunk_visualise --ref-db /lustre/scratch118/infgen/team216/fl4/poppunk_7kVc/950Vc --output 950Vc --threads 8 \
 --microreact --grapetree

Describe the bug When uploading viz output files to microreact, I realised there was something wrong as some genome had their name edited in the NJ tree Newick file.

the name of the genome is RKI-ZBS2-CH129_TACAGC_L002.contigs_spades but appears in 950Vc_core_NJ.nwk as: 'RKI-ZBS2-CH129 TACAGC L002' so with underscores replaced by spaces.

The other output files (.csv and .dot) have the correct spelling, even though in some it's edited as well to drop the .contigs_spades suffix: in the 950Vc_perplexity20.0_accessory_tsne.dot file:

... "RKI-ZBS2-CH129_TACAGC_L002"[x=24.0984845161438,y=199.08136367797852]; ...

in the 950Vc_grapetree_clusters.csv and 950Vc_microreact_clusters.csv files:

RKI-ZBS2-CH129_TACAGC_L002,96

in the 950Vc_clusters.csv file:

RKI-ZBS2-CH129_TACAGC_L002.contigs_spades,96

So because of the difference between 950Vc_core_NJ.nwk and 950Vc_microreact_clusters.csv it leads to a bug when when uploading to Microreact.

Weirdly there are many other genomes that have underscores in their name but none other have been replaced. is this due to some specificity of that name that is wrongly parsed when getting rid of the name tail? (it's true it's got dashes and underscores and dots)

In this case it's only one name to correct so it's easy to deal with but I've had it before where many names more were missing/edited, so properly preventing me to enjoy the Microreact viz.

Best,

Florent

nickjcroucher commented 3 years ago

@flass did you use dendropy or rapidnj for the tree? The latter will be much faster for a large dataset

flass commented 3 years ago

it was using the default tool - I assumed it to be RapidNJ (was pretty fast to compute indeed).

nickjcroucher commented 3 years ago

The default is to use dendropy, and the changes are probably down to the dendropy's treatment of underscores, which is complex: https://dendropy.org/primer/taxa.html. I would add --rapidnj rapidnj for faster tree building that I think will also preserve names (assuming rapidnj is on your path).

We should still take a look at the denropy behaviour - have you got a test set of ~10 sequences, with odd names, we could use @flass? @johnlees I think we need to add preserve_underscores=True to dendropy.PhylogeneticDistanceMatrix.from_csv(), I can do this as part of the changes to trees.py today.

johnlees commented 3 years ago

Ah do we not use that? Yes, we should add the flag. This will be addressed in #148 then (I think this is all from 'old school' phylogenetic formats being more inflexible with names and their lengths)

flass commented 3 years ago

@nickjcroucher see the list of names below:

GCA_000387605.1_BJG-01_genomic
RKI-ZBS2-CH129_TACAGC_L002.contigs_spades
GCA_000154005.2_Vibrio_cholerae_623-39_V1_genomic
GCA_006803105.1_ASM680310v1_genomic
220011_4_C4_L003_R1.contigs_velvet
GCA_007623975.1_ASM762397v1_genomic
22776_8#168.contigs_spades
220875_4_C7_L002_R1.contigs_velvet
220871_8_C3_L003_R1.contigs_velvet
SRR7962186.contigs_spades
GCA_003716435.1_ASM371643v1_genomic
GCA_007624145.1_ASM762414v1_genomic
221749_4_C3_L003_R1.contigs_velvet

johnlees commented 3 years ago

Fixed in 66ca2d2564d8c34753b261531fb73b14bc7a7784

bacpop / PopPUNK

genome name changed in output NJ_tree #159