biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

phylophlan_strain_finder - wrong threshold arguments #17

Closed VadimDu closed 4 years ago

VadimDu commented 4 years ago

Dear Francesco,

First, thank you for this very useful and configurable tool! Great job. I ran into a "Namespace" error when I tried to use the phylophlan_strain_finder tool:

  File "/urigo/vadimd/conda_phylophlan3/lib/python3.8/site-packages/phylophlan/phylophlan_strain_finder.py", line 168, in phylophlan_strain_finder
    check_params(args, args.verbose)
  File "/urigo/vadimd/conda_phylophlan3/lib/python3.8/site-packages/phylophlan/phylophlan_strain_finder.py", line 114, in check_params
    if args.p_threshold < 0.0:
AttributeError: 'Namespace' object has no attribute 'p_threshold'

In the "Finding strains in trees" part of the documentation, you wrote that the thresholds can be tuned using --phylo_thr and --mutrate_thr; however, under "phylophlan_strain_finder.py" two different arguments are listed instead: --p_threshold P_THRESHOLD and --m_threshold M_THRESHOLD. I checked phylophlan_strain_finder.py: you added --phylo_thr and --mutrate_thr as argparse arguments, but the check_params function checks for:

if args.p_threshold < 0.0:
        error('p_threshold should be a positive number', exit=True)
if args.m_threshold < 0.0:
        error('m_threshold should be a positive number', exit=True)

These attributes are not defined, hence the error. I replaced --phylo_thr and --mutrate_thr in argparse with these two arguments, and the script seems to work fine with the default thresholds of 0.05.
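For illustration, the mismatch described above can be reproduced with a minimal argparse sketch. argparse names the parsed attribute after the long option (`--phylo_thr` becomes `args.phylo_thr`) unless `dest` is given. The parser below is a hypothetical stand-in, not the actual phylophlan_strain_finder code, and using `dest` is only one possible way to reconcile the names; it is not necessarily the fix applied upstream.

```python
import argparse


def build_parser(use_dest):
    """Hypothetical stand-in parser, not the real phylophlan_strain_finder one."""
    p = argparse.ArgumentParser()
    # With dest=..., values are stored as args.p_threshold / args.m_threshold.
    # Without it, argparse derives the names from the flags:
    # args.phylo_thr / args.mutrate_thr.
    extra_p = {'dest': 'p_threshold'} if use_dest else {}
    extra_m = {'dest': 'm_threshold'} if use_dest else {}
    p.add_argument('--phylo_thr', type=float, default=0.05, **extra_p)
    p.add_argument('--mutrate_thr', type=float, default=0.05, **extra_m)
    return p


def check_params(args):
    """Mirrors the checks quoted above; raises instead of calling error()."""
    if args.p_threshold < 0.0:
        raise ValueError('p_threshold should be a positive number')
    if args.m_threshold < 0.0:
        raise ValueError('m_threshold should be a positive number')


# Pass explicit argument lists so the sketch does not depend on sys.argv.
broken = build_parser(use_dest=False).parse_args([])
fixed = build_parser(use_dest=True).parse_args(['--phylo_thr', '0.1'])

try:
    check_params(broken)
    broken_raised = False
except AttributeError:
    # 'Namespace' object has no attribute 'p_threshold'
    broken_raised = True

check_params(fixed)  # passes: the attributes exist under the expected names
```

Either renaming the flags (as done in the workaround above) or keeping the flags and setting `dest` makes the attribute names agree with what check_params reads.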

In addition, I wanted to ask how you recommend reading and interpreting the output table from this script. It is not easy to read in its current format.

Thank you Vadimd

fasnicar commented 4 years ago

Dear Vadimd,

Thank you very much for reporting this. I fixed the params in the code and will update the conda package in the coming weeks.

About the output interpretation, your output file should be something like the following:

#phylogenetic_threshold 0.05
#mutation_rate_threshold    0.05
#total_branch_length    ###
#subtree    min_dist    mean_dist   max_dist    min_mut mean_mut    max_mut distances   mutation_rates

Where:

  subtree: the subtree in Newick format with all leaves within the two thresholds
  min_dist: the minimum phylogenetic distance for that subtree
  mean_dist: the average phylogenetic distance for that subtree
  max_dist: the maximum phylogenetic distance for that subtree
  min_mut: the minimum mutation rate for that subtree
  mean_mut: the average mutation rate for that subtree
  max_mut: the maximum mutation rate for that subtree
  distances: all phylogenetic distances in that subtree
  mutation_rates: all mutation rates in that subtree
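As a rough aid for reading such a file, here is a minimal parsing sketch. It assumes the layout shown above: `#`-prefixed header lines, tab-separated fields, and one data row per subtree. The function name `read_strain_finder_table` and the sample row are made up for illustration. The internal format of the `distances` and `mutation_rates` columns is not specified here, so they are kept as raw strings.

```python
import io


def read_strain_finder_table(handle, sep='\t'):
    """Return (metadata, rows) from a file laid out like the header above."""
    metadata, columns, rows = {}, None, []
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            continue
        if line.startswith('#'):
            fields = line[1:].split(sep)
            if fields[0] == 'subtree':
                columns = fields                 # the column header line
            elif len(fields) == 2:
                metadata[fields[0]] = fields[1]  # e.g. phylogenetic_threshold
            continue
        if columns is None:
            raise ValueError('data row found before the #subtree header line')
        # One dict per subtree; all values are left as strings.
        rows.append(dict(zip(columns, line.split(sep))))
    return metadata, rows


# Made-up sample in the assumed layout (not real PhyloPhlAn output).
sample = (
    '#phylogenetic_threshold\t0.05\n'
    '#mutation_rate_threshold\t0.05\n'
    '#subtree\tmin_dist\tmean_dist\tmax_dist\tmin_mut\tmean_mut\tmax_mut'
    '\tdistances\tmutation_rates\n'
    '(A:0.01,B:0.02);\t0.03\t0.03\t0.03\t0.01\t0.01\t0.01\t0.03\t0.01\n'
)
metadata, rows = read_strain_finder_table(io.StringIO(sample))
```

Each entry in `rows` can then be inspected by column name, for example `rows[0]['max_dist']`.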

Please let me know if I can help you with anything else.

Many thanks, Francesco

VadimDu commented 4 years ago

Dear Francesco,

Thank you for the quick response and handling of the issue.

I appreciate your explanation of how to interpret the strain finder output table, but I might still be missing something. I assumed the output would contain the phylogenetic distances and mutation rates between nodes (our genomes) in each subtree, to help decide whether each subtree represents a different strain. My output file has only one very long row in Newick format (besides the file headers), separated by commas and pipes, without any of the result columns you mentioned. According to your code, the separator in the output should be '\t' by default:

    p.add_argument('-s', '--separator', type=str, default='\t', choices=OUTPUT_EXTENSIONS.keys(), help='Specify the separator to use in the output')

However, running phylophlan_strain_finder shows this message regarding the separator:

    -s {;,,, }, --separator {;,,, } Specify the separator to use in the output (default: )

Only the headers in the output file (#subtree"\t"min_dist"\t"mean_dist...) are tab separated. Could there be some inconsistency in the representation?
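One quick way to check for this kind of inconsistency is to compare the number of separator-delimited fields in each data row with the number of columns in the `#subtree` header line. The sketch below assumes tab as the separator; the helper name `field_counts` and the sample lines are hypothetical, not taken from the actual output file.

```python
def field_counts(lines, sep='\t'):
    """Return (expected, mismatches) for non-comment lines.

    expected is the number of columns in the '#subtree' header line;
    mismatches is a list of (line_number, n_fields) for data rows whose
    field count differs from it.
    """
    expected = None
    mismatches = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if line.startswith('#subtree'):
            expected = len(line.split(sep))
        elif line and not line.startswith('#'):
            n_fields = len(line.split(sep))
            if expected is not None and n_fields != expected:
                mismatches.append((number, n_fields))
    return expected, mismatches


# Made-up example: a tab-separated header followed by a row with no tabs.
lines = [
    '#subtree\tmin_dist\tmean_dist\tmax_dist\tmin_mut\tmean_mut\tmax_mut'
    '\tdistances\tmutation_rates\n',
    '(A:0.01,B:0.02)|(C:0.01,D:0.02)\n',
]
expected, mismatches = field_counts(lines)
```

A non-empty `mismatches` list would indicate rows that do not follow the header layout.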

Thanks a lot again, Vadimd

fasnicar commented 4 years ago

Dear Vadimd,

This is strange. To better understand your issue can I ask you to do the following:

  1. run phylophlan_metagenomic again with the --verbose option, save the output to a log file, and attach it here
  2. also attach the generated output file here

Many thanks, Francesco

VadimDu commented 4 years ago

Hi Francesco, sure, no problem. I guess you meant phylophlan_strain_finder (not phylophlan_metagenomic); here is the output from the command with --verbose:

phylophlan_strain_finder.py version 3.0.8 (8 May 2020)
Command line: /urigo/vadimd/conda_phylophlan3/bin/phylophlan_strain_finder --input phylophlan3_output/RAxML_bestTree.Ecoli_MAG_isolates_good_quality_all_datasets_n450_refined.tre --mutation_rates phylophlan3_output/mutation_rates.tsv --output phylophlan3_output/Ecoli_MAG_isolates_good_quality_all_dataset_n450_UniRef90_95core_strain_finder --verbose
Checking for parameters...
Arguments: {'input': 'phylophlan3_output/RAxML_bestTree.Ecoli_MAG_isolates_good_quality_all_datasets_n450_refined.tre', 'mutation_rates': 'phylophlan3_output/mutation_rates.tsv', 'p_threshold': 0.05, 'm_threshold': 0.05, 'tree_format': 'newick', 'output': 'phylophlan3_output/Ecoli_MAG_isolates_good_quality_all_dataset_n450_UniRef90_95core_strain_finder', 'overwrite': False, 'separator': '\t', 'verbose': True}
Reading mutation_rates table...
Root reached, return Clade as root of the subtree
Root reached, return GCF008082325.1_isolate_WGS as root of the subtree
Root reached, return M198_MAG_assembly as root of the subtree
Creating output...

I will send you the output file with the results by email, if that's OK.

Thanks a lot, Dani

fasnicar commented 4 years ago

Dear Dani, yes I meant phylophlan_strain_finder and not phylophlan_metagenomic, sorry. Yes, the output file by email is fine, thank you.