BDI-pathogens / phyloscanner

Phylogenetics between and within hosts at once, all along the genome.
GNU General Public License v3.0

PhyloscannerR error: Some trees have duplicate IDs #49

Closed damientully closed 5 years ago

damientully commented 5 years ago

Hi Matthew,

I just upgraded to v1.8.0 and seem to be getting an issue when running phyloscannerR. The trees were built with the phyloscanner_make_trees.py script and it seems to be complaining about trees having duplicate IDs.

Any suggestions? Thanks, Damien

Warning: package ‘argparse’ was built under R version 3.4.4
Warning: package ‘scales’ was built under R version 3.4.4
Warning: package ‘readr’ was built under R version 3.4.4
Random number seed is 246884906 
Initialising...
Error in initialise.phyloscanner(tree.file.directory, tree.file.regex,  : 
  Some trees have duplicate IDs.
Calls: phyloscanner.analyse.trees -> initialise.phyloscanner
Execution halted

mdhall272 commented 5 years ago

Hi Damien,

Any chance you could send the command line and a minimal set of files for which this happens? It's not happening for me.

Thanks, Matthew

damientully commented 5 years ago

Just sent you an email with everything.

mdhall272 commented 5 years ago

Blocked by Oxford IT due to "detection of malicious content". Dropbox maybe?

mdhall272 commented 5 years ago

Hi Damien,

This is due to a change in how window coordinates in file names are handled. Previously, everything after the RAxML_InWindow. was used as the ID; now it looks for a string of the form XXXX_to_YYYY in all circumstances. Because that folder contains many bootstrapped trees from the same window, it sees duplicate IDs.
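To illustrate the behaviour described above, here is a minimal Python sketch of how extracting an XXXX_to_YYYY string from each file name makes bootstrap replicates of one window collide on the same ID. This is not phyloscanner's actual code: the regular expression, the function names and the example file names are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: window coordinates appear as <start>_to_<end> anywhere in the name.
WINDOW_RE = re.compile(r"(\d+)_to_(\d+)")


def window_id(file_name):
    """Return 'start_to_end' for the first coordinate match, or None if absent."""
    match = WINDOW_RE.search(file_name)
    return f"{match.group(1)}_to_{match.group(2)}" if match else None


def duplicate_window_ids(file_names):
    """Map each window ID shared by more than one file to the files that share it."""
    by_id = defaultdict(list)
    for name in file_names:
        wid = window_id(name)
        if wid is not None:
            by_id[wid].append(name)
    return {wid: names for wid, names in by_id.items() if len(names) > 1}


# Hypothetical file names, for illustration only.
files = [
    "RAxML_bestTree.InWindow_1_to_300_bootstrap_1.tree",
    "RAxML_bestTree.InWindow_1_to_300_bootstrap_2.tree",
    "RAxML_bestTree.InWindow_301_to_600.tree",
]
duplicates = duplicate_window_ids(files)
print(duplicates)
```

Running a check like this on the tree folder before the analysis shows which windows have more than one tree file and would therefore trigger the error.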

I think the old version would have included the bootstraps in the run if you'd used it on these files. Is that what you want? If not, I'd just remove those files from the folder. If it is, then I think I'd run the script separately on each window (i.e. use e.g. RAxMLfiles/RAxML_result.InWindow_1_to_300_bootstrap_ as the tree file argument).

It may be that I should rethink how input files are detected again, in the medium term.

damientully commented 5 years ago

Was there any particular reason for the change? If your input is a set of bootstraps or posterior samples constructed from a single window alignment, the current version can't handle it. Also, if I run the script separately on each window, no pairwise relationship summary or summary plot is produced, as these require multiple windows, but the --allClassifications and --collapsedTrees options will get me the between-host relationships for each tree.

Thanks!

mdhall272 commented 5 years ago

The reason was that I wanted to be able to detect window coordinates more flexibly if they existed. I had a request to allow a directory to be used as input, and in this case we need to be able to find coordinates in entire file names, not rely on them making up the suffix.

But it should work - I don't mean run on a single tree, just don't run it on an entire folder containing multiple sets of bootstraps from multiple trees. Is that what you wanted to do? I'm not sure I'd recommend it as it will make some outputs rather peculiar and I'd worry about misinterpretation.

I just fixed another bug, but if you pull, then:

./phyloscanner_analyse_trees.R RAxMLfiles/RAxML_bestTree.InWindow_1_to_300_bootstrap_ OR_all s,20 --allClassifications --collapsedTrees -m 1e-5 -rda -sdt 0.05 -rcm -og EF108306

will run on all the bootstraps from the first window, and you can then loop over all the windows.
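The loop over windows could be sketched in Python as below. The options are copied from the command above; the list of window coordinates and the helper name command_for_window are hypothetical, and the sketch only builds and prints the commands rather than running them. The OR_all argument is passed unchanged, since the thread does not say whether it should vary per window.

```python
import shlex

# Options copied verbatim from the command in this thread.
OPTIONS = [
    "OR_all", "s,20", "--allClassifications", "--collapsedTrees",
    "-m", "1e-5", "-rda", "-sdt", "0.05", "-rcm", "-og", "EF108306",
]


def command_for_window(start, end,
                       script="./phyloscanner_analyse_trees.R",
                       tree_dir="RAxMLfiles"):
    """Build the argument list for one window's set of bootstrap trees."""
    prefix = f"{tree_dir}/RAxML_bestTree.InWindow_{start}_to_{end}_bootstrap_"
    return [script, prefix, *OPTIONS]


# Hypothetical window coordinates; replace with the windows actually present.
windows = [(1, 300), (301, 600)]
commands = [command_for_window(start, end) for start, end in windows]

for cmd in commands:
    # Print a shell-safe command line; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Printing first makes it easy to check each tree file prefix against the files on disk before anything is executed.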

damientully commented 5 years ago

Yes that makes sense.

That works now on all the bootstraps.

Thanks again Matthew!