galaxyproject / training-material

A collection of Galaxy-related training material
https://training.galaxyproject.org
MIT License

MTB phylogenetics tutorial #3220

Closed cstritt closed 2 years ago

cstritt commented 2 years ago

This is a second tutorial for the planned Galaxy workshop on WGS of M. tuberculosis (see request #3211). It covers the interpretation and inference of phylogenetic trees.

hexylena commented 2 years ago

@cstritt @pvanheus do you think this fits into any of the existing GTN topics (https://training.galaxyproject.org/)? We try to avoid creating new topics for just a single tutorial when possible. Maybe visualisation? Sequence analysis feels very NGS-y, but we're trying to expand it, so maybe there?

pvanheus commented 2 years ago

> @cstritt @pvanheus do you think this fits into any of the existing GTN topics (https://training.galaxyproject.org/)? We try to avoid creating new topics for just a single tutorial when possible. Maybe visualisation? Sequence analysis feels very NGS-y, but we're trying to expand it, so maybe there?

So I see two issues here:

  1. the work here forms part of a theme - which is quite an exciting development and not particularly well catered for in the GTN repo. I.e. @cstritt et al probably have a workshop website that pulls together at least 3 Galaxy tutorials with other background into a coherent exploration of the topic (M. tuberculosis sequence analysis / bioinformatics). Does that mean there should be some tags to make this theme easier to follow?
  2. perhaps we need a phylogeny category? You and I have discussed SARS-CoV-2 phylogeny, now this is M. tuberculosis phylogeny - maybe a new category won't be on its own for long? On the other hand, where does the transmission analysis tutorial fit? Is there perhaps a larger category of "relatedness analysis" or "evolution" that is a better fit here?

cstritt commented 2 years ago

@hexylena, @pvanheus, many thanks for the helpful comments! I'll start working on them today. Regarding the category for the tutorial, I like the idea of an 'evolution' topic, as suggested by @pvanheus (there already is 'ecology'). The current topics don't really fit; I'd be surprised to find phylogenetics there...

hexylena commented 2 years ago

> perhaps we need a phylogeny category? You and I have discussed SARS-CoV-2 phylogeny, now this is M. tuberculosis phylogeny - maybe a new category won't be on its own for long? On the other hand, where does the transmission analysis tutorial fit? Is there perhaps a larger category of "relatedness analysis" or "evolution" that is a better fit here?

That can make sense to me. The thing we try and avoid is topics with a single tutorial, but with our discussed covid phylogeny, yeah, that makes more sense. Evolution it is.

pvanheus commented 2 years ago

Just one more thought here - there really is not much of a workflow for this tutorial because it follows on from previous work. It's not a stand-alone. I understand the desire to not make the "transmission" tutorial too long, but perhaps add a workflow that illustrates the process from VCF to phylogeny at least?

cstritt commented 2 years ago

> Just one more thought here - there really is not much of a workflow for this tutorial because it follows on from previous work. It's not a stand-alone. I understand the desire to not make the "transmission" tutorial too long, but perhaps add a workflow that illustrates the process from VCF to phylogeny at least?

At present the tutorial is conceived as part of a workshop, where other tutorials and webinars cover sequencing, SNP calling, etc. Thus the students will go from VCF to alignments in the clustering tutorial in the morning, and from there to the phylogeny in the afternoon. Maybe it would make sense to extend the tutorial into a standalone after the workshop?

pvanheus commented 2 years ago

> > Just one more thought here - there really is not much of a workflow for this tutorial because it follows on from previous work. It's not a stand-alone. I understand the desire to not make the "transmission" tutorial too long, but perhaps add a workflow that illustrates the process from VCF to phylogeny at least?
>
> At present the tutorial is conceived as part of a workshop, where other tutorials and webinars cover sequencing, SNP calling, etc. Thus the students will go from VCF to alignments in the clustering tutorial in the morning, and from there to the phylogeny in the afternoon. Maybe it would make sense to extend the tutorial into a standalone after the workshop?

Perhaps.

BTW, thinking about your workflow again, I realised that you don't address ascertainment bias. Perhaps constant sites can be computed in the previous tutorial and copied over to here? (snp_sites has a mode for computing constant sites; it's actually aimed at IQ-TREE's -fconst parameter. I'm not sure if RAxML has a direct equivalent.) As an example, here's a workflow that is similar to what is done in your set of tutorials but adds that constant-site calculation: https://galaxy.sanbi.ac.za/u/pvanheus/w/snippy-tb-sample-iqtree-015
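A rough sketch of the constant-site idea, for illustration only: the Python below counts the alignment columns in which every sequence carries the same base and joins the counts in A,C,G,T order, the comma-separated form that IQ-TREE's -fconst option is understood to take. The function name and the toy alignment are made up, the input is assumed to be a full (not SNP-only) alignment, and this is not a description of how snp_sites computes the counts internally.

```python
from collections import Counter


def count_constant_sites(sequences):
    """Return the number of constant columns per base, in A, C, G, T order.

    A column is constant when all sequences carry the same unambiguous base.
    """
    if not sequences:
        raise ValueError("alignment is empty")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter()
    for column in zip(*sequences):
        bases = {base.upper() for base in column}
        if len(bases) == 1:
            base = bases.pop()
            if base in "ACGT":  # skip columns made only of gaps or Ns
                counts[base] += 1
    return [counts[base] for base in "ACGT"]


# Hypothetical toy alignment: columns 3 and 8 (1-based) are variable.
alignment = [
    "ACGTACGTAA",
    "ACGTACGAAA",
    "ACCTACGTAA",
]

constant_counts = count_constant_sites(alignment)
fconst_argument = ",".join(str(n) for n in constant_counts)
print(fconst_argument)
```

The resulting string is what would be handed to the tree-building step alongside the SNP-only alignment, so that the model knows how many invariant sites were left out.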

cstritt commented 2 years ago

I just realized that the ape library is not available in RStudio on Galaxy. Would it be possible to install it?

shiltemann commented 2 years ago

> I just realized that the ape library is not available in RStudio on Galaxy. Would it be possible to install it?

Users can install libraries as needed in RStudio in Galaxy. That said, if this would take too much time, for example, we can look into changing the base image to include the library.

cstritt commented 2 years ago

> Users can install libraries as needed in RStudio in Galaxy. That said, if this would take too much time, for example, we can look into changing the base image to include the library.

install.packages("ape") crashes with:

/bin/sh: 1: x86_64-conda-linux-gnu-cc: not found
make: *** [/opt/miniconda/lib/R/etc/Makeconf:170: BIONJ.o] Error 127
ERROR: compilation failed for package ‘ape’

cstritt commented 2 years ago

> BTW, thinking about your workflow again, I realised that you don't address ascertainment bias. Perhaps constant sites can be computed in the previous tutorial and copied over to here? (snp_sites has a mode for computing constant sites; it's actually aimed at IQ-TREE's -fconst parameter. I'm not sure if RAxML has a direct equivalent.) As an example, here's a workflow that is similar to what is done in your set of tutorials but adds that constant-site calculation: https://galaxy.sanbi.ac.za/u/pvanheus/w/snippy-tb-sample-iqtree-015

@pvanheus, this was indeed a weighty omission. I now address it in the alignment part, and I added a section at the end about rescaling and dating the tree. I use the approach "rescaled branch length = (branch length * alignment length) / genome size", and ask in the exercise what the problem could be with assuming that sites not present in the SNP alignment are invariant.
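The rescaling formula above can be sketched in a few lines of Python. The numbers are illustrative assumptions, not values from the tutorial: a SNP alignment of 1,000 sites, two made-up branch lengths, and the 4,411,532 bp length of the H37Rv reference genome as the genome size.

```python
def rescale_branch_length(branch_length, alignment_length, genome_size):
    """Convert substitutions per SNP-alignment site into substitutions per genome site.

    Implements: rescaled = (branch_length * alignment_length) / genome_size
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return branch_length * alignment_length / genome_size


H37RV_GENOME_SIZE = 4_411_532   # bp, M. tuberculosis H37Rv reference genome
snp_alignment_length = 1_000    # hypothetical number of sites in the SNP alignment
branch_lengths = [0.05, 0.002]  # hypothetical branch lengths from the SNP-only tree

rescaled = [
    rescale_branch_length(length, snp_alignment_length, H37RV_GENOME_SIZE)
    for length in branch_lengths
]

for original, new in zip(branch_lengths, rescaled):
    print(f"{original} substitutions/SNP site -> {new:.3e} substitutions/genome site")
```

The division by the full genome size is where the assumption enters: every position outside the SNP alignment is treated as invariant, which is exactly what the exercise asks students to question.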

pvanheus commented 2 years ago

On the linting errors:

  1. The link error is waiting for the TB transmission tutorial to be merged, so hopefully that can be merged soon
  2. The tag issue... we said there needs to be an evolution category, right? What is involved in making such a thing, @hexylena?

shiltemann commented 2 years ago

@pvanheus, to create a new topic: https://training.galaxyproject.org/training-material/topics/contributing/tutorials/create-new-topic/tutorial.html

(and I am realising I forgot to add instructions for the faq folder there, but I can help too)

cstritt commented 2 years ago

So the only thing that remains to be done on our side is to create the 'evolution' topic and move both tutorials there, right? As far as I can see, this would only involve renaming the existing folder ('phylogenetics') and modifying the corresponding metadata.yml. I'm not sure, though, how both tutorials can be moved there, given that they are both in open pull requests.

shiltemann commented 2 years ago

@cstritt yes, @hexylena and I will deal with the renaming and moving this morning. We will merge them as draft tutorials, so that they will be accessible for your course next week, and afterwards we can polish the last things.

(We have been thinking for a while already about renaming the metagenomics topic to "microbial analysis", so then it could fit there as well)

shiltemann commented 2 years ago

> > Users can install libraries as needed in Rstudio in Galaxy. That said, if this would e.g. take too much time we can look into changing the base image to include the library.
>
> install.packages("ape") crashes with:
>
> /bin/sh: 1: x86_64-conda-linux-gnu-cc: not found
> make: *** [/opt/miniconda/lib/R/etc/Makeconf:170: BIONJ.o] Error 127
> ERROR: compilation failed for package ‘ape’
> * removing ‘/opt/miniconda/lib/R/library/ape’

@cstritt You might be able to install it via conda (using the terminal tab in RStudio). I'm testing it now and will add it to the instructions in the tutorial if it works :+1:

shiltemann commented 2 years ago

ok @cstritt, it appears to work if you install via conda :+1: It does give a warning that the package was built with R 4.1.2 while RStudio runs 4.1.0. It probably won't be a problem, but it may be good to test.

I will merge this now

shiltemann commented 2 years ago

@cstritt here are the links to your tutorials:

https://training.galaxyproject.org/training-material/topics/evolution/tutorials/mtb_transmission/tutorial.html
https://training.galaxyproject.org/training-material/topics/evolution/tutorials/mtb_phylogeny/tutorial.html

(I've also put them on the course program page)

cstritt commented 2 years ago

Excellent, thanks a lot for the great support!