Closed KatherineCaley closed 9 months ago
tardigrade 18s rRNA AM088141.txt
@khiron I think plotly is required in the docker image
@khiron and @fredjaya, this content is ready to be tried out in the container if you have time ☺️
feel free to either edit/dd/change/clarify the issue directly or post a comment with feedback on what sucks and I can implement it 🤝
@khiron I think plotly is required in the docker image
I was not installing "cogent3[extras]". fixed now
@khiron did you have any comments on this content?
Draft content for block 3 of the workshop.
This is the superset of issues #27, #28, and #29.
Data
Cogent3_Workshop_2023
folder.Please open up a blank notebook in your docker container and set the data directory as follows. Note, that if you are using a different directory structure within Docker, you will need to change the path here accordingly.
Now, we are ready to go!
The road from orthologs to alignment 🛣️🚙
The original vector used in the above image was created by Elena Akifeva.
Curating data is like navigating a treacherous road. In this section we will show strategies for data exploration, and show how to use the outcomes of data exploration for the construction of well-informed data-wrangling pipelines, using
cogent3
apps. Once we have constructed a pipeline, filtering data is like a well-maintained highway, it's efficient and straightforward. More importantly, these pipelines can be integrated into the alignment process, ensuring automated data provenance and reproducibility.⚠️ Speed bumps
Small issues that can be easily fixed but might slow you down if you don't know how to fix them.
Overloaded Sequence Labels
Sometimes a fasta file will have an overloaded sequence label. For example, the label of this tardigrade mitochondrial sequence contains a lot of information! Let's have a look using
load_seq
Example
Why is this an issue?
cogent3
, to annotate aSequence
(allowing you the functionality of features) requires an exact match between the names in the sequence and annotation files.Take a look at the help for
load_seq
and see if there is a way to get around this issue (without changing the name in the file, which would be tedious if you had a lot of files!).👀 Show solution
#### Solution - To change the name of the tardigrade sequence, use the `label_to_name` argument with a [lambda function](https://www.freecodecamp.org/news/python-lambda-function-explained/). - Lambda function: `lambda x: x.split()[0]` i.e., split the label on whitespace and take the first element. ```python seq_stripped = load_seq( overloaded_fasta, moltype="dna", label_to_name=lambda x: x.split()[0] ) seq_stripped.name ```🕳️ Potholes
Issues that you want to avoid, but if you hit one, it shouldn't be too hard to get out of if you are prepared.
File Formats Issues
Sometimes a file does not match the standard of its suffix. For example, a file with the suffix
.txt
may contain sequence data in fasta format.AM088141.txt
is an example of this. Let's see what happens if we try to load it usingload_seq
.Example
If you had a lot of files, it would be a pain to change the suffix for all of them. Take a look at the help for
load_seq
and see if there is a way to get around this issue.👀 Show solution
#### Solution We can use the `format` argument to specify the format of the file. ```python tardigrade_rRNA = load_seq(hidden_fasta, format="fasta", moltype="dna") tardigrade_rRNA ```🚧 Roadblocks
Roadblocks are issues that require a bit more thought to solve. There is often more than one way to get around a roadblock, you just need to find the right detour.
Which sequences are related?
Imagine you have just sequenced (and annotated) the genome of a new soil bacteria 🦠! A core step in science is to put your research into the context of what is known in the field. The context in this case is to compare to other soil bacteria, available in the REFSOIL reference dataset. For any sequence comparison or molecular evolution analyses, we will need information on what genes/sequences are homologous (we will be referring to orthologs since that's what is desired for species relationship reconstruction).
We have a collection of genomes and their annotations. We want to know what genes are related to each other. Can you come up with a strategy to solve this problem?
👀 Possible strategy
#### Use identical gene identifiers as a proxy for orthology! This is a hypothetical phylogenomic study based on the ~1k whole microbial genomes downloaded in GenBank format from NCBI. To do phylogenetics, we need to know orthologs. A naive strategy for identifying these is to select identically named genes from the genomes. Because of gene content variability (typical of microbes), there will be incomplete coverage. We are aiming to identify a set of genes present in as many species from this collection as possible. Using `cogent3`, and in particular the `annotation_db` module, we can select the maximum number of genes (identified by identical gene names) from the maximum number of species and write them to file. These ~90 genes in 370 species are in the data directory at the top of the section (in `data/raw/`) and will be used as an example for the rest of this section. You can load the `AnnotationDb` for yourself as follows: ```py from cogent3.core.annotation_db import GenbankAnnotationDb path = data_dir / "features.gbdb" db = GenbankAnnotationDb(source=path) db ``` Take a look at the documentation for annotation handling [here](https://cogent3.org/doc/cookbook/annotation_db.html#) and explore! Ask someone for tips if needed, and tell us about the experience.⛔️ Error: road not found
These are hard problems to solve.
Which sequences are actually related ?
Following on from the previous step, if we were to blindly assume that identical gene identifiers were indicative of sequences being orthologous, we would be putting a lot of trust into the person who annotated the sequences. The meta-data associated with a sequence is frequently dubious. How can we verify that the sequences are actually related? Although this is a specific example, the question of whether sequences purported to be orthologous, are actually orthologous, is an extremely common one!
👀 Possible strategy
#### Wrangling even when GenBank annotations give you trust issues 🫣 There is no "one size fits all" solution to this problem. So let's start exploring. Starting with just one gene selected from the previous step, `"alaS"`, which we can load into a `SequenceCollection` using `load_unaligned_seqs`. Accessing the `.num_seqs` attribute, we can see that there are 370 sequences in the collection. ```python3 from cogent3 import load_unaligned_seqs alaS_path = data_dir / "raw/alaS.fa" seq_coll = load_unaligned_seqs(alaS_path, moltype="dna") seq_coll.num_seqs ``` In `cogent3`, a k-mer-based approach offers a quick estimation of sequence relationships without requiring alignment. ##### "There's an app for that!" This approach is an example of `cogent3` "[apps](https://cogent3.org/doc/app/index.html)"! Apps are special functions designed to simplify complex tasks. Apps can be used individually, or combined to define a “composed function” (aka pipeline). Composed functions can be applied to a single, or thousands, of data file(s)! We need two apps for this task: - `jaccard_dist()`: Calculates pairwise Jaccard distances between the set of k-mers in each sequence within a `SequenceCollection`. - ` approx_pdist()`: Transforms the matrix of pairwise Jaccard distances into an approximation of "proportion sites different," a measure of sequence divergence. Note that before we can apply the apps to our data, we need to initialise them. ```python from cogent3.app.dist import jaccard_dist, approx_pdist # create instances of the apps p_dist = approx_pdist() j_dist = jaccard_dist() ``` To combine these apps into a pipeline, so that the output of the first is the input of the second, we add them together using the `+` operator. ```python pipeline = j_dist + p_dist ``` If we are applying the pipeline to a single dataset, we can simply provide the data set as the argument to the app. Later in this workshop, we will learn how to apply a pipeline to a collection of datasets using the `apply_to()` method. Applying the pipeline to our sequence collection returns a `DistanceMatrix` object with an approximation of the distance between each pair of sequences. ```python pw_pdists = pipeline(seq_coll) pw_pdists ``` ##### What does the data look like? A way to get a feel for the data is to visualise the distribution of all pairwise distances. Are there any outliers? ```python from itertools import combinations import plotly.express as px # get distances into list for plotting dists = [pw_pdists[pair] for pair in combinations(pw_pdists.keys(), 2)] fig = px.histogram(dists, nbins=50, labels={"value": "p-distance"}) fig.show(width=500, height=500) ``` How would you describe the distribution❓ ##### What's the worst-case scenario? Which are the (approximately) most divergent sequences: ```python import numpy as np # find max index (using numpy magic) max_index_flat = np.argmax(pw_pdists) max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape) # indexWhat does kath think...
i'm not convinced! But this is an open question.How to assess the quality of an alignment?
Why use cogent3 for alignment?
A quick note on why to use `cogent3` for alignments? - even once sequences have been aligned, annotated features remain on the sequences, and can be mapped onto the alignment. This allows you to do things like slice your new sequence based on an annotated gene on another sequence via the alignment. - novel codon aligners!Prior to aligning, we need to remove the terminal stop codons. Interestingly, some (cds) sequences have internal in-frame stop codons (@genbank 😬) that we need to remove altogether! There is a custom app in the
filter.py
script that will do this.Now we should be good to align!
cogent3
has a progressive alignment app that can align coding sequences using a codon mode. We can useget_app
to grab the app, and then apply it to our collection.As you can see, there are plenty of gaps! To take a look at non-degenerate positions, execute the following
❓ How can we assess the quality of this alignment?
cogent3
has three different alignment quality metrics, the information content score ("ic_score"
), the sum of pairs score ("sp_score"
), and the log-likelihood of the alignment ("cogent3_score"
). You can access them on an alignment in the following way:Are there other ways we can assess the quality of an alignment?
👀 Possible strategies
#### Visualisation We can quantify alignment quality via visualisation. This could be with dotplots, as we did earlier. We can also visualise the number of gaps and entropy using the `information_plot` method. ```py aln.information_plot() ``` There are a lot of gaps! In some regions, almost all sequences have a gap. Some regions where there are very few gaps. To see if there are certain sequences that are contributing to the gaps, we can use the `aln.count_gaps_per_seq()` which returns the number of gaps per sequence. Plotting this as a violin plot shows up the distribution of gaps per sequence. ```py gaps = aln.count_gaps_per_seq() fig = px.violin(gaps, box=True,) fig.update_layout(width=500, height=500) fig.show() ``` To remove sequences that are contributing a significant number of gaps, we can use the `omit_bad_seqs` method. `omit_bad_seqs` returns new alignment without sequences for which the number of uniquely introduced gaps exceeds the specified quantile of the distribution of gaps per sequence (shown in the violin plot). ```py aln_omitted = aln.omit_bad_seqs(quantile=0.99) aln_omitted.num_seqs ``` or use `omit_gap_seqs`., which removes sequences that have more than a specified percentage gap. ```py aln_omitted = aln_omitted.omit_gap_seqs(allowed_gap_frac=0.65) aln_omitted.num_seqs ```Off the mountain road and on the highway.
All that we've learned about our data can be seamlessly woven into a pipeline.
We've already seen that apps can be composed together to form a pipeline (recall the
jaccard_dist
andapprox_pdist
pipeline). However, we then applied this pipeline to a singleSequenceCollection
. We can also apply a pipeline to many files at once. This requires us to collect the files that we want to apply the data to in aDataStore
. If our files are already in a directory, this is easy! We just useopen_data_store
, with the following information:Similarly, we need a data store for the output, where the pipeline's result will be written out. Once more, we specify the path, suffix, and mode using
open_data_store
:Now let's build our pipeline. Directly from our exploration, we know that we need to do the following:
Additionally, there are filtering steps that are common to many analyses, many of these will be predefined apps. To take a look at all predefined apps, execute
available_apps()
in a notebook cell.Our methods may require a minimum length of alignment. For this, we would use the
min_length
app to filter out alignments that are too short. Perhaps we wish to apply methods to data where the impact of natural selection is minimised, for which we would use thetake_codon_positions
app to select only the third codon position.Apply the pipeline using the
apply_to
method using the input data store as the first argument. The pipeline can easily be parallelised using theparallel=True
argument. A more complex configuration can be done if required (like telling it to use MPI) but we won't go into this here! Importantly, assigning the result ofapply_to
to a variable will allow us to interrogate the results of the pipeline run.We can look at the summary of the pipeline run by accessing the
describe
attribute of the result object.Conveniently, failures do not cause the whole pipeline to fail. Importantly, they are tracked, if we have failures we can interrogate them in the
not_completed
directory in the data store. To get a summary of failures we can use thesummary_not_completed
attribute.Look inside the output directory to see the results of the pipeline run. In addition to the "curated" data, there is a directory with a detailed file for all failures. There is a directory that collects the log files, take a look at these to see a summary of the pipeline. It provides all the information needed to recreate the exact apps and the exact pipeline you ran, i.e., it is reproducible! 🎉
[^1]: The theorem assumes that nucleotides in the sequence are independent and identically distributed variables! Something that is not true for real sequences. [^3]: Functions in the
filter.py
script are not yet part of thecogent3
package as they are still in development. Use at your own risk! [^2]: This approach works best for sequences with a reasonably even distribution of nucleotides. For sequences with a skewed nucleotide composition, using numerical simulation to generate the null is a better approach, seeempirical_null_filter
!