Draft content for block 3 of the workshop.

This is the superset of issues #27, #28, and #29.

Data

Workshop data available here (link takes you to dropbox)
Decompress the file and move it to your Cogent3_Workshop_2023 folder.
- Refer to this wiki page for instructions on importing data into your Docker environment.

Please open up a blank notebook in your docker container and set the data directory as follows. Note, that if you are using a different directory structure within Docker, you will need to change the path here accordingly.

from pathlib import Path

data_dir = Path("LO3/data/") # change path here if different.

Now, we are ready to go!

The road from orthologs to alignment 🛣️🚙

data-curation-pipeline

The original vector used in the above image was created by Elena Akifeva.

Curating data is like navigating a treacherous road. In this section we will show strategies for data exploration, and show how to use the outcomes of data exploration for the construction of well-informed data-wrangling pipelines, using cogent3 apps. Once we have constructed a pipeline, filtering data is like a well-maintained highway, it's efficient and straightforward. More importantly, these pipelines can be integrated into the alignment process, ensuring automated data provenance and reproducibility.

⚠️ Speed bumps

Small issues that can be easily fixed but might slow you down if you don't know how to fix them.

Overloaded Sequence Labels

Sometimes a fasta file will have an overloaded sequence label. For example, the label of this tardigrade mitochondrial sequence contains a lot of information! Let's have a look using load_seq

Example

from cogent3 import load_seq

overloaded_fasta = data_dir / "EU244602.fa"
tardigrade_mt_seq = load_seq(overloaded_fasta, moltype="dna")
tardigrade_mt_seq.name

Why is this an issue?

Downstream analysis may require consistent naming to collate genomes from one species.
In cogent3, to annotate a Sequence (allowing you the functionality of features) requires an exact match between the names in the sequence and annotation files.

Take a look at the help for load_seq and see if there is a way to get around this issue (without changing the name in the file, which would be tedious if you had a lot of files!).

Tip Execute load_seq? in a notebook cell to see the help for the function.

👀 Show solution

#### Solution - To change the name of the tardigrade sequence, use the `label_to_name` argument with a [lambda function](https://www.freecodecamp.org/news/python-lambda-function-explained/). - Lambda function: `lambda x: x.split()[0]` i.e., split the label on whitespace and take the first element. ```python seq_stripped = load_seq( overloaded_fasta, moltype="dna", label_to_name=lambda x: x.split()[0] ) seq_stripped.name ```

🕳️ Potholes

Issues that you want to avoid, but if you hit one, it shouldn't be too hard to get out of if you are prepared.

File Formats Issues

Sometimes a file does not match the standard of its suffix. For example, a file with the suffix .txt may contain sequence data in fasta format. AM088141.txt is an example of this. Let's see what happens if we try to load it using load_seq.

Example

hidden_fasta = data_dir / "AM088141.txt"
tardigrade_rRNA = load_seq(hidden_fasta, moltype="dna")

If you had a lot of files, it would be a pain to change the suffix for all of them. Take a look at the help for load_seq and see if there is a way to get around this issue.

👀 Show solution

#### Solution We can use the `format` argument to specify the format of the file. ```python tardigrade_rRNA = load_seq(hidden_fasta, format="fasta", moltype="dna") tardigrade_rRNA ```

Note If the content of the file isn’t a supported format, that is a different can of worms.. we will discuss this problem later.

🚧 Roadblocks

Roadblocks are issues that require a bit more thought to solve. There is often more than one way to get around a roadblock, you just need to find the right detour.

Which sequences are related?

Imagine you have just sequenced (and annotated) the genome of a new soil bacteria 🦠! A core step in science is to put your research into the context of what is known in the field. The context in this case is to compare to other soil bacteria, available in the REFSOIL reference dataset. For any sequence comparison or molecular evolution analyses, we will need information on what genes/sequences are homologous (we will be referring to orthologs since that's what is desired for species relationship reconstruction).

We have a collection of genomes and their annotations. We want to know what genes are related to each other. Can you come up with a strategy to solve this problem?

👀 Possible strategy

#### Use identical gene identifiers as a proxy for orthology! This is a hypothetical phylogenomic study based on the ~1k whole microbial genomes downloaded in GenBank format from NCBI. To do phylogenetics, we need to know orthologs. A naive strategy for identifying these is to select identically named genes from the genomes. Because of gene content variability (typical of microbes), there will be incomplete coverage. We are aiming to identify a set of genes present in as many species from this collection as possible. Using `cogent3`, and in particular the `annotation_db` module, we can select the maximum number of genes (identified by identical gene names) from the maximum number of species and write them to file. These ~90 genes in 370 species are in the data directory at the top of the section (in `data/raw/`) and will be used as an example for the rest of this section. You can load the `AnnotationDb` for yourself as follows: ```py from cogent3.core.annotation_db import GenbankAnnotationDb path = data_dir / "features.gbdb" db = GenbankAnnotationDb(source=path) db ``` Take a look at the documentation for annotation handling [here](https://cogent3.org/doc/cookbook/annotation_db.html#) and explore! Ask someone for tips if needed, and tell us about the experience.

⛔️ Error: road not found

These are hard problems to solve.

Which sequences are actually related ?

Following on from the previous step, if we were to blindly assume that identical gene identifiers were indicative of sequences being orthologous, we would be putting a lot of trust into the person who annotated the sequences. The meta-data associated with a sequence is frequently dubious. How can we verify that the sequences are actually related? Although this is a specific example, the question of whether sequences purported to be orthologous, are actually orthologous, is an extremely common one!

👀 Possible strategy

#### Wrangling even when GenBank annotations give you trust issues 🫣 There is no "one size fits all" solution to this problem. So let's start exploring. Starting with just one gene selected from the previous step, `"alaS"`, which we can load into a `SequenceCollection` using `load_unaligned_seqs`. Accessing the `.num_seqs` attribute, we can see that there are 370 sequences in the collection. ```python3 from cogent3 import load_unaligned_seqs alaS_path = data_dir / "raw/alaS.fa" seq_coll = load_unaligned_seqs(alaS_path, moltype="dna") seq_coll.num_seqs ``` In `cogent3`, a k-mer-based approach offers a quick estimation of sequence relationships without requiring alignment. ##### "There's an app for that!" This approach is an example of `cogent3` "[apps](https://cogent3.org/doc/app/index.html)"! Apps are special functions designed to simplify complex tasks. Apps can be used individually, or combined to define a “composed function” (aka pipeline). Composed functions can be applied to a single, or thousands, of data file(s)! We need two apps for this task: - `jaccard_dist()`: Calculates pairwise Jaccard distances between the set of k-mers in each sequence within a `SequenceCollection`. - ` approx_pdist()`: Transforms the matrix of pairwise Jaccard distances into an approximation of "proportion sites different," a measure of sequence divergence. Note that before we can apply the apps to our data, we need to initialise them. ```python from cogent3.app.dist import jaccard_dist, approx_pdist # create instances of the apps p_dist = approx_pdist() j_dist = jaccard_dist() ``` To combine these apps into a pipeline, so that the output of the first is the input of the second, we add them together using the `+` operator. ```python pipeline = j_dist + p_dist ``` If we are applying the pipeline to a single dataset, we can simply provide the data set as the argument to the app. Later in this workshop, we will learn how to apply a pipeline to a collection of datasets using the `apply_to()` method. Applying the pipeline to our sequence collection returns a `DistanceMatrix` object with an approximation of the distance between each pair of sequences. ```python pw_pdists = pipeline(seq_coll) pw_pdists ``` ##### What does the data look like? A way to get a feel for the data is to visualise the distribution of all pairwise distances. Are there any outliers? ```python from itertools import combinations import plotly.express as px # get distances into list for plotting dists = [pw_pdists[pair] for pair in combinations(pw_pdists.keys(), 2)] fig = px.histogram(dists, nbins=50, labels={"value": "p-distance"}) fig.show(width=500, height=500) ``` How would you describe the distribution❓ ##### What's the worst-case scenario? Which are the (approximately) most divergent sequences: ```python import numpy as np # find max index (using numpy magic) max_index_flat = np.argmax(pw_pdists) max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape) # index .names to get the names of these sequences max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2]) max_pair ``` ##### Dotplots: Visualising the relationship between two sequences A dotplot is a visualisation of the relationship between two sequences. The x-axis is the first sequence, the y-axis is the second sequence. Where the sequences have segments that are similar beyond a threshold, a line is drawn. The visualisation backend for `cogent3` is `plotly`, which means that the dotplot is ✨interactive✨! Try zooming in by clicking and dragging on the plot. Double-click to reset the zoom. Click on the legend to toggle the visibility of the reverse complement matches. ```python dp = seq_coll.dotplot(max_pair[0], max_pair[1], rc=True) dp.show(width=500, height=500) ``` ❓Are you convinced that these sequences are orthologous?

What does kath think...

i'm not convinced! But this is an open question.

##### How can we quantify our confidence in orthology? When comparing sequences, a fundamental question arises: what's the probability of obtaining a similarity score purely by chance? [Karlin and Altschul (1990)](https://pubmed.ncbi.nlm.nih.gov/2315319/) derived the theoretical null distribution of local alignment scores[^1], enabling the calculation of the probability of a score equal to or higher than the obtained score (a p-value). This theory forms the basis of the BLAST algorithms. We can directly use this theory to calculate the probability that the observed similarity between two sequences is due to chance. Inside the script `filter.py`[^3] are a few functions that implement this logic. One of them is `theoretical_null_filter()`, an app that filters a `SequenceCollection` based on the theoretical null distribution, retaining only the sequences that satisfy a threshold of probability of being related![^2] To use `theoretical_null_filter()`, we need to set a reference sequence for comparison against all other sequences. Here, let's choose E. coli as the reference. Additionally, we'll specify a desired p-value threshold. ```python from LO3.filter import theoretical_null_filter # we want E. coli as our reference ref_seq = "NC_000913" # create instance of the app, with a threshold of 0.01 tnf = theoretical_null_filter(ref_seq, threshold=0.01) tnf ``` We can see that the app has been initialised with _E. coli_ as reference, there are other default parameters that we could have changed if we wanted to. Now we can apply the app to `seq_coll` to filter out likely non-homologous genes. ```python seq_coll_filtered = tnf(seq_coll) seq_coll_filtered.num_seqs ``` We started with 370 sequences, so the filtering process removed several sequences! ##### What does the most divergent pair look like now? ```python pw_pdists = seq_coll_filtered.distance_matrix() max_index_flat = np.argmax(pw_pdists) max_index_1, max_index_2 = np.unravel_index(max_index_flat, pw_pdists.shape) max_pair = (pw_pdists.names[max_index_1], pw_pdists.names[max_index_2]) dp = seq_coll.dotplot(max_pair[0], max_pair[1], rc=True) dp.show(width=500, height=500) ``` ❓Are you convinced that these sequences are orthologous? (this is again an open question!)

How to assess the quality of an alignment?

Why use cogent3 for alignment?

A quick note on why to use `cogent3` for alignments? - even once sequences have been aligned, annotated features remain on the sequences, and can be mapped onto the alignment. This allows you to do things like slice your new sequence based on an annotated gene on another sequence via the alignment. - novel codon aligners!

Prior to aligning, we need to remove the terminal stop codons. Interestingly, some (cds) sequences have internal in-frame stop codons (@genbank 😬) that we need to remove altogether! There is a custom app in the filter.py script that will do this.

from LO3.filter import remove_stop_codons

# its an app, so init it
no_stop_codons = remove_stop_codons(check_inframe=True)

# apply it to our collection
seq_coll_filtered = no_stop_codons(seq_coll_filtered)

# how many sequences did we lose?
seq_coll_filtered.num_seqs

Now we should be good to align! cogent3 has a progressive alignment app that can align coding sequences using a codon mode. We can use get_app to grab the app, and then apply it to our collection.

from cogent3 import get_app

codon_aligner = get_app("progressive_align", "codon")
aln = codon_aligner(seq_coll_filtered)
aln

As you can see, there are plenty of gaps! To take a look at non-degenerate positions, execute the following

aln.no_degenerates()

❓ How can we assess the quality of this alignment? cogent3 has three different alignment quality metrics, the information content score ("ic_score"), the sum of pairs score ("sp_score"), and the log-likelihood of the alignment ("cogent3_score"). You can access them on an alignment in the following way:

aln.alignment_quality("ic_score")

Warning Alignment quality scores are imperfect, and there are a few things to keep in mind:

what does a singular score tell you?

describing the quality of an alignment with a score that has good properties is an open problem.

the cogent3_score is only available for alignments produced using cogent3. It is a likelihood measure but suffers from underflow issues.

Are there other ways we can assess the quality of an alignment?

👀 Possible strategies

#### Visualisation We can quantify alignment quality via visualisation. This could be with dotplots, as we did earlier. We can also visualise the number of gaps and entropy using the `information_plot` method. ```py aln.information_plot() ``` There are a lot of gaps! In some regions, almost all sequences have a gap. Some regions where there are very few gaps. To see if there are certain sequences that are contributing to the gaps, we can use the `aln.count_gaps_per_seq()` which returns the number of gaps per sequence. Plotting this as a violin plot shows up the distribution of gaps per sequence. ```py gaps = aln.count_gaps_per_seq() fig = px.violin(gaps, box=True,) fig.update_layout(width=500, height=500) fig.show() ``` To remove sequences that are contributing a significant number of gaps, we can use the `omit_bad_seqs` method. `omit_bad_seqs` returns new alignment without sequences for which the number of uniquely introduced gaps exceeds the specified quantile of the distribution of gaps per sequence (shown in the violin plot). ```py aln_omitted = aln.omit_bad_seqs(quantile=0.99) aln_omitted.num_seqs ``` or use `omit_gap_seqs`., which removes sequences that have more than a specified percentage gap. ```py aln_omitted = aln_omitted.omit_gap_seqs(allowed_gap_frac=0.65) aln_omitted.num_seqs ```

Off the mountain road and on the highway.

All that we've learned about our data can be seamlessly woven into a pipeline.

We've already seen that apps can be composed together to form a pipeline (recall the jaccard_dist and approx_pdist pipeline). However, we then applied this pipeline to a single SequenceCollection. We can also apply a pipeline to many files at once. This requires us to collect the files that we want to apply the data to in a DataStore. If our files are already in a directory, this is easy! We just use open_data_store, with the following information:

the path to the data
the suffix of the data of interest (indicating that all files with this suffix are of are a "DataMember" of the DataStore)
the mode (read/write)
optionally, an argument to limit the number of files to read/write. In this case, because we're building a prototype and want things to be fast, we will limit the number of files to be 5.

from cogent3.app.io import open_data_store

in_dstore = open_data_store(data_dir / "raw", suffix=".fa", mode="r", limit=5)

Similarly, we need a data store for the output, where the pipeline's result will be written out. Once more, we specify the path, suffix, and mode using open_data_store:

# open data stores for the output
out_dstore = open_data_store(data_dir / "aligned", suffix=".fa", mode="w")

Now let's build our pipeline. Directly from our exploration, we know that we need to do the following:

filter out sequences that are likely not orthologous
remove stop codons (and check for internal stop codons!)
remove sequences that are contributing a significant number of gaps
aligning with the default parameters of the progressive codon aligner appears to be reasonable.

Additionally, there are filtering steps that are common to many analyses, many of these will be predefined apps. To take a look at all predefined apps, execute available_apps() in a notebook cell.

from cogent3.app import available_apps

available_apps()

Our methods may require a minimum length of alignment. For this, we would use the min_length app to filter out alignments that are too short. Perhaps we wish to apply methods to data where the impact of natural selection is minimised, for which we would use the take_codon_positions app to select only the third codon position.

# we can use ``get_app`` to grab the apps that we need
from cogent3 import get_app

reader = get_app("load_unaligned", moltype="dna")
filterer = get_app("theoretical_null_filter", ref_seq)
trim_stops = remove_stop_codons(check_inframe=True)
no_shorties = get_app("min_length", length=300)
codon_aligner = get_app("progressive_align", "codon")
writer = get_app("write_seqs", out_dstore)

# Now we can create the pipeline
align_pipeline = reader + filterer + trim_stops + no_shorties + codon_aligner + writer

Apply the pipeline using the apply_to method using the input data store as the first argument. The pipeline can easily be parallelised using the parallel=True argument. A more complex configuration can be done if required (like telling it to use MPI) but we won't go into this here! Importantly, assigning the result of apply_to to a variable will allow us to interrogate the results of the pipeline run.

r = align_pipeline.apply_to(
      in_dstore, 
      show_progress=True, 
      parallel=True
  )

We can look at the summary of the pipeline run by accessing the describe attribute of the result object.

r.describe

Conveniently, failures do not cause the whole pipeline to fail. Importantly, they are tracked, if we have failures we can interrogate them in the not_completed directory in the data store. To get a summary of failures we can use the summary_not_completed attribute.

  r.summary_not_completed

Look inside the output directory to see the results of the pipeline run. In addition to the "curated" data, there is a directory with a detailed file for all failures. There is a directory that collects the log files, take a look at these to see a summary of the pipeline. It provides all the information needed to recreate the exact apps and the exact pipeline you ran, i.e., it is reproducible! 🎉

[^1]: The theorem assumes that nucleotides in the sequence are independent and identically distributed variables! Something that is not true for real sequences. [^3]: Functions in the filter.py script are not yet part of the cogent3 package as they are still in development. Use at your own risk! [^2]: This approach works best for sequences with a reasonably even distribution of nucleotides. For sequences with a skewed nucleotide composition, using numerical simulation to generate the null is a better approach, see empirical_null_filter!

cogent3 / Cogent3Workshop

LO - 3 - From orthologs to alignment #40

Data

The road from orthologs to alignment 🛣️🚙

⚠️ Speed bumps

Overloaded Sequence Labels

Example

Why is this an issue?

🕳️ Potholes

File Formats Issues

Example

🚧 Roadblocks

Which sequences are related?

⛔️ Error: road not found

Which sequences are actually related ?

How to assess the quality of an alignment?

Off the mountain road and on the highway.