Write a tutorial that explains how nucleotide distance matrices are generated

Deena-B commented 5 years ago

Aim

Write a tutorial (with pictures) that explains how nucleotide distance matrices are generated

Background

Since it is best to start simple and get more complex, we are starting by understanding how the difference (aka distance) between two nucleotide strings (.fasta files) is calculated. This can also be called the 'hamming distance' or sequence identity.

Sequence identity is calculated by going along two aligned nucleotide strings and asking "are these nucleotides matched or mismatched?" If they are a match, they score 1, if they are not matched, they score 0. The sum of scores is then divided by the number of nucleotides in the aligned sequence. The distance (or difference) is 1 minus the identity. Some packages take the square root of (1-identity).

The calculation above is performed between all possible sequence pairs in an alignment and the values are put into a table (often referred to as a matrix by R people). This table is called a distance matrix and it is used to generate lineage trees (aka hierarchical clusters or dendrograms).

We need to see what calculations are used by commonly used packages and decide which one we want to use.

Eventually we will figure out a way to calculate distances between .vcf files.

Deena-B commented 5 years ago

Pasted from a conversation on Slack between Mei Eisenbach, Nick Tilmans, & Deena

BioPython uses a DistanceCalculator class R uses dist.alignment

The R page above says: "These functions compute a matrix of pairwise distances from aligned sequences using ... identity matrix (for protein and DNA sequences). The resulting matrix contains the squared root of the pairwise distances. For example, if identity between 2 sequences is 80 the squared root of (1.0 - 0.8) i.e. 0.4472136."

BioPython's DistanceCalculator has a function _pairwise(self, seq1, seq2) where the docstring says Calculate pairwise distance from two sequences (PRIVATE). ...

The BioPython function uses scoring_matrix and skip_letters, which are defined in __init__. The default is identity.

It looks like BioPython does not take the square root (like R does)

(Nicolas Tilmans) Frequently the root is taken to squash directionality and to blunt the effect of outlier distances. Hamming distance isn’t squared internally (euclidean is) and doesn’t appear to be signed so I’m not sure why here. Could be to make very similar sequences not swamp signal from less similar sequences?

Deena-B commented 5 years ago

We need to understand the difference between identity, blastn, and trans - the three algorithms used by biopython to create distance matrices

Deena-B commented 5 years ago

When there is a gap '-' what score is given? Presumably it is a zero, because they don't match. Importantly, gaps should be counted in the total number of nucleotides in the string (i.e. the denominator).

deepcelllineage / mitolin

Write a tutorial that explains how nucleotide distance matrices are generated #12

Aim

Background