biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

calculate effective family size for MSA_ripe rule #26

Open tkosciol opened 7 years ago

tkosciol commented 7 years ago

The script should live in scripts/calculate_Neff.py. Once HHblits has completed, we need to calculate the effective family size from the a3m file.

N_eff = N_clu(%) / sqrt(N_aa)

Depending on N_eff and the length of the protein, the script should decide whether the MSA is "ripe" for Rosetta predictions or "not_ripe".
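A minimal sketch of what calculate_Neff.py could look like; the threshold value below is a placeholder for illustration, not an agreed-upon cutoff:

```python
# Hypothetical sketch of scripts/calculate_Neff.py; the default
# threshold is a placeholder, not the pipeline's actual cutoff.
import math


def calculate_neff(n_clusters, n_aa):
    """N_eff = N_clu(%) / sqrt(N_aa), per the formula above."""
    return n_clusters / math.sqrt(n_aa)


def msa_status(n_clusters, n_aa, threshold=8.0):
    """Label the MSA 'ripe' or 'not_ripe' for Rosetta predictions."""
    neff = calculate_neff(n_clusters, n_aa)
    return "ripe" if neff >= threshold else "not_ripe"
```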

tkosciol commented 7 years ago

@qiyunzhu don't work on this just yet. I will let you know when the time is right and give you more information on how it's supposed to look.

qiyunzhu commented 7 years ago

Sure! @tkosciol

tkosciol commented 7 years ago

Hi @qiyunzhu I think now is the time to look into this.

Input: MSA in a3m format. Example on barnacle in: /projects/microprot/benchmarking/contact_predictions/aln/hhblits

A Bash one-liner to convert A3M to a FASTA-like alignment format (and remove identical rows):

egrep -v "^>" example.a3m | sed 's/[a-z]//g' | sort -u > example.aln

So a quick and inaccurate way to do it would be to count the number of sequences in such an example.aln.
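A Python equivalent of this quick-count approach might look like the sketch below (the lowercase letters stripped by the sed step are a3m insertion columns):

```python
def quick_neff(a3m_path):
    """Quick-and-inaccurate N_eff proxy: count unique aligned rows.

    Mirrors the Bash pipeline above: skip headers (egrep -v "^>"),
    drop lowercase insertion columns (sed 's/[a-z]//g'),
    deduplicate (sort -u), and count.
    """
    rows = set()
    with open(a3m_path) as fh:
        for line in fh:
            if line.startswith(">") or not line.strip():
                continue
            rows.add("".join(c for c in line.strip() if not c.islower()))
    return len(rows)
```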

To be more accurate, I'd calculate the Hamming distance between all pairs of sequences in the aln and cluster together sequences with >80% identity; then, following the formula, we'd calculate: N_eff = N_clu(80%) / sqrt(N_aa)

Note, those two approaches give results on different scales! And that's fine, as long as I know which one we're using. :)
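A sketch of the more accurate variant; it uses a greedy pass as a stand-in for the pairwise >80% identity clustering (a full hierarchical clustering would be a refinement):

```python
import math


def hamming_identity(a, b):
    """Fraction of identical positions between two equal-length aligned rows."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_clusters(seqs, identity=0.80):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches at >= `identity`; otherwise it seeds a
    new cluster."""
    reps = []
    for s in seqs:
        if not any(hamming_identity(s, r) >= identity for r in reps):
            reps.append(s)
    return len(reps)


def neff(seqs, identity=0.80):
    """N_eff = N_clu(identity) / sqrt(N_aa), with N_aa the alignment length."""
    return count_clusters(seqs, identity) / math.sqrt(len(seqs[0]))
```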

sjanssen2 commented 7 years ago

Why are process_fasta.py and its test touched? @qiyunzhu Is that due to a missing pull/merge of your branch from master?

qiyunzhu commented 7 years ago

Hi @sjanssen2 I am not clear on the jargon. Let me talk with you in the office.

sjanssen2 commented 7 years ago

Here are my thoughts on the overall problem:

A set of sequences that are supposed to form one alignment are evolutionarily related. But we might face the situation that some subsets are closer to each other (e.g. in terms of sequence identity) than other, more distantly related sequences. When counting observed residues, we want to correct for this overestimation in the closer subset by giving those sequences a lower weight - right?

The effective sequence number is then the sum of those weights - at least if we follow HMMer's definition: http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf page 58. (btw I think Sean is doing a great job on those statistics - I did some reimplementations of HMMer and Infernal and found their weighting and expectation-maximization schemes extremely useful).

If I remember correctly, HMMer uses "number of matching characters / length of shorter sequence" as the measure for two sequences (essentially a Hamming-style measure, right?). The next step is a UPGMA clustering on the computed distance matrix (which is scipy's linkage with method='average'). Those two steps seem to be as in HMMer.

From my notes (ca. 4 years ago; replace Infernal with HMMer): After constructing a cluster tree that contains all sequences of the MSA, Infernal has to infer sequence weights. At first, the weights for the leaves (sequences) are just the lengths of the single edge that goes to each leaf. These weights are then increased by the path length up to the cluster root, with path lengths themselves weighted by the sum of paths in a subtree. In a last step, the sequence weights w_i get normalized to the number of sequences in the MSA, n.

If we then sum over those weights, we should have a statistically sound way of computing the effective sequence number.
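To make the weighting idea concrete, here is a pure-Python sketch on a hand-built cluster tree. The nested-tuple tree encoding and the equal-division rule are simplifications for illustration, not HMMer's or Infernal's actual implementation (which distributes edge lengths proportionally):

```python
def gsc_weights(tree):
    """Tree-based (GSC-style) sequence weights, simplified.

    `tree` is nested tuples (left, right, left_edge, right_edge);
    leaves are sequence names (strings). Each edge's length is shared
    equally among the leaves below it, and the result is normalized so
    the weights sum to the number of sequences, as described above.
    """
    weights = {}

    def leaves(node):
        if isinstance(node, str):
            return [node]
        return leaves(node[0]) + leaves(node[1])

    def walk(node, per_leaf):
        # per_leaf accumulates each leaf's share of the edges above it
        if isinstance(node, str):
            weights[node] = per_leaf
            return
        left, right, d_left, d_right = node
        walk(left, per_leaf + d_left / len(leaves(left)))
        walk(right, per_leaf + d_right / len(leaves(right)))

    walk(tree, 0.0)
    n = len(weights)
    total = sum(weights.values())
    return {name: w * n / total for name, w in weights.items()}
```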

What do you think @tkosciol @qiyunzhu ?

rob-knight commented 7 years ago

PyCogent has a fairly large set of sequence weighting mechanisms (Voronoi etc.) that Sandra Smit implemented a few years back. They likely haven't been ported to skbio, but it would be great if someone wanted to take that on. cc'd Greg to comment.

Rob


sjanssen2 commented 7 years ago

A recent example of why sequence weights are important: Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance

tkosciol commented 7 years ago

@sjanssen2 First off, apologies for the delay in response. Second, in principle you're right - that is an oversimplification of the problem to some extent. But this specific application has only one purpose - take an MSA and count the number of effective sequences in order to assess whether the domain is "ripe" for de novo structure predictions, i.e. if it has enough non-redundant sequences in a family for contact predictions to be useful. For those protein families which don't have enough sequences, we assume they're not "ripe" yet, and we leave them out for now, in order not to waste CPU time. We will revisit them once we update our sequence databases.

Actual weighting of sequences in MSAs occurs at the contact prediction step; contact prediction methods such as PSICOV or GREMLIN have their own sequence weighting schemes already implemented, and take advantage of this potentially redundant information, too!

So this is definitely something to look into, but I'd say that looking deeper into this problem is beyond the scope of this particular task.

sjanssen2 commented 7 years ago

I see your point, but I would assume that computing the Hamming distance for all pairs is the costly part of this computation. Thus, there might not be too much overhead (in terms of compute time) in obtaining precise sequence weights, and thus N_eff, compared to your more heuristic approach.

I'd suggest recycling code from https://github.com/pycogent/pycogent/blob/master/cogent/align/weights/methods.py, especially the GSC method.

tkosciol commented 7 years ago

agreed! Sounds like a good suggestion. For the time being, we will merge this PR with minor adjustments and I will leave this issue open so that we can make improvements later. Porting the code from PyCogent might be a more involved process.