PoonLab / vindels

Developing an empirical model of sequence insertion and deletion in virus genomes

Handling data sets that failed BEAST analysis #71

Closed: jpalmer37 closed this issue 5 years ago

jpalmer37 commented 5 years ago

1) Excessive population sizes:

I had to remove two patient data sets at the BEAST analysis stage of the pipeline due to their very large number of sequences (one of which was Vlad Novitsky's patient data set). Can anything be done to edit these data sets so they can successfully complete an analysis in BEAST?

2) Bad traces:

Two additional data sets had very poor trace outputs, despite having relatively manageable sample sizes (N = 101 and 114). Is there anything you'd like me to diagnose or check regarding these data sets?

ArtPoon commented 5 years ago

My usual trick is to fix the tree topology to the ML reconstruction - does the analysis still fail to converge?

If so, then you could randomly sub-sample the data.
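
For example, random sub-sampling in R with ape could look something like this (a rough sketch only; the file name and the target of 200 sequences are placeholders, not settings we've agreed on):

require(ape)

# read the alignment (equal-length sequences) as a DNAbin matrix; file name is a placeholder
aln <- read.dna('patient.fasta', format='fasta')

# draw a reproducible random subset of rows (sequences), e.g. 200 out of ~1000
set.seed(1)
keep <- sample(nrow(aln), 200)

# write the sub-sampled alignment back out for the BEAST run
write.dna(aln[keep, ], 'patient-sub.fasta', format='fasta')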

For the bad traces, how much variation is there in the alignments? Could be a lack of signal in the data. Try comparing your traces to the prior distributions.

jpalmer37 commented 5 years ago

I checked the alignments of these 4 problematic cases and the amount of variation appeared roughly the same as in the regular cases that finished normally. However, in the problematic cases that variation is spread across roughly ten times as many sequences (~1000), so there may not be enough variation per sequence to give a clear signal.

I ran these 4 data sets on the cluster with a population size of 3 and a fixed tree topology (the settings we deemed best for the BEAST analysis). The traces I retrieved are shown below. Are there other options you'd like to try?

[Four screenshots of the BEAST trace plots for these data sets, captured 2019-07-18]

ArtPoon commented 5 years ago

The trees are not properly rooted. Use rtt (from ape) to re-root them:

require(ape)

# read in the tree to be re-rooted
tr <- read.tree('~/git/vindels/beast/temp.nwk')

# get tip dates: the sampling time is the second '_'-delimited field of each tip label
temp <- sapply(tr$tip.label, function(x) strsplit(x, '_')[[1]][2])
tip.dates <- as.integer(unlist(temp))

# root-to-tip rooting: rtt picks the root that maximizes the correlation
# between divergence from the root and sampling time (the default objective)
tr1 <- rtt(tr, tip.dates)

and re-run BEAST analyses with these re-rooted trees.
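
For example (the output path below is just a placeholder), the re-rooted tree can be written back out so it can be supplied to the BEAST run:

# save the re-rooted tree to a Newick file for BEAST
write.tree(tr1, '~/git/vindels/beast/temp-rerooted.nwk')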

ArtPoon commented 5 years ago

I think we forgot to screen these data sets for temporal signal using TempEst or rtt. Can you please go back to the rtt results and see if there is a consistent positive correlation between divergence from the root (y-axis) and sampling times (x-axis)? This may explain why we are having trouble with some traces, where the size of the tree explodes and the clock rate collapses, or vice versa.
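
Continuing from the rtt snippet above (tr1 is the re-rooted tree, and the '_'-delimited tip-label format is assumed as before), a rough check of the root-to-tip regression in R would be:

# sampling times from the re-rooted tree's tip labels (second '_'-delimited field)
dates1 <- as.integer(sapply(tr1$tip.label, function(x) strsplit(x, '_')[[1]][2]))

# root-to-tip divergence (sum of branch lengths from the root) for each tip
div1 <- node.depth.edgelength(tr1)[1:Ntip(tr1)]

# a consistent positive slope suggests usable temporal signal
fit <- lm(div1 ~ dates1)
summary(fit)
plot(dates1, div1)
abline(fit)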

jpalmer37 commented 5 years ago

A root-to-tip screening step was incorporated into our analysis to address this. Out of 50 patient data sets, 17 were found to have insufficient root-to-tip signal, leaving 33 data sets for analysis. With this step in place, the BEAST results are far more promising and show strong convergence.

ArtPoon commented 5 years ago

Note we have the option of going back to these 17 data sets with a fixed clock rate.