evogytis / fluB

Investigating the (co)evolution of reassorting influenza B lineages.
4 stars 0 forks source link

LD and temporal sampling #28

Closed trvrb closed 10 years ago

trvrb commented 10 years ago

Reading your paragraph on D' made me question the LD analysis a bit:

Your analysis of ΔTMRCA nicely controls for differences in sampling date between tips, however, the LD analysis just lumps everything together. Let's say we start with ab and then Ab fixes and then AB fixes, so that most early samples are ab and most late samples are AB. This will appear as strong LD between loci A and B, even though it's just temporal sampling. Would it make sense to bin years before calculating LD and then average across years?

evogytis commented 10 years ago

Yeah, I've thought about this before (hence the option to timeslice the alignment in the LD calculator). Three points on this:

  1. In the example you give only D' would give strong LD (because aB is not be observed), r^2 (which is roughly what our statistic is) will give LD that depends on the allele frequencies, for example if the transitional form Ab in your example is rare LD will be higher because ab and AB will be more frequent (which is close to what complete LD looks like). I think because r^2 quantifies P(a|b) it makes it resilient to temporal dynamics, as long as alleles arise relatively close in time to each other.
  2. There's a problem of visualization - I have a script that can produce animated heatmaps and I could animate the LD circles over time, which I bet would look really cool, but would make it difficult to include in a publication in a seamless way. Averaging seems like a good idea,
  3. By not timeslicing we are saying that the side branches are not that interesting - they're excluded either because they don't persist long enough to accumulate amino acid substitutions or if they do because they are below the 1% minor allele frequency cutoff (in a dataset of 1600 sequences it means that there have to be at least 16 sequences with a particular allele to calculate LD). Partitioning the sequences into time bins will be sensitive to whatever's been sampled within the bin and will also yield high LD between site pairs that belong to divergent lineages, which is not that interesting. Keeping the data as they are gives us an idea of what's persisted for long periods of time and you can think of it as the LD between trunks of each phylogenetic tree.

Having said that point 2 sounds very appealing to me. We could try to address the problems of binning in point 3 by raising the minor allele frequency cutoff and maybe doing sliding window LD rather than time binned LD. I imagine it would look like a plasma globe - a handful of sites on PB1, PB2 and HA will maintain strong links in the data whilst others belonging to transient lineages will appear and fade as you go forwards in time. I'm fine with either binning or keeping the data as it is, if you're more comfortable with binned data I can re-run the LD analyses.

trvrb commented 10 years ago

I'm not surprised you've already given this some thought. I'm happy to proceed however you'd prefer. Two best scenarios that I see:

  1. Keep things exactly as they are, focusing on LD between pairs of sites expressed via the circle network plot.
  2. Switch to per year LD measurements. As you saw, visualizing pairwise relationships between sites will be difficult. What about averaging across sites for a particular year and for a particular pair of segments. So, for example, you'll have average r2 for PB1/PB2 in 2002. This would basically recapitulate Figure 4.

I'm not sure if recapitulating Figure 4 or expressing per site LD is better.

evogytis commented 10 years ago

I've run the 1600 dataset through the LD calculator by taking sequences from a sliding window (window size = 3 or 4 years, moved by a year each time). Each comparison in the heatmap contains the mean of LD across all time slices, but it looks like a mess. There's no noticeable LD between any sites or it's in the wrong place. I'll check the code more closely tomorrow to see if I'm not messing something up. Also managed to get the LD plasma ball animation/plot to work. It looks good but I can also see a few interesting things I can't see in the heatmaps - for example you can notice reassortment events. In the time frame when the B/Waikato/6/2005-like PB1+2 / HA reassortants exist LD between PB1+2 and HA decays and then recovers after the reassortants go extinct. What I think it also shows is that what we get out of timeslicing the LD results will be very sensitive to sampling regime.

So I think it's best to leave the LD results as they are. I'd be willing to argue this with the reviewers (if they decide to pick it up) and I got a cool-looking animation for future presentations out of it.

trvrb commented 10 years ago

Sounds good. I like your plan. Thanks for going to the trouble to look into this. Closing.