PoonLab / vindels

Developing an empirical model of sequence insertion and deletion in virus genomes

Acute / chronic N-glyc sites #74

Closed (jpalmer37 closed 5 years ago)

jpalmer37 commented 5 years ago

Context: I downloaded all available gp120 sequence data from LANL that carried one of three tags (acute, chronic, AIDS). I created an MSA of the conserved regions, generated a RAxML tree, pruned this tree down to 50% of the sequences, and extracted all N-glyc sites from these sequences.
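For readers following along, the extraction step can be sketched roughly as below. This is a minimal Python illustration, not the script used in this thread (the analysis here was done in R). It assumes the common sequon definition N-X-[S/T] with X not proline, and the names `find_pngs` and `count_pngs` are hypothetical.

```python
import re

# Assumed sequon definition: N, then any residue except P, then S or T.
# The zero-width lookahead lets overlapping motifs (e.g. "NNTS") each match.
PNGS = re.compile(r"(?=N[^P][ST])")

def find_pngs(seq):
    """Return 0-based start positions of PNGS in an amino acid sequence.

    Alignment gaps ('-') are stripped first, so positions refer to the
    ungapped sequence, not to alignment columns.
    """
    ungapped = seq.upper().replace("-", "")
    return [m.start() for m in PNGS.finditer(ungapped)]

def count_pngs(seq):
    """Number of PNGS in one sequence (the per-sequence count summarized per group)."""
    return len(find_pngs(seq))
```

Stripping gaps before matching keeps a motif that is split by an alignment gap from being missed; comparing sites across sequences would additionally need a mapping back to a shared coordinate system, which is not shown here.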

You mentioned previously that I could simply look at the distributions of N-glyc counts in acute and chronic patients.

> summary(acute$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17.00   23.00   25.00   24.48   26.00   29.00 
> summary(chronic$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.00   23.00   25.00   24.44   26.00   30.00 

Apart from the lowest counts in the chronic group, acute and chronic sequences in this data set appear to have little to no difference in their distribution of glycosylation site counts.

Is there anything you'd like me to test with this current data set? Or is there a different source of acute/chronic patient data you can think of?

ArtPoon commented 5 years ago

The minimum counts are always going to be variable and sensitive to sample size. I would start with a simple visualization of PNGS frequencies by site (two bar plots, acutes above the horizontal axis and chronic/AIDS below).
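A rough sketch of the numbers behind such a mirrored plot, in Python for illustration only. It assumes each sequence has already been reduced to a collection of PNGS site positions in a shared coordinate system (a hypothetical input format), and `site_frequencies` and `mirrored_bars` are made-up names.

```python
from collections import Counter

def site_frequencies(site_sets):
    """Fraction of sequences in one group that carry a PNGS at each site.

    site_sets: one collection of site positions per sequence (assumed format).
    The denominator is the number of sequences in the group.
    """
    n = len(site_sets)
    if n == 0:
        return {}
    # set(sites) guards against counting a site twice within one sequence.
    counts = Counter(site for sites in site_sets for site in set(sites))
    return {site: c / n for site, c in counts.items()}

def mirrored_bars(acute_sets, chronic_sets):
    """Rows of (site, acute_height, chronic_height) for a mirrored bar plot.

    Acute frequencies are positive (drawn above the axis); chronic
    frequencies are negated (drawn below). A site seen in only one
    group gets a height of 0 in the other.
    """
    acute = site_frequencies(acute_sets)
    chronic = site_frequencies(chronic_sets)
    rows = []
    for site in sorted(set(acute) | set(chronic)):
        below = -chronic[site] if site in chronic else 0.0
        rows.append((site, acute.get(site, 0.0), below))
    return rows
```

Taking the union of sites keeps group-specific sites visible, and dividing by the group size makes the two groups comparable even when they contain different numbers of sequences.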

jpalmer37 commented 5 years ago

The figure you requested previously: [figure: nglyc-frequencies]

And a scatter plot comparing frequencies of N-glyc sites common to both acute and chronic cases. (N-glyc sites unique to one of the groups, which would fall along the x and y axes, were excluded for now.) [figure: nglyc-scatter]

ArtPoon commented 5 years ago

I'm really skeptical about the barplot. Neither of the papers from the CAPRISA/CHAVI group (subtype B, subtype C) mentions these fixed differences between acute and chronic populations.

jpalmer37 commented 5 years ago

You have a good point. It wouldn't make sense that they missed something so prominent. I'll double-check my algorithm for making that plot and visually check whether these sites are in fact different in essentially all cases.

ArtPoon commented 5 years ago

Also check that their sequences are in your data sets - if they don't find stark differences between acutes and chronics in their data, and you have those sequences in YOUR data, then it shouldn't be possible for you to see 0% of a PNGS in acute and 100% in chronic.

jpalmer37 commented 5 years ago

I figured it out. Really silly mistake on my part. I set my ylim parameter to max out at 250 when the max count for acute was 257. Four bars were excluded because they got cut off.
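One way to guard against this kind of clipping is to derive the axis limit from the data and to check for bars that exceed a hard-coded limit. The sketch below is an illustrative Python version (the plot in this thread was made in R); `padded_ylim` and `clipped_bars` are hypothetical helper names, and the 5% padding is an arbitrary choice.

```python
def padded_ylim(*count_groups, pad=0.05):
    """Upper y-axis limit that clears the tallest bar across all groups.

    count_groups: one list of per-site counts per group.
    pad: fractional headroom above the tallest bar (assumed default).
    """
    tallest = max((max(g) for g in count_groups if g), default=0)
    return tallest * (1 + pad)

def clipped_bars(counts, ylim):
    """Indices of bars whose height exceeds the given axis limit."""
    return [i for i, c in enumerate(counts) if c > ylim]
```

With a limit of 250 and a maximum count of 257, `clipped_bars` would flag the affected bars before the figure is drawn, while a limit from `padded_ylim` would flag none.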

Here's the corrected figure of the N-glyc counts: [figure: nglyc-counts]

and the corrected scatterplot (added in the values unique to one group and fixed the denominator): [figure: nglyc-scatter]

jpalmer37 commented 5 years ago

Will pass on to Adam for reference data.