About the output.

b-niu commented 4 years ago

Dear Professor Roth,

I have some questions about the output file.

In the README, it's mentioned that,

cellular_prevalence_std - Standard error of the cellular_prevalence estimate.

I think, generally speaking, std means standard deviation. Is it standard error in the case of pyclone-vi?

If we want to calculate the 95% CI of the CCF of a mutation, should we calculate it as:
```
cellular_prevalence ± 1.96 * cellular_prevalence_std
```
or
```
cellular_prevalence ± 1.96 * cellular_prevalence_std / sqrt(size)
```
Here "size" is the number of mutations assigned to each cluster_id.
I have also noticed that, in the output of pyclone-vi, every mutation_id in the same cluster_id shares the same CCF and std, which is different from PyClone's behaviour. Is that how it's designed?

Here are two examples.

PyClone-VI:

| mutation_id | sample_id | cluster_id | cellular_prevalence | cellular_prevalence_std | cluster_assignment_prob |
|---|---|---|---|---|---|
| chr10_120877246_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.2947 |
| chr10_1284136_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.4006 |
| chr10_97684075_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.9579 |
| chr11_118243762_T | R010_TR | 0 | 0.9840 | 0.0156 | 0.3202 |
| chr11_31703559_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.9918 |
| chr11_541549_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.3006 |
| chr11_57564176_G | R010_TR | 0 | 0.9840 | 0.0156 | 0.6551 |
| chr11_64507462_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.9929 |
| chr11_66373098_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.6878 |
| chr11_82877429_G | R010_TR | 0 | 0.9840 | 0.0156 | 0.9945 |
| chr11_82877430_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.9925 |
| chr12_110891640_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.8822 |
| chr12_49421501_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.6770 |
| chr12_49421502_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.6770 |
| chr12_57624480_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.9985 |
| chr13_21439555_T | R010_TR | 0 | 0.9840 | 0.0156 | 0.2805 |
| chr14_102461723_C | R010_TR | 0 | 0.9840 | 0.0156 | 0.6659 |
| chr14_103109576_A | R010_TR | 0 | 0.9840 | 0.0156 | 0.8032 |

PyClone:

| mutation_id | sample_id | cluster_id | cellular_prevalence | cellular_prevalence_std | variant_allele_frequency |
|---|---|---|---|---|---|
| chr10_100401647_C | R003_TL8 | 0 | 0.18071347441715985 | 0.016903199731462922 | 0.06382978723404255 |
| chr10_101639648_C | R003_TL8 | 0 | 0.17982105619361255 | 0.015457517029176877 | 0.05825242718446602 |
| chr10_101640057_G | R003_TL8 | 0 | 0.1818155689181625 | 0.02053535541170497 | 0.06666666666666667 |
| chr10_103454403_C | R003_TL8 | 0 | 0.17823996988723376 | 0.015462692424096456 | 0.04918032786885246 |
| chr10_104128994_T | R003_TL8 | 0 | 0.2127829659110728 | 0.07472412308509058 | 0.13333333333333333 |
| chr10_104129015_T | R003_TL8 | 0 | 0.2187090823368248 | 0.08078620686140944 | 0.13953488372093023 |
| chr10_104836814_C | R003_TL8 | 0 | 0.2546815776966093 | 0.10477168311907836 | 0.16666666666666666 |
| chr10_105160212_C | R003_TL8 | 0 | 0.18108585561346474 | 0.014686502764975476 | 0.07333333333333333 |
| chr10_105885264_G | R003_TL8 | 0 | 0.22907172278966256 | 0.08973299189257782 | 0.15 |
| chr10_105885268_C | R003_TL8 | 0 | 0.20204729995466708 | 0.06216311605096555 | 0.11904761904761904 |
| chr10_105956703_G | R003_TL8 | 1 | 0.4397005742926062 | 0.07706271883953303 | 0.25 |
| chr10_105956709_C | R003_TL8 | 1 | 0.3856536712062231 | 0.09494088610156436 | 0.21311475409836064 |
| chr10_112581108_G | R003_TL8 | 0 | 0.19207114875896375 | 0.04454604366819066 | 0.10344827586206896 |
| chr10_11308595_G | R003_TL8 | 0 | 0.18812766472747872 | 0.036868029346953456 | 0.09090909090909091 |
| chr10_114910828_C | R003_TL8 | 0 | 0.18290222729160732 | 0.022546855377296227 | 0.0759493670886076 |
| chr10_115664633_C | R003_TL8 | 0 | 0.18959340326205318 | 0.03908659880473149 | 0.1 |
| chr10_118891744_C | R003_TL8 | 0 | 0.17147225370889752 | 0.024161940801417633 | 0.03389830508474576 |
| chr10_119798701_A | R003_TL8 | 0 | 0.18078720837352374 | 0.01691467466343686 | 0.06451612903225806 |
| chr10_12056042_T | R003_TL8 | 0 | 0.18471136799246454 | 0.0274178450380843 | 0.08333333333333333 |
| chr10_12056078_T | R003_TL8 | 0 | 0.1845903842023367 | 0.026972653638206962 | 0.08333333333333333 |
| chr10_123256076_G | R003_TL8 | 0 | 0.2194958894536334 | 0.0823895430972849 | 0.14285714285714285 |

aroth85 commented 4 years ago

It is the standard deviation, i.e. the square root of the variance. It is computed from the posterior distribution of the CCF for the cluster.
It would be cellular_prevalence ± 1.96 * cellular_prevalence_std, with no division by sqrt(size). That assumes the variables follow a Gaussian, which they likely don't. The posterior may be multimodal, for example, though that is rare if the cluster has more than two mutations assigned. It is probably better to think of the standard deviation as a relative measure of confidence for comparing estimates between clusters.
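For readers who want the arithmetic spelled out, here is a minimal sketch of the Gaussian approximation described above. The function name `approx_interval` is hypothetical and not part of PyClone-VI. Clipping to [0, 1] is an added assumption, on the grounds that a CCF is a proportion. As noted above, the posterior is likely not Gaussian, so treat the result as a rough guide rather than a true 95% interval.

```python
def approx_interval(mean, std, z=1.96):
    """Gaussian-approximation interval mean +/- z * std.

    Assumption (not from PyClone-VI): bounds are clipped to [0, 1]
    because a cellular prevalence cannot fall outside that range.
    """
    lower = max(0.0, mean - z * std)
    upper = min(1.0, mean + z * std)
    return lower, upper


# Values for cluster 0 of sample R010_TR from the table in this thread.
lower, upper = approx_interval(0.9840, 0.0156)
print(f"{lower:.4f} - {upper:.4f}")  # prints "0.9534 - 1.0000"
```

The unclipped upper bound here would be about 1.0146, which illustrates why the Gaussian assumption breaks down for clusters near a CCF of 1.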
Yes, it is expected that all mutations in a cluster share the same values. The CCF quoted is the mean value of the cluster the mutation is assigned to. This differs from PyClone, where we compute the mean value of the CCF for each mutation across the MCMC samples. The latter better represents uncertainty over clustering, but I suspect it makes little difference in practice.
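Because the values are per cluster, the per-mutation rows can be collapsed to one summary per cluster. The sketch below assumes the results file is tab-separated with the column names shown in the table above; `summarize_clusters` and `EXAMPLE_TSV` are hypothetical names, and the three rows are copied from this thread.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the PyClone-VI table in this thread.
# Assumption: the results file is tab-separated.
EXAMPLE_TSV = (
    "mutation_id\tsample_id\tcluster_id\tcellular_prevalence\t"
    "cellular_prevalence_std\tcluster_assignment_prob\n"
    "chr10_120877246_A\tR010_TR\t0\t0.9840\t0.0156\t0.2947\n"
    "chr10_1284136_C\tR010_TR\t0\t0.9840\t0.0156\t0.4006\n"
    "chr10_97684075_A\tR010_TR\t0\t0.9840\t0.0156\t0.9579\n"
)


def summarize_clusters(tsv_text):
    """Collapse per-mutation rows to one summary per (sample_id, cluster_id)."""
    summary = defaultdict(lambda: {"size": 0, "estimates": set()})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row["sample_id"], row["cluster_id"])
        summary[key]["size"] += 1
        # Collect distinct (mean, std) pairs seen within the cluster.
        summary[key]["estimates"].add(
            (float(row["cellular_prevalence"]), float(row["cellular_prevalence_std"]))
        )
    return dict(summary)


for key, info in summarize_clusters(EXAMPLE_TSV).items():
    # A single distinct (mean, std) pair means the values are per cluster.
    print(key, info["size"], sorted(info["estimates"]))
```

The same check applied to the PyClone output would show many distinct pairs per cluster, since those values are computed per mutation.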

b-niu commented 4 years ago

Thank you, Professor. My questions have been answered.

Best regards, Bing

Roth-Lab / pyclone-vi

About the output. #2