ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

polyclonal or monoclonal model from river plot #97

Closed ada6090 closed 1 year ago

ada6090 commented 1 year ago

Hi, I am wondering what the best way is to visualize the clones in a tree, in order to accurately assess whether the evolution is monoclonal or polyclonal. I checked the stories Rdata and looked at the tree, but it only shows all the previous parents for each clone. Maybe I can interpret that from the clonality values, but I am not sure how. Thanks in advance for the help.

ChristofferFlensburg commented 1 year ago

Hi

Yep that information is hidden in

stories$stories$myINDIVIDUAL$consistentClusters$cloneTree

which is a nested list representing the nested clones that define the phylogeny. As in, a sub-list is a sub-clone. This is also represented in the river plots, where the shapes of subclones are placed within the shapes of the containing clones.
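As a rough illustration of how a nested structure like this encodes a phylogeny, here is a Python sketch that walks a nested dict standing in for `cloneTree`. The clone labels, the dict layout and the use of "germline" as the root are all assumptions made for the example. The real object is an R nested list, so it is worth inspecting it with `str()` before relying on any particular shape.

```python
def tree_edges(clone_tree, parent="germline"):
    """Return (parent, child) pairs from a nested dict of clones."""
    edges = []
    for clone, subclones in clone_tree.items():
        edges.append((parent, clone))
        # A nested dict under a clone holds that clone's subclones.
        edges.extend(tree_edges(subclones, parent=clone))
    return edges


# Hypothetical tree: clone 1 contains clones 2 and 3, and clone 2 contains clone 4.
example_tree = {"1": {"2": {"4": {}}, "3": {}}}

edges = tree_edges(example_tree)
for parent, child in edges:
    print(f"{parent} -> {child}")

# Clones directly under the root are the founding clones of the sample.
founders = [child for parent, child in edges if parent == "germline"]
```

If more than one clone sits directly under the root, the samples would contain unrelated founding clones, which is one way to read "polyclonal". That reading is an interpretation, not something superFreq reports.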

The best way to visualise a phylogeny in general is a bigger question than a GitHub issue, though. There are many options, but this isn't the place for that discussion.

maheshworpaudel5001 commented 1 year ago

What are the numbers provided for each sample in clonestories.tsv? I tried to match the mean of the clonality for each sample for a particular clone and it did not match. For example, this is part of the clonestories.tsv file (the first field of each data row is the row name):

"call" "x1" "x2" "stories.brmet018.1.TUMOR" "stories.brmet018.2.TUMOR" "stories.brmet018.3.TUMOR" "stories.brmet018.NORMAL" "errors.brmet018.1.TUMOR" "errors.brmet018.2.TUMOR" "errors.brmet018.3.TUMOR" "errors.brmet018.NORMAL"
"germline" "germline" NA NA 1.0000000000 1.0000000000 1.0000000000 1.0000000000 0.000000000 0.000000000 0.000000000 0.000000000
"1" "clone" NA NA 0.9331292403 0.9729403067 0.8971861589 0.0000000000 0.074514786 0.083448688 0.102813841 0.038155813
"2" "clone" NA NA 0.7410914912 0.7855776870 0.5641542102 0.0005927532 0.048251091 0.054535202 0.035836868 0.031930887
"3" "clone" NA NA 0.3333948060 0.3457638553 0.2457406105 0.0001065781 0.008011892 0.008576646 0.007374893 0.002011506
"4" "clone" NA NA 0.2614028213 0.3624337906 0.0116547500 0.0000000000 0.028613919 0.034979801 0.010097572 0.007012154
"5" "clone" NA NA 0.0006424234 0.0007419056 0.2437522647 0.0001234505 0.002643772 0.002965609 0.011624705 0.002753008
"6" "clone" NA NA 0.0670725909 0.0623788856 0.0041156403 0.0000000000 0.031956929 0.036373523 0.014546472 0.013595052

If I list all of the clonalities of "stories.brmet018.1.TUMOR" for clone='1', these are the values: array([1. , 0.88535996, 0.9356897 , 1. , 1. , 1. , 1. , 1. , 0.48766686, 1. , 0.61865196, 1. , 0.85248031, 1. , 0.92466943, 0.52755911, 1. , 1. ])

Now how do I go from this data to a single number 0.9331292403 in this case which is listed as clonality of "stories.brmet018.1.TUMOR" in the tsv file?

I tried testing if the mean of the array is equal to this number, it is not. Help needed!

ChristofferFlensburg commented 1 year ago

I don't know what that file is, you need to tell me how you generated it. :D It looks like the clonalities of the clones across samples, but I don't know how that file was generated.

Each clone has a single clonality for each sample, and those seem to be the numbers listed here. 🤷

maheshworpaudel5001 commented 1 year ago

I have tumor samples sequenced in 3 different regions, so I have 3 samples called brmet018.1.TUMOR, brmet018.2.TUMOR and brmet018.3.TUMOR. I also have a normal sample called brmet018.normal. I included these files in my metadata.tsv, and I provided 10 other normal samples that were sequenced in exactly the same way. With that, I ran superFreq. After the program completed, I went into the plots/brmet018/rivers directory, where there are pdf, excel and csv files with the clone information. I opened the brmet018-river.xls file and extracted the clonalities with the clone='1' identifier for the brmet018.1.tumor sample. These are the clonality values obtained: array([1. , 0.88535996, 0.9356897 , 1. , 1. , 1. , 1. , 1. , 0.48766686, 1. , 0.61865196, 1. , 0.85248031, 1. , 0.92466943, 0.52755911, 1. , 1. ])

Then I went to plots/brmet018/data, where I got clones_brmet018.tsv, whose content is the following:

ID stories.brmet018.1.TUMOR stories.brmet018.2.TUMOR stories.brmet018.3.TUMOR stories.brmet018.NORMAL errors.brmet018.1.TUMOR errors.brmet018.2.TUMOR errors.brmet018.3.TUMOR errors.brmet018.NORMAL
germline 1.0000000000 1.0000000000 1.0000000000 1.0000000000 0.000000000 0.000000000 0.000000000 0.000000000
1 0.9331292403 0.9729403067 0.8971861589 0.0000000000 0.074514786 0.083448688 0.102813841 0.038155813
2 0.7410914912 0.7855776870 0.5641542102 0.0005927532 0.048251091 0.054535202 0.035836868 0.031930887
3 0.3333948060 0.3457638553 0.2457406105 0.0001065781 0.008011892 0.008576646 0.007374893 0.002011506
4 0.2614028213 0.3624337906 0.0116547500 0.0000000000 0.028613919 0.034979801 0.010097572 0.007012154
5 0.0006424234 0.0007419056 0.2437522647 0.0001234505 0.002643772 0.002965609 0.011624705 0.002753008
6 0.0670725909 0.0623788856 0.0041156403 0.0000000000 0.031956929 0.036373523 0.014546472 0.013595052

Looking at the second file, clones_brmet018.tsv, the clonality for the clone='1' identifier in sample 'brmet018.1.TUMOR' is 0.9331292403. My question is: how is this number obtained? And how do I go from the array array([1. , 0.88535996, 0.9356897 , 1. , 1. , 1. , 1. , 1. , 0.48766686, 1. , 0.61865196, 1. , 0.85248031, 1. , 0.92466943, 0.52755911, 1. , 1. ]) to the number 0.9331292403?

ChristofferFlensburg commented 1 year ago

The .xls in the rivers directory has one line for each mutation, with the clone the mutation is assigned to and that mutation's estimated clonality across all samples. That is based on VAF and local copy number for SNVs, and on LFC and BAFs for CNAs.

The .tsv in data has one line for each clone, and the clonality of the clone is a weighted (inverse square of error) average of the clonalities of the anchor mutations defining the clone. Have a look in the methods section of the paper for more information on how clonal tracking is done in superFreq.
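To show why a plain mean of the per-mutation values does not reproduce the per-clone value, here is a minimal Python sketch of an average weighted by the inverse square of the error. The clonalities and errors below are made-up numbers, not taken from the files above, and the function name is hypothetical. Which mutations count as anchor mutations is decided inside superFreq and is not reproduced here.

```python
def weighted_clonality(clonalities, errors):
    """Inverse-square-error weighted mean: each value is weighted by 1 / error**2."""
    # Errors must be non-zero, otherwise the weight is undefined.
    weights = [1.0 / (e * e) for e in errors]
    total = sum(w * c for w, c in zip(weights, clonalities))
    return total / sum(weights)


# Hypothetical anchor mutations: clonality estimates and their errors.
clonalities = [1.0, 0.90, 0.50]
errors = [0.20, 0.05, 0.25]

plain_mean = sum(clonalities) / len(clonalities)
weighted = weighted_clonality(clonalities, errors)
print(plain_mean, weighted)
```

The mutation with the smallest error dominates the result, so the weighted value can sit well away from the unweighted mean of the same numbers.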

This might be an XY problem, maybe take a step back and explain what you are trying to do and why?

maheshworpaudel5001 commented 1 year ago

OK. I am confused by the definition of clonality in the paper. In many places the paper defines clonality as the cellular fraction of the sample, and I found that quite vague. Here is what I understood; please correct me if I am wrong. Imagine a tumor sample that has 3 different clones, all disjoint (to make life simpler), tagged '1', '2' and '3'. Now suppose that in the sample we find counts n1, n2 and n3 of the respective clones. Then the clonality of clone '1' is defined as clonality1 = n1 / (n1 + n2 + n3), and so on for the rest. Is that it?

Similarly, in the excel/csv files, where the clonality of each mutation is given for each clone in each sample, my understanding is that the clonality of a given mutation (on a particular chromosome, at a particular position, in a particular clone) is the fraction of such mutations occurring for that clone across all samples. For example, say that on chromosome 1 at position 12000, a C>A mutation was found 300 times across all samples and was assigned to clone 1. Then its clonality would be 300/(I do not know what to put here).

These are the reasons I need this information:

  1. I want to know the actual fraction (and maybe absolute counts) of the clones that were detected.
  2. I want to use this information later on to predict neoantigens using tools like pvactools/pvacseq.
  3. Finally, I want to create a simulation of the tumor ecosystem and model it to predict the survival of individual patients.

ChristofferFlensburg commented 1 year ago

Ok, so some confusion on what clonality is maybe.

A clone is a cell population originating from a single founder cell, and all its cells share the mutations of the founder cell. The clonality of a clone in a sample is the fraction of cells in the sample that originate from that founder cell. It is calculated through the clonality of the mutations defining the clone, i.e. the mutations of the founder cell.

The clonality of a mutation in a sample is the fraction of cells with that mutation. For example, an SNV with a variant allele frequency of 0.25 in a diploid region will be assigned a clonality of 0.5 (because 0.5 of the cells having one out of two alleles mutated gives a VAF of 0.25).
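The diploid example above can be written out as a small calculation. The sketch below is a simplified textbook relation, not superFreq's actual code: it assumes that every cell carrying the mutation has the same copy number state and that all other cells are diploid and unmutated. The function and parameter names are hypothetical.

```python
def clonality_from_vaf(vaf, mutated_copies=1, total_copies=2):
    """Fraction of cells carrying an SNV, given its variant allele frequency.

    Derived from VAF = f * m / (f * C + (1 - f) * 2), where f is the clonality,
    m the number of mutated copies and C the total copies in mutated cells.
    """
    denominator = mutated_copies - vaf * (total_copies - 2)
    if denominator <= 0:
        raise ValueError("VAF is too high for the given copy numbers")
    return 2.0 * vaf / denominator


# Heterozygous SNV in a diploid region, as in the example above.
print(clonality_from_vaf(0.25))  # 0.5
```

In the diploid heterozygous case this reduces to clonality = 2 × VAF. For other copy number states the same VAF maps to a different clonality, which is why the local copy number enters the estimate.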

The phylogeny, as in which clones are subclones and which are disjoint, is determined by a separate algorithm that runs later, based on the calculated clonalities of the clones.

Does that help?

maheshworpaudel5001 commented 1 year ago

It would be very helpful if you could point me in the direction of calculating the fraction of cells of a particular clone type in the samples, either from the clonality of each mutation (plots/patientID/rivers/patientID-river.tsv) or from the clonality of each clone (plots/patientID/data/clones_patientID.tsv). For example, I need data in a form something like this (this is a fake example):

Clone Fraction
1 0.85 (0.85)
2 0.80 (0.12)
3 0.20 (0.03)

Here I assumed the following phylogeny: 1 -> [2, 3] where 2 and 3 are disjoint. In the example, in the fraction column, the fractions are the fraction of cells of that type of clone for a clone family while that inside the bracket are the fraction in the overall sample. So, from the example, I would infer, 85% are pure clone 1 while remaining 15% are either clone 2 or clone 3. 80 % of the remaining 15 % are clone 2, and so the overall fraction for clone 2 is 12 %. The calculation for clone 3 is same as clone 2.

I am planning to run a simulation with a finite number of cells. If the data were the synthetic data I created above, I would put 85 cells of clone 1 type, 12 of clone 2 type and 3 of clone 3 type in my simulation.

In short, what I mean is: how do I calculate the cancer cell fraction from the clonality values?

ChristofferFlensburg commented 1 year ago

This is all explained in the superFreq paper linked from the README, and in the manual here on GitHub. I think there is even a YouTube video of a seminar linked in the README, where I talk about clonal tracking in superFreq, but I am not sure it exactly answers your concern. Note that all clonalities are sample cell fractions, not cancer cell fractions, as superFreq does not explicitly calculate a cancer purity.

The clonalities of the individual mutations and the clonalities of the clones are in the files described above.
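Given the definitions in this thread (a clone's clonality is the fraction of all cells in the sample descending from its founder, and subclones are nested inside their parent), the fraction of cells that belong to a clone but to none of its subclones follows by subtraction. The Python sketch below uses made-up clonalities and a made-up phylogeny, and the function name is hypothetical. It is not part of superFreq.

```python
def exclusive_fractions(clonality, children):
    """Fraction of sample cells in each clone that are in none of its subclones."""
    result = {}
    for clone, value in clonality.items():
        # Subtract the direct subclones; deeper descendants are already inside them.
        nested = sum(clonality[child] for child in children.get(clone, []))
        result[clone] = value - nested
    return result


# Hypothetical clonalities (sample cell fractions) for one sample.
clonality = {"1": 0.90, "2": 0.50, "3": 0.25}
# Hypothetical phylogeny: clones 2 and 3 are disjoint subclones of clone 1.
children = {"1": ["2", "3"]}

fractions = exclusive_fractions(clonality, children)
# Cells outside the founding clone (for example normal cells) make up the rest.
other_fraction = 1.0 - clonality["1"]
print(fractions, other_fraction)
```

Because the clonalities are estimates with errors, the subtraction can come out slightly negative for real data, so the results should be read together with the error columns.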

Good luck!