VanLoo-lab / ascat

ASCAT R package
https://www.mdanderson.org/research/departments-labs-institutes/labs/van-loo-laboratory/resources.html#ASCAT
166 stars 85 forks source link

ASCAT from whole exome bam files #19

Closed crazyhottommy closed 6 years ago

crazyhottommy commented 7 years ago

Hi, I saw in the web page you are developing ASCAT to work on WES bams directly, how is it now?

Or, can you please direct me how to get the logR and BAF from the WES bam files?

Thank you! Tommy

haasek commented 6 years ago

Hi Tommy, Unfortunately, we don’t have a polished pipeline to use sequencing data as ASCAT input just yet, but we are working on an integration for the ASCAT package. I can describe the workflow that we are planning to integrate which has worked quite well so far. Assuming that you start from the bam file, this is how we usually go about it: We use alleleCount (https://github.com/cancerit/alleleCount) to run on bam files, with the SNP loci from the 1000 genomes project. To obtain the logR, we then use the read counts in the tumour sample (=countTumour) and in the normal (=countNormal). In a first step we just divide the tumour by the normal counts: tumour_logR=countTumour/countNormal. Then we transform the value into logarithm while at the same time normalising by the mean, as that helps centering it around 0: log2(tumour_logR/mean(tumour_logR)). The logR of the matched normal is afterwards simply set to 0. As alleleCount returns all counts, the BAF is calculated as the ratio between the maximum count and the complete coverage of the locus. Afterwards you should have tumour and normal LogR and BAF values that you can use in an ASCAT analysis.

After running the steps mentioned above, you have logR and BAF files which you can use in ASCAT analogous to array data. So we just add some pre-processing to create the correct input from the sequencing data, the ASCAT functions stay the same. Just make sure to set the gamma parameter to 1 when working with sequencing data.

Best wishes, Kerstin

isidroc commented 6 years ago

Thanks, Kerstin. So, just to be clear, is this the recommended way to apply ascat on WES data? I have seen that in the ascatNGS repo it is clearly indicated that ascatNGS is not suitable for WES, but no direction on where to find the steps to run ascat on WES is indicated.

Thanks

haasek commented 6 years ago

Hi Isidro, not all exome sequencing projects will lead to satisfying results with the method described above. It depends strongly on sequencing depth and the coverage across the genome, additionally very noisy LogR (e.g. due to GC bias) can be a problem. The initial raw data plots are usually quite telling if it is possible to use a sample for ASCAT analysis. But we have applied pre-processing as described to exome data and got good copy number profiles. I hope to upload the corresponding scripts soon.

Best wishes, Kerstin

messersc commented 6 years ago

Hi all,

I am also playing with ASCAT for one of our WES cohorts and I don't see a reason why it would not work. The key part is probably to pre-process the exomes in a good way, i.e. we will be using the CopywriteR R package - this gives you normalized coverage from the off-target reads of the WES for the LogR files. The BAF files can be generated in the same fashion as for WGS, I guess.

Things I want to test:

haasek commented 6 years ago

Yes, you are right the pre-processing is the important part when preparing WGS and WES for ASCAT. That can include filtering regions that produce very noisy signal, e.g. the HLA locus and correcting for potential GC bias. I'm not familiar with Copywriter, but I would expect that if you input segments that are too large, you will normalise over localised gains and losses. I think it is fine to include off-target reads to achieve a wider coverage of the genome.

xinlingl commented 6 years ago

Hi Kerstin, Is SNP loci from the 1000 genomes project mandatory for input of alleleCount? Can I use SNP loci information from other sources? Thanks a lot!

haasek commented 6 years ago

No, they aren't mandatory. We just need as many heterozyguous SNPs as possible across the genome to derive B-allele frequencies from and we chose the annotations from the 1000 genomes project to get common SNP positions. You can use other annotations to get that information.

xinlingl commented 6 years ago

Ok. Thanks a lot!

xinlingl commented 6 years ago

Hi Kerstin, Why is ascatNGs not suitable for WES analysis? I read its entire documentation but I didn't find the explanation. Thanks! Xinling

xinlingl commented 6 years ago

Hi Kerstin, How is normal_logR calculated? In the example file that's provided in GitHub of ASCAT, there are a lot of items that are non-zero. In the pose above, you said that the logR of the matched normal is afterwards simply set to 0. How about those that don't match? Thanks a lot! Xinling

haasek commented 6 years ago

ASCAT has been designed and optimised to run on arrays and the shift to enable it to run on sequencing data was integrated at a later time. Not every sequencing project is set up with ideal conditions to get clean copy number, though. Especially exome sequencing projects can lack the genome wide coverage to produce a high enough resolution for satisfactory overall copy number. I assume that is why ascatNGS is discouraging use with whole exome data.

xinlingl commented 6 years ago

Hi Kerstin, Thanks a lot! Xinling

haasek commented 6 years ago

Hi Xinling, as mentioned above, the core ASCAT pipeline was designed for array data. So the examples in the current package are supposed to represent logR and BAF values as obtained from, for example, SNP6 data. When producing logR and BAF from sequencing input, you can set the normal logR to zero, as it is not needed downstream.

xinlingl commented 6 years ago

Hi Kerstin, Ok! Thanks. I got the following tables from alleleCount by using normal BAM file and loci generated from somatic mutation vcf file and using tumor BAM file and the loci respectively. To get tumor_logR, what is the read counts in the tumor sample and normal sample in this case? Is the read counts sum of count_A, count_C, countG, and count_T for each position? Also, I think the maximum count is the max of those four counts in each position and complete coverage of the locus is good_depth in this case. In addition, since there are some lines with 5 0s, I believe that I should get rid of those lines before doing the calculation. Am I right? Thanks a lot!
CHR POS Count_A Count_C Count_G Count_T Good_depth chr1 15117 0 7 0 0 7 chr1 16634 0 0 0 0 0 chr1 16682 0 0 0 0 0 chr1 16949 0 0 0 0 0 chr1 17398 9 94 0 0 103 chr1 63304 0 18 0 0 18 chr1 741258 0 31 0 0 31 chr1 762330 0 0 31 0 31 chr1 869091 115 0 0 0 115 …
chrY … … … … … …

CHR POS Count_A Count_C Count_G Count_T Good_depth chr1 15117 0 6 0 1 7 chr1 16634 0 0 0 0 0 chr1 16682 0 0 0 0 0 chr1 16949 0 0 0 0 0 chr1 17398 8 97 0 0 105 chr1 63304 0 2 0 0 2 chr1 741258 0 14 0 0 14 chr1 762330 0 0 43 0 43 chr1 809091 89 0 0 0 89 …
chrY … … … … … …

haasek commented 6 years ago

Hi Xinling, the problem when using only the somatic mutation calls is that ASCAT needs germline heterozyguous SNPs to work on. So I would assume that your somatic mutations will mostly show up as homozyguous in the germline and will thus be excluded downstream. Apart from that, yes, the coverage per position is the sum over the four nucleotide counts which should be identical with the last column for the majority of SNPs. The maximum count is the most abundant nucleotide per row. And you are right, you don't have any information on loci for which you don't have coverage, so it makes sense to exclude those.

xinlingl commented 6 years ago

Hi Kerstin, Why is homozygous mutation removed in ASCAT? Could you explain the biological meaning behind it? I'm using ASCAT to generate the input of PyClone which is a tool to cluster somatic mutation based on major copy number, minor copy number, and allelic counts. I saw some papers using PyClone says that germline variants are removed since they will be assigned to ancestral clone and won't provide additional information in clonal construction, so I only use somatic mutation vcf file and the normal and tumor bam files from whole exome sequencing. Also, in the previous pose, you mentioned that tumor logR needs to be divide by mean_tumor_logR. Is mean_tumor_logR the average logR across all the locus in one sample? Thanks a lot! Xinling

haasek commented 6 years ago

We are using germline heterozyguous SNPs as we can observe deviations in the BAF of the tumour to gain information about gains and losses from them. If you take the germline homzyguous positions, you can't derive any allele specific information from them. To do the clustering in PyClone it makes sense to only work on somatic variants, but for a good copy number profile it helps to run on a large number of SNPs. You can use the VAFs you have now as input and a copy number profile that has been created from a different set of loci. Yes, the mean would be across all positions, to centre the logR around 0.

xinlingl commented 6 years ago

Ok! Thanks a lot!

xinlingl commented 6 years ago

I tried to run ASCAT using Tumor_logR, Tumor_BAF, normal_logR, and normal_BAF from one sample, but I always get an error saying that scripts out of bound. I checked the R code and I find out since there is only one sample, dim() in the R code can't be defined. I set data of second sample to 0s in all four files, but when I run ascat.runAscat(), I always get an error saying m[I,2] out of bounds. I also tried copy the value in sample 1 to sample 2 in all four files, but I get an error when running ascat.aspcf() which says that ghs[sample] scripts out of bound. I also tried changing the ASCATobj$samples[arraynr] in the for loop of runASCAT(), but I still get the error which says m[I,2] out of bound. Could you give me some suggestions about how to run ASCAT on one sample? Thank you very much! Xinling

xinlingl commented 6 years ago

Hi Kerstin, I figured out the problem. Thanks a lot for your reply. I really appreciate your help! :) Xinling

haasek commented 6 years ago

Great, happy to hear that you got it to work! I guess there were some formatting issues in the input data? Just to clarify for others reading here, ASCAT can be run on a single sample as well as multiples.

xinlingl commented 6 years ago

Hi Kerstin, When running ascat on one of the samples, there is no output for it. What could be the possible reason for that? Ascat works for all other samples. Thanks a lot! Sincerely, Xinling

xinlingl commented 6 years ago

I attach S1-BAF.PCFed.txt and S1.LogR.PCFed.txt of that sample below. Thanks! piecture_2 picture_3

haasek commented 6 years ago

Hi Xinling,

do you get an error message for that sample? If ASCAT can't fit a copy number profile that matches the input data well enough, it won't return a result, but it would print that it couldn't find an optimal solution. Also, is that the complete file for the segmented BAF? If there are only 32 SNPs left, it won't be possible for ASCAT to produce a good copy number profile.

Best wishes, Kerstin

xinlingl commented 6 years ago

Hi Kerstin, I don't get any error message for that sample. The output is empty. Yes, the segmented BAF only has 32 SNPs. Is it because most of the SNPs are not germline heterozygous so ASCAT can't produce meaningful result? Thanks a lot! Xinling Li

xinlingl commented 6 years ago

I think I didn't get any error message because the output file contains copy number information of sample 2 but not sample 1. I created sample 2 data by myself and I just want to see the result of sample 1, so I ignored the output about sample 2.

xinlingl commented 6 years ago

Will the number in sample 2 influence the result of sample 1 or they are analyzed separately by ascat? Thanks! Xinling Li

haasek commented 6 years ago

Hi Xinling,

you are right, ASCAT needs germline heterozyguous SNPs for the BAF segmentation. So if you are left with too few after filtering of the homozyguous ones, the downstream analysis won't work. If you are running the standard aspcf function, then each sample is analysed separately. Only if you use asmultipcf, ASCAT will assume that your samples are related and perform a combined segmentation.

Best wishes, Kerstin

xinlingl commented 6 years ago

Hi Kerstin, Ok! I'm using the standard aspcf function. Thanks a lot for your help! Sincerely, Xinling Li

xinlingl commented 6 years ago

Hi Kerstin, For the output of ASCAT, the majority of rounded minor copy number are 0. Is it because after rounding, the small number all round to 0 or is there any other issues for my input? Thanks! Sincerely, Xinling Li

xinlingl commented 6 years ago

Hi Kerstin, In the previous pose, you said that the initial raw data plots are usually quite telling if it is possible to use a sample for ASCAT analysis. What are the x and y axis of the plot? What does the plot look like if the sample is ok for ASCAT analysis? Could you provide an example plot? Thanks a lot! Sincerely, Xinling Li

haasek commented 6 years ago

Hi Xinling, without seeing the raw data plots, it's hard to say if there are any issues with your input data. If you have a lot of segments with a minor copy number of 0, it just means that for that section one allele was lost. It should be visible in the BAF plot that you have many homozyguous stretches.

The raw data plots show the SNPs along the genome on the x-axis and the logR signal and B-allele frequency on the y-axis for the top and bottom plot, respectively. Have you followed the example pipeline provided with the package? It should produce quite a few raw data plots. In short, the logR shouldn't be too noisy, i.e. gains and losses should be visible and for the BAF the separate bands should be recognisable.

Best wishes, Kerstin

xinlingl commented 6 years ago

Hi Kerstin, Thanks a lot! Also, is the major copy number in ascat the major parental copy number predicted from the tumor sample and minor copy number in ascat the minor parental copy number predicted from the tumor sample? I'm wondering if I can directly use the output of ascat as input of PyClone, since I'm not sure whether the definition of major and minor copy number are the same. Thanks! Sincerely, Xinling Li

xinlingl commented 6 years ago

Also, I noticed that for some segments, both major and minor copy number are 0. What could be the possible reason for that? Thanks a lot!

haasek commented 6 years ago

Hi Xinling, major and minor copy number do reflect the amount of the two alleles in the tumour sample. I'm not familiar with the input requirements of PyClone, but if it requires the major and minor clonal copy number for genomic segments, then the ASCAT output should be fine. Please be aware that ASCAT doesn't infer subclonal copy number, i.e. mixtures of multiple copy number states. If a segment has zero copies, that would be a homozygous deletion. These are possible, but if the copy number profile contains a large fraction of them, it might be worth checking the input data. Best wishes, Kerstin

xinlingl commented 6 years ago

Hi Kerstin, Ok! Thanks a lot! Sincerely, Xinling Li

xinlingl commented 6 years ago

Hi Kerstin, Is normal_BAF needed in the downstream analysis of ASCAT? I want to run my pipeline on simulation data, but the data only contains tumor_logR and tumor_BAF. In the previous pose, you said that we can set normal_logR to 0. How about normal_BAF? Thanks a lot! Sincerely, Xinling Li

haasek commented 6 years ago

Hi Xinling, ASCAT works on germline heterozyguous SNPs, so it needs the normal BAF to determine which SNPs to select. Best wishes, Kerstin

kentsing commented 5 years ago

hi Kerstin If i don't have a matched normal sample, can i calculated logR and BAF from the sequencing data?

haasek commented 5 years ago

Hi, unfortunately, the current approach to derive LogR and BAF from sequencing data relies on having a matched normal sample. We currently only have an option to predict the germline genotype for SNP array data. Best wishes, Kerstin

dingxm commented 5 years ago

Hi haasek, We are tyring to analyze exome data for copy number. The exome data has extended probes covering regions outside of the exomes. It seems that last year ASCAT could not use exome data without pre-processing, is this still true or is there a version that deals with Exome data now?

Also, can pooled normals be used in place of matched normal? If so, how could this be done.

Thank You,

haasek commented 5 years ago

Hi Dingxm,

we are still working on the integrated exome version of ASCAT. But in general, it will be an implementation of the above described approach. Unfortunately, you will need a matching germline sample to derive correct LogR and BAF values for the downstream analysis and can't use a pool of normals.

Best wishes, Kerstin

vyellapa commented 5 years ago

Hello Kerstin, Thank you for all your responses. In the LogR calculation, the suggestion to use tumour_logR=countTumour/countNormal does not account for differences in sequencing depth or GC content. Does ASCAT account for this later on? If not how could I calculate the LogR values to account for this?

Thanks, Teja

yftang commented 4 years ago

Hi Tommy, Unfortunately, we don’t have a polished pipeline to use sequencing data as ASCAT input just yet, but we are working on an integration for the ASCAT package. I can describe the workflow that we are planning to integrate which has worked quite well so far. Assuming that you start from the bam file, this is how we usually go about it: We use alleleCount (https://github.com/cancerit/alleleCount) to run on bam files, with the SNP loci from the 1000 genomes project. To obtain the logR, we then use the read counts in the tumour sample (=countTumour) and in the normal (=countNormal). In a first step we just divide the tumour by the normal counts: tumour_logR=countTumour/countNormal. Then we transform the value into logarithm while at the same time normalising by the mean, as that helps centering it around 0: log2(tumour_logR/mean(tumour_logR)). The logR of the matched normal is afterwards simply set to 0. As alleleCount returns all counts, the BAF is calculated as the ratio between the maximum count and the complete coverage of the locus. Afterwards you should have tumour and normal LogR and BAF values that you can use in an ASCAT analysis.

After running the steps mentioned above, you have logR and BAF files which you can use in ASCAT analogous to array data. So we just add some pre-processing to create the correct input from the sequencing data, the ASCAT functions stay the same. Just make sure to set the gamma parameter to 1 when working with sequencing data.

Best wishes, Kerstin

Hi Kerstin,

Does it make any sense to only require SNPs in the exome region? In my case, I ran bedtools intersect to exclude 1000G SNPs outside of Agilent V6 target region.

But after calculating to generate BAF and LogR files according to your suggestion, although most samples passed the process and result in certain ploidy and purity figures, some fails. These are the *.germline.png and *.tumor.png files of one of failed samples.

A1 tumor

A1 germline

and these are of passed samples.

L1 germline L1 tumor

I don't see significant difference between these two samples, could you please help?

Error message is shown as:

In runASCAT(lrr, baf, lrrsegm, bafsegm, ASCATobj$gender[arraynr],  :
  ASCAT could not find an optimal ploidy and cellularity value for sample A1.
tlesluyes commented 4 years ago

Hi,

Makes total sense to remove SNPs outside of regions that have been sequenced (like whole-exome).

However, something is going wrong with your BAF track here because you don't have any point from 0 to 0.5. From alleleCounter output, you may have used the 'maximum count' as indicated in a previous post (which can be confusing), but, instead, you have to use ref and alt information from 1000G. BAF is refined by alt/(alt+ref) ratio so you have to compute such information and give it to ASCAT. Until this has been corrected, please don't interpret any results with these profiles, even though ASCAT provided solutions for some of those.

Cheers,

Tom.

yftang commented 4 years ago

Ahh! thanks Tom,

I could have made way bigger mistake. I will try alt/(alt+ref) to generate BAF.

Best, Sam

Krishna4369 commented 3 years ago

Hi, I am getting below message from ASCAT Warning message: In runASCAT(lrr, baf, lrrsegm, bafsegm, ASCATobj$gender[arraynr], : ASCAT could not find an optimal ploidy and cellularity value for sample 11797_R.

My generated raw plots are 11797_R tumour 11797_R ASPCF 11797_R sunrise

I am getting blank ASCAT output file can we conclude anything from the images generated?

wolverinelei commented 1 year ago

Hi Tommy, Unfortunately, we don’t have a polished pipeline to use sequencing data as ASCAT input just yet, but we are working on an integration for the ASCAT package. I can describe the workflow that we are planning to integrate which has worked quite well so far. Assuming that you start from the bam file, this is how we usually go about it: We use alleleCount (https://github.com/cancerit/alleleCount) to run on bam files, with the SNP loci from the 1000 genomes project. To obtain the logR, we then use the read counts in the tumour sample (=countTumour) and in the normal (=countNormal). In a first step we just divide the tumour by the normal counts: tumour_logR=countTumour/countNormal. Then we transform the value into logarithm while at the same time normalising by the mean, as that helps centering it around 0: log2(tumour_logR/mean(tumour_logR)). The logR of the matched normal is afterwards simply set to 0. As alleleCount returns all counts, the BAF is calculated as the ratio between the maximum count and the complete coverage of the locus. Afterwards you should have tumour and normal LogR and BAF values that you can use in an ASCAT analysis.

After running the steps mentioned above, you have logR and BAF files which you can use in ASCAT analogous to array data. So we just add some pre-processing to create the correct input from the sequencing data, the ASCAT functions stay the same. Just make sure to set the gamma parameter to 1 when working with sequencing data.

Best wishes, Kerstin

Hi, Kerstin You say that tumor_logR=countTumour/countNormal. I try to find a way to caculate logR without Normal bam files .Could you give me some advice? Best wishes, Yang

tlesluyes commented 1 year ago

Hi @wolverinelei,

With regard to sequencing data, ASCAT only processes T/N pairs. This is because deriving logR from tumour-only and for individual SNPs is really tricky and would require a bespoke methodology that we don't have in place at the moment. That being said, there are a few methods out there that are able to derive logR from tumour-only but have a binning step. If you'd be happy with that, you might consider trying our ASCAT.sc tool: https://github.com/VanLoo-lab/ASCAT.sc (documentation under the vignettes folder).

Cheers,

Tom.