WaveCNV / WaveCNV-caller

Cancer specific CNV caller for Next Generation sequence
3 stars 2 forks source link

How to run with --cfrac 0.8 #3

Open sicotte opened 10 years ago

sicotte commented 10 years ago

When I specify the input purity, it forces the "cell" option. The error message says

You need more than 8 data samples. The number of data points must satisfy the relation N > 2xK**2 where K is the number of clusters. The smallest value for K is 2. at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 5204.

In my version, line 5204 is my ($ids, $centers) = $kmeans->kmeans();

I do have multiple samples.. so how do I specify the input files to get this to work.

bin/cnv_caller.pl $seg $vcf --sid s_tumor.100 --bam_list $tbam --rid s_normal.100 --rbam_list $nbam --lfrag 200 --merge --smooth --tmp $thisdir/tmp.100 --cfrac 0.8

carsonhh commented 10 years ago

Is this by any chance exome data? The error is basically caused by there not being enough variants to make needed global calculations work.

--Carson

From: sicotte notifications@github.com Reply-To: WaveCNV/WaveCNV-caller <reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git hub.com> Date: Monday, March 31, 2014 at 2:47 PM To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com Subject: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

When I specify the input purity, it forces the "cell" option. The error message says

You need more than 8 data samples. The number of data points must satisfy the relation N > 2xK**2 where K is the number of clusters. The smallest value for K is 2. at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 5204.

In my version, line 5204 is my ($ids, $centers) = $kmeans->kmeans();

I do have multiple samples.. so how do I specify the input files to get this to work.

bin/cnv_caller.pl $seg $vcf --sid s_tumor.100 --bam_list $tbam --rid s_normal.100 --rbam_list $nbam --lfrag 200 --merge --smooth --tmp $thisdir/tmp.100 --cfrac 0.8

— Reply to this email directly or view it on GitHub https://github.com/WaveCNV/WaveCNV-caller/issues/3 .

sicotte commented 10 years ago

It's 30X whole genome called using GATK. There are ~600-1.8M variants/sample. I commented out the temp file.. and it only puts 5 segments in the file. I'm trying to find where in estimate_copy_fraction and discovery_segments functions the segments are filtered to see if there is a cutoff I can change.

carsonhh commented 10 years ago

You may have issues with your sample IDs. They need to be identical with VCF and BAM files you are using. A mismatch could cause MAF values to all be empty for example.

Also whole genome would normally have 2-3M variants per sample. Do you really have samples with less than 600,000?

--Carson

From: sicotte notifications@github.com Reply-To: WaveCNV/WaveCNV-caller <reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git hub.com> Date: Monday, March 31, 2014 at 3:13 PM To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com Cc: Carson Holt carsonhh@gmail.com Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

It's 30X whole genome called using GATK. There are ~600-1.8M variants/sample. I commented out the temp file.. and it only puts 5 segments in the file. I'm trying to find where in estimate_copy_fraction, the segments are filtered to see if there is a cutoff I can change.

— Reply to this email directly or view it on GitHub https://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39142732 .

sicotte commented 10 years ago

Those 5 segments have higher than normal coverage. I checked that the bam files and vcf sampleids were the same.. and that the chromosome numbering of bam,vcf, and segments were all in "chr" format. The bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline. The variants are only called if they are in both the germline and tumor.. Yeah, I was surprised to only have ~600K variants so I double-checked the log files.

I created a new variable to define the minimum number of variants per segment.. (to 200 instead of the 1000 that was hardcoded in discovery_segments) .. and that worked.

carsonhh commented 10 years ago

I think your variant file might have been filtered to have somatic variants only which is not the right thing to do for CNV analysis. While it makes sense for SNV prioritization, having filters such as somatic filtering and dbSNP filtering remove useful information that is important for finding patterns in MAF, LOH, etc.

Thanks, Carson

From: sicotte notifications@github.com Reply-To: WaveCNV/WaveCNV-caller <reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git hub.com> Date: Monday, March 31, 2014 at 3:31 PM To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com Cc: Carson Holt carsonhh@gmail.com Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Those 5 segments have higher than normal coverage. I checked that the bam files and vcf sampleids were the same.. and that the chromosome numbering of bam,vcf, and segments were all in "chr" format. The bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline. The variants are only called if they are in both the germline and tumor.. Yeah, I was surprised too.. so I double-checked the log files.

I created a new variable to define the minimum number of variants per segment.. (to 200 instead of the 1000 that was hardcoded in discovery_segments) .. and that worked.

— Reply to this email directly or view it on GitHub https://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39144686 .

sicotte commented 10 years ago

You were right. I’ll need to recall the VCF, there are no SNV which have the same genotype in tumor and germline.

Nevertheless, there are still a couple of things that could be made to make the script more robust.

I got the two errors.

Use of uninitialized value $cmp in numeric eq (==) at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 2241. Use of uninitialized value $cmp in numeric eq (==) at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 2247.

Sort subroutine didn't return a numeric value at /data5/sicotte/tools/WaveCNV-caller/bin/cnv_caller.pl line 4866.

From: carsonhh [mailto:notifications@github.com] Sent: Monday, March 31, 2014 4:53 PM To: WaveCNV/WaveCNV-caller Cc: Sicotte, Hugues, Ph.D. Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

I think your variant file might have been filtered to have somatic variants only which is not the right thing to do for CNV analysis. While it makes sense for SNV prioritization, having filters such as somatic filtering and dbSNP filtering remove useful information that is important for finding patterns in MAF, LOH, etc.

Thanks, Carson

From: sicotte notifications@github.com<mailto:notifications@github.com> Reply-To: WaveCNV/WaveCNV-caller <reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git hub.com> Date: Monday, March 31, 2014 at 3:31 PM To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com<mailto:WaveCNV-caller@noreply.github.com> Cc: Carson Holt carsonhh@gmail.com<mailto:carsonhh@gmail.com> Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Those 5 segments have higher than normal coverage. I checked that the bam files and vcf sampleids were the same.. and that the chromosome numbering of bam,vcf, and segments were all in "chr" format. The bam file was from bwa.

Oh, it requires 2MB segments. Got 275 and 461 in tumor and germline. The variants are only called if they are in both the germline and tumor.. Yeah, I was surprised too.. so I double-checked the log files.

I created a new variable to define the minimum number of variants per segment.. (to 200 instead of the 1000 that was hardcoded in discovery_segments) .. and that worked.

— Reply to this email directly or view it on GitHub https://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39144686 .

— Reply to this email directly or view it on GitHubhttps://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39146914.

carsonhh commented 10 years ago

It's one of those things where I really just want to have more informative error messages as opposed to building in mechanisms that allow the code to continue beyond those failures.

sicotte commented 10 years ago

After recalling variants with GATK (and aside from the --fasta bug I reported), I was able to run WaveCNV.pl ..

carsonhh commented 10 years ago

Good to know. I'll look at those other issues.

sicotte commented 10 years ago

3 feedbacks: 1. In the “cell” mode, it roughtly gets the (known) purity in 19 out of 25 cases, the other times it’s way off. The linear model fit to maf doesn’t account for subclones which can really mess up the purity estimate even after you take CNV into accounts (Subclone-specific mutations drag the purity estimate down). We’re working on an update to purBayes to better compute purity taking subclones into account.

I notice the code always output an estimated purity as in “cell” mode. Does this affect the results? How is that estimate used?

  1. Do you have some documentation on the meaning of the output fields.

  2. I had one sample that kept not finishing without any error message.. just the Log file would get truncated: … e.g. [Mon Apr 7 09:59:42 2014] Processing: 8%

When I bumped the RAM from 15G to 20G then 31, it went further If I didn’t crank the RAM, it went no further. However, I am not convinced it’s a memory issue since the maxvmem usage in the run that worked said it only used up to 1.15G Maybe it’s an issue with the memory mapping library (MMapArray) WaveCNV needs (I had to fix a bug (which I reported) in the current version in CPAN before it would compile) when compiling one Perl module that was needed.

From: carsonhh [mailto:notifications@github.com] Sent: Monday, April 07, 2014 9:13 AM To: WaveCNV/WaveCNV-caller Cc: Sicotte, Hugues, Ph.D. Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Good to know. I'll look at those other issues.

— Reply to this email directly or view it on GitHubhttps://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39734550.

carsonhh commented 10 years ago

I've found that most of the time the cell mode value is different than the lab estimated purity, the lab estimate is wrong. At least this was true in pancreatic cancer where KRAS is used to estimate purity. Unfortunately all the lab tests assume diploid KRAS, and often it was triploid, tetraploid, or more (there is a selective advantage for KRAS duplication). We fortunately had cell lines in addition to primary tumor in those cases, to validate which calls were more correct. Also the number I give is more correctly called copy fraction than purity or cellularity (it is the ratio of reads representing 1 copy in the tumor compared to 1 copy in the normal contamination - if a tumor is triploid etc. it will not be proportional to cellularity as that is a cell-to-cell ratio not a copy-to-copy ratio). I'm not saying that the value you get with the -cell flag is always right, but lab estimates that use specific genes or markers always way under estimate contamination when there is a ploidy issue with a marker. In general sub clones usually aren't different enough or abundant enough to throw off copy fraction estimates. We've actually been able to use regions that vary from expected MAF given estimated copy fraction to identify different sub clones that dominate in xenograft passages (thus allowing us to validate the % of each sub clone in the primary). In our samples, it was primarily 1 copy LOH in one subclone mixed with 2 copy LOH in the other that cause the biggest variance. These types of regions were also the most frequent type of sub clonal CNV-LOH difference we saw (but still sub clones overwhelmingly matched in more regions than they differed on 20 samples).

The copy fraction value affects the expected MAF expect at different CN states.

No documentation is available other than the WaveCNV publication and supplementary material. I plan on going back to update the code sometime in the next couple of months. There are a number of outstanding issues (basically I was trying to get the paper out before moving to a new institution, so it's not as robust, feature rich, or as well documented as I'd like it to be).

If one of your samples has a lot of segments (100,000+), it can fail because I implement some things using berkley DB to make the script restartable, and it fails frequently when the number of entries in the DB is too high. That might be the issue you are experiencing.

--Carson

From: sicotte notifications@github.com Reply-To: WaveCNV/WaveCNV-caller <reply+i-30547990-b7107839b5ff9e4e96ef2e7d590e5e3f07d88552-4742143@reply.git hub.com> Date: Monday, April 7, 2014 at 9:40 AM To: WaveCNV/WaveCNV-caller WaveCNV-caller@noreply.github.com Cc: Carson Holt carsonhh@gmail.com Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

3 feedbacks:

  1. In the “cell” mode, it roughtly gets the (known) purity in 19 out of 25 cases, the other times it’s way off. The linear model fit to maf doesn’t account for subclones which can really mess up the purity estimate even after you take CNV into accounts (Subclone-specific mutations drag the purity estimate down). We’re working on an update to purBayes to better compute purity taking subclones into account.

I notice the code always output an estimated purity as in “cell” mode. Does this affect the results? How is that estimate used?

  1. Do you have some documentation on the meaning of the output fields.
  2. I had one sample that kept not finishing without any error message.. just the Log file would get truncated: … e.g. [Mon Apr 7 09:59:42 2014] Processing: 8%

When I bumped the RAM from 15G to 20G then 31, it went further If I didn’t crank the RAM, it went no further. However, I am not convinced it’s a memory issue since the maxvmem usage in the run that worked said it only used up to 1.15G Maybe it’s an issue with the memory mapping library (MMapArray) WaveCNV needs (I had to fix a bug (which I reported) in the current version in CPAN before it would compile) when compiling one Perl module that was needed.

From: carsonhh [mailto:notifications@github.com] Sent: Monday, April 07, 2014 9:13 AM To: WaveCNV/WaveCNV-caller Cc: Sicotte, Hugues, Ph.D. Subject: Re: [WaveCNV-caller] How to run with --cfrac 0.8 (#3)

Good to know. I'll look at those other issues.

— Reply to this email directly or view it on GitHubhttps://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39734 550.

— Reply to this email directly or view it on GitHub https://github.com/WaveCNV/WaveCNV-caller/issues/3#issuecomment-39746133 .