andrej-fischer / cloneHD

High-definition reconstruction of clonal composition from next-generation sequencing data
GNU General Public License v3.0

cloneHD error #20

Closed: afichtenholtz closed this issue 8 years ago

afichtenholtz commented 8 years ago

Trying to run cloneHD using my files generated in the filterHD step. Running into an error without much explanation:

./build/cloneHD --cna tumor.exome.cloneHD.noY --chr normal.exome.cloneHD.noY --baf tumor.baf --pre output/tumor --bias normal.cna.posterior-1.txt --seed 123 --trials 2 --nmax 3 --force --max-tcn 4 --cna-jumps tumor.baf.jumps.txt --min-jump 0.01 --restarts 10 --mass-gauging 1

cloneHD: probabilistic inference of sub-clonality using...

CNA data in tumor.exome.cloneHD.noY: 193717 sites in 23 chr across 1 samples
BAF data in tumor.baf: 31724 sites in 23 chr across 1 samples

Aborted (core dumped)

Where to go from here?

ivazquez commented 8 years ago

It looks like you are over-segmenting the CNA and BAF tumour data (193717 and 31724 sites). You would not typically expect that many copy-number breakpoints in most cancer types.

CNA data in tumor.exome.cloneHD.noY: 193717 sites in 23 chr across 1 samples
BAF data in tumor.baf: 31724 sites in 23 chr across 1 samples

If you haven't done so yet, it might be worth using the pre-filter binary on the CNA files to exclude regions that are highly variable in read depth. You can start with the default pre-filter parameters (see documentation here). For BAF files, you may need to exclude segments with a frequency that differs greatly from 0.5.

Finally, you should be cautious in estimating the bias field from normal CNA with exome data, as the read depth is very noisy due to the exome capture baits. Instead, you can try providing the BAF jumps for both the CNA and BAF data, since BAF jumps are less sensitive to coverage bias: pass the BAF jumps file to both --cna-jumps and --baf-jumps.
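As a sketch only, reusing the flags and file names from your command above, with the BAF jumps file passed to both jump options and the exome-derived --bias field dropped per the caution above, that could look like:

$ ./build/cloneHD --cna tumor.exome.cloneHD.noY --chr normal.exome.cloneHD.noY --baf tumor.baf --pre output/tumor --cna-jumps tumor.baf.jumps.txt --baf-jumps tumor.baf.jumps.txt --seed 123 --trials 2 --nmax 3 --force --max-tcn 4 --min-jump 0.01 --restarts 10 --mass-gauging 1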

CC: @tsunpo, who ran into a similar issue.

afichtenholtz commented 8 years ago

Re-ran the pipeline using pre-filter on CNA data:

1) $ ./build/pre-filter --data normal.exome.cloneHD.noY --pre output/pre/normal --print-tracks 1 --window-size 100 --remove-outlier 3.0 --remove-variable 2.0

2) $ ./build/pre-filter --data tumor.exome.cloneHD.noY --pre output/pre/tumor --print-tracks 1 --window-size 100 --remove-outlier 3.0 --remove-variable 2.0

3) $ ./build/filterHD --data output/pre/normal.pref.txt --mode 3 --pre output/normal.cna

filterHD: Fitting jump-diffusion model to data at 93230 loci in 23 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 3.101e+01 < x < 1.480e+02
eval  jump         sigma        rnd          -llh
10    4.30036e-08  1.33689e-01  1.53180e-03  4.1001755783e+05
20    1.73556e-08  2.39966e-01  1.51230e-03  4.0953961210e+05
30    1.03552e-09  1.58394e-01  2.11519e-02  4.0903191056e+05
40    1.04186e-09  1.42321e-01  2.32230e-02  4.0898980591e+05
50    1.04859e-09  1.26858e-01  2.56449e-02  4.0896732846e+05
60    1.03352e-09  1.14154e-01  2.87621e-02  4.0895042858e+05
70    1.03066e-09  1.16723e-01  2.82939e-02  4.0894723188e+05
80    1.03066e-09  1.16723e-01  2.82939e-02  4.0894723188e+05
Filtered sample 1 of 1: llh = -4.08947e+05, gof = 8.86320e+00, --jump 1.031e-09 --sigma 1.163e-01 --rnd 2.839e-02

4) $ ./build/filterHD --data output/pre/tumor.pref.txt --mode 3 --pre output/tumor.cna

filterHD: Fitting jump-diffusion model to data at 92406 loci in 23 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 3.201e+01 < x < 1.600e+02
eval  jump         sigma        rnd          -llh
10    5.21625e-09  3.09970e-01  4.44360e-06  4.0698145911e+05
20    4.95740e-09  1.95732e-01  2.96055e-06  4.0664311073e+05
30    4.59820e-09  1.82775e-01  3.50319e-06  4.0663900088e+05
Filtered sample 1 of 1: llh = -4.06639e+05, gof = 7.93804e+00, --jump 4.591e-09 --sigma 1.827e-01 --rnd 3.499e-06

5) $ ./build/filterHD --data output/pre/tumor.pref.txt --mode 3 --pre output/tumor.cna.bias --bias output/normal.cna.posterior-1.txt --sigma 0 --jumps 1

filterHD: Fitting jump-diffusion model to data at 92406 loci in 23 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 4.840e-01 < x < 2.044e+00, 3.201e+01 < y < 1.600e+02
eval  jump         rnd          -llh
10    1.50665e-08  3.36887e-02  3.8814320156e+05
20    1.52549e-07  4.93270e-02  3.8770199465e+05
30    1.56011e-07  4.80623e-02  3.8770128936e+05
Filtered sample 1 of 1: llh = -3.87701e+05, gof = 9.80402e+00, --jump 1.563e-07 --sigma 0.000e+00 --rnd 4.800e-02

6) $ ./build/filterHD --data tumor.baf --mode 1 --pre output/tumor.baf --sigma 0 --jumps 1 --reflect 1 --dist 1

filterHD: Fitting jump-diffusion model to data at 31724 loci in 23 segment(s) in 1 sample(s) with a binomial emission model:

Filtering sample 1 of 1: Initial range is 0.000e+00 < x < 1.000e+00
eval  jump         rnd          -llh
10    3.64387e-07  5.20308e-02  9.7510598308e+04
20    2.00906e-07  9.48803e-02  9.7295227597e+04
30    1.86088e-07  8.93843e-02  9.7290331016e+04
Filtered sample 1 of 1: llh = -9.72903e+04, gof = 6.02729e-02, --jump 1.870e-07 --sigma 0.000e+00 --rnd 8.917e-02

7) $ ./build/cloneHD --cna output/pre/tumor.pref.txt --chr output/pre/normal.pref.txt --baf tumor.baf --pre output/tumor --bias output/normal.cna.posterior-1.txt --seed 123 --trials 2 --nmax 3 --force --max-tcn 4 --cna-jumps output/tumor.cna.bias.jumps.txt --baf-jumps output/tumor.baf.jumps.txt --min-jump 0.01 --restarts 10 --mass-gauging 1

cloneHD: probabilistic inference of sub-clonality using...

CNA data in output/pre/tumor.pref.txt: 92406 sites in 23 chr across 1 samples
BAF data in /home/afichtenholtz/Downloads/2014-2071R_recal.baf: 31724 sites in 23 chr across 1 samples

Aborted (core dumped)

As you can see, pre-filter does reduce the segmentation, but the sample still seems heavily over-segmented. Do you think adjusting the pre-filter parameters can get the sample to a usable state?

I'm not sure I understand the advice about the segments with 'deviant' allele frequencies. My sample isn't diploid; does that mean I won't have any BAF data for the majority of my segments?

Any other advice would be appreciated.

ivazquez commented 8 years ago

The memory allocation error is caused by the normal and tumour CNA files being defined on different grids. This is most likely because different regions were excluded from each sample in the pre-filter step. You can get around it by running pre-filter on the normal CNA first, and then running pre-filter again on the tumour CNA with the --pick-from/--match-to flags.

About the BAF data, you typically expect frequencies at [0.0, 0.5, 1.0] and any deviations from normal copy number will open up new intermediate genotypes between these. Are you working with non-human data such that your normal sample is not diploid? You should still have many informative jumps if you keep segments within e.g. 0.2 <= BAF <= 0.8.
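As a rough illustration only, assuming each line of the BAF file has chromosome, position, variant read count and total depth in the first four columns (your format may differ), a filter like this would keep the sites with 0.2 <= BAF <= 0.8:

$ awk '$4 > 0 && $3/$4 >= 0.2 && $3/$4 <= 0.8' tumor.baf > tumor.baf.filtered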

afichtenholtz commented 8 years ago

Using the --pick-from/--match-to flags with a larger window size seems to improve the situation, but I'm still running into the fatal error (note that I also took your suggestion and filtered the BAF file so it only contains sites with 0.2 <= BAF <= 0.8):

1) $ ./build/pre-filter --data normal.depth.noXY --pre output/pre/normal --print-tracks 1 --window-size 1000 --remove-outlier 3.0 --remove-variable 2.0

2) $ ./build/pre-filter --pre output/pre/tumor --print-tracks 1 --window-size 1000 --remove-outlier 3.0 --remove-variable 2.0 --pick-from tumor.depth.noXY --match-to output/pre/normal.pref.txt

3) $ ./build/filterHD --data output/pre/normal.pref.txt --mode 3 --pre output/normal.cna

filterHD: Fitting jump-diffusion model to data at 87248 loci in 22 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 3.801e+01 < x < 1.410e+02
eval  jump         sigma        rnd          -llh
10    3.16427e-06  5.39795e-03  4.53424e-05  3.8898103826e+05
20    3.37334e-06  5.20514e-03  6.59411e-05  3.8896386047e+05
30    3.38281e-06  5.16188e-03  6.73038e-05  3.8895814738e+05
Filtered sample 1 of 1: llh = -3.88958e+05, gof = 1.10985e+01, --jump 3.383e-06 --sigma 5.162e-03 --rnd 6.743e-05

4) $ ./build/filterHD --data output/pre/tumor.pref.txt --mode 3 --pre output/tumor.cna

filterHD: Fitting jump-diffusion model to data at 87257 loci in 22 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 3.201e+01 < x < 4.270e+02
eval  jump         sigma        rnd          -llh
10    7.92376e-07  1.35703e-02  8.69023e-06  4.1748878335e+05
20    8.03939e-07  1.36397e-02  1.10851e-05  4.1747185920e+05
30    8.04270e-07  1.36499e-02  1.10140e-05  4.1746880536e+05
Filtered sample 1 of 1: llh = -4.17468e+05, gof = 1.14435e+01, --jump 8.039e-07 --sigma 1.365e-02 --rnd 1.104e-05

5) $ ./build/filterHD --data output/pre/tumor.pref.txt --mode 3 --pre output/tumor.cna.bias --bias output/normal.cna.posterior-1.txt --sigma 0 --jumps 1

filterHD: Fitting jump-diffusion model to data at 87257 loci in 22 segment(s) in 1 sample(s) with a poisson emission model:

Filtering sample 1 of 1: Initial range is 4.556e-01 < x < 5.963e+00, 3.201e+01 < y < 4.270e+02
eval  jump         rnd          -llh
10    2.48514e-09  6.58697e-02  4.0231009620e+05
20    2.74775e-08  5.76677e-02  4.0192382306e+05
30    9.48822e-08  4.90480e-02  4.0180854973e+05
40    9.02748e-08  4.90354e-02  4.0180833715e+05
Filtered sample 1 of 1: llh = -4.01808e+05, gof = 1.30558e+01, --jump 9.036e-08 --sigma 0.000e+00 --rnd 4.902e-02

6) $ ./build/filterHD --data tumor.baf.noXY --mode 1 --pre output/tumor.baf --sigma 0 --jumps 1 --reflect 1 --dist 1

filterHD: Fitting jump-diffusion model to data at 26287 loci in 22 segment(s) in 1 sample(s) with a binomial emission model:

Filtering sample 1 of 1: Initial range is 0.000e+00 < x < 1.000e+00
eval  jump         rnd          -llh
10    1.57175e-08  1.18743e-02  7.1361597157e+04
20    3.94329e-08  1.05387e-02  7.1334968211e+04
Filtered sample 1 of 1: llh = -7.13349e+04, gof = 4.62444e-02, --jump 4.081e-08 --sigma 0.000e+00 --rnd 1.054e-02

7) $ ./build/cloneHD --cna output/pre/tumor.pref.txt --chr output/pre/normal.pref.txt --baf tumor.baf.noXY --pre output/tumor --bias output/normal.cna.posterior-1.txt --seed 123 --trials 2 --nmax 3 --force --max-tcn 4 --cna-jumps output/tumor.cna.bias.jumps.txt --baf-jumps output/tumor.baf.jumps.txt --min-jump 0.01 --restarts 10 --mass-gauging 1

cloneHD: probabilistic inference of sub-clonality using...

CNA data in output/pre/tumor.pref.txt: 87257 sites in 22 chr across 1 samples
BAF data in tumor.baf.noXY: 26287 sites in 22 chr across 1 samples

Aborted (core dumped)

Seems like there is still a slight discrepancy between the grid used for the normal (87248 loci) and the one used for the tumour (87257 loci). Is this what is causing the error? Do you recommend trying to 'manually' align the grids at this point (i.e. editing the output of the pre-filter command)?

ivazquez commented 8 years ago

That looks better, though as you say the remaining differences between the grids are still causing the same error. Do you know anything more specific about the offending coordinates? Just to be sure, it may be worth checking that you are not using older files defined on different grids when you run filterHD in steps 3, 4 and 5.
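As an illustration only, one quick way to list the offending coordinates is to print the chromosome/position pairs that appear in only one of the two pre-filter outputs, assuming chromosome and position are the first two columns (adapt as needed):

$ comm -3 <(awk '{print $1, $2}' output/pre/normal.pref.txt | sort) <(awk '{print $1, $2}' output/pre/tumor.pref.txt | sort)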

It is difficult to know whether you are still over-segmenting, and how best to move forward, without seeing the technical variability. What I would try first is running cloneHD with the BAF jumps only, for both --cna-jumps and --baf-jumps, instead of estimating the tumour CNA bias and providing the tumour CNA jumps as you do now.

afichtenholtz commented 8 years ago

I have made some progress on this issue. I got cloneHD to process my sample after addressing the following two things:

1) I noticed in your example that the b-allele file is defined on the same grid as the depth files. This wasn't true for my data, so I recalculated the depth files so that they contain the coordinates present in my b-allele file (see the sketch after this list).

2) I was referencing the normal bias file for cloneHD's --bias flag, whereas I should have been referencing the tumor bias file.
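For reference, one way to check which b-allele sites have no matching entry in a depth file (a sketch only, assuming chromosome and position are the first two columns of both files and using the file names from the run above):

$ awk 'NR==FNR { have[$1" "$2] = 1; next } !(($1" "$2) in have)' tumor.depth.noXY tumor.baf.noXY

Any lines printed are BAF coordinates missing from the depth grid.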

Although I don't know which of these two fixed the segfault, cloneHD now processes this sample, and I consider this issue closed.