ARUP-NGS / cobalt

Cobalt is a tool to detect copy-number variants from next-generation sequencing (NGS) data.
GNU General Public License v3.0

empty output bed file #1

Closed: ramujana closed this issue 5 years ago

ramujana commented 5 years ago

Hi, I really liked your elegant code and smooth integration process. However, after all steps executed correctly, I got an empty output BED file from CNV detection, containing just a header. I used 35 samples for training, with 750 regions included, only 17 of which were masked. I'm running Ubuntu on Windows. Could you please provide some clues?

P.S. The final invocation and its output looked like this:

    $ cobalt predict -m TS_CANCER_MODEL.model -d SAMPLE_COVERAGE.bed -o SAMPLE_CNV.bed
    2019-05-15 13:00:36 --> Loading model from TS_CANCER_MODEL.model
    2019-05-15 13:00:36 --> Read depths for 750 regions for 1 samples
    2019-05-15 13:00:36 --> Mean depth for sample H19-94VHL.bam : 572.89 (sd: 757.54)
    2019-05-15 13:00:36 --> Beginning prediction run with alpha = 0.005, beta = 0.005 and min_output_quality = 0.8
    2019-05-15 13:00:36 --> Applying region mask, removing 17 targets
    2019-05-15 13:00:36 --> Determining autosomal copy number states and qualities
    2019-05-15 13:00:36 --> Precomputing observation probs...
    2019-05-15 13:00:36 --> No X chromosome regions found, not calling CNVs on X chromosome
    2019-05-15 13:00:36 --> No Y chromosome regions found, not calling CNVs on X chromosome
    2019-05-15 13:00:36 --> Done computing probabilities, emitting output to SAMPLE_CNV.bed

brendanofallon commented 5 years ago

Hi there, there are a few possibilities here, but I think the first question is: are you looking for a germline variant? The names of your input files suggest that you might be using somatic data. Since Cobalt is really designed to look for germline CNVs, it's unlikely to pick up somatic variation unless there's a really profound alteration in read depth (on the order of 50%).

If you're indeed looking for germline variation, then the next step is to investigate the target-specific info by running the prediction again with the "--target-info [filename]" option. This writes some diagnostic information, including log2s and a mean copy number estimate, for every target. If the CNV you're expecting is just on the verge of detection, you might see some variability in the log2s near the region, which would indicate that there's indeed a signal but it didn't quite make the cutoff for calling the CNV.

Lastly, there are a couple of options that can increase sensitivity. First, you can set the minimum quality cutoff to something smaller than the default value of 0.9 using the "--min-qual [x]" option; setting it to 0.7 or 0.6 will cause cobalt to emit calls with lower quality values. Second, the "-a [x]" and "-b [x]" options can be set to higher values to increase sensitivity; consider trying 0.01 or even 0.05 to see if the CNV you're expecting pops up. Let me know how it goes!
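Putting those suggestions together, a re-run could look something like the following (flag spellings are the ones quoted above; the target-info filename is illustrative):

    cobalt predict -m TS_CANCER_MODEL.model -d SAMPLE_COVERAGE.bed -o SAMPLE_CNV.bed \
        --target-info TARGET_INFO.txt --min-qual 0.6 -a 0.01 -b 0.01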

ramujana commented 5 years ago

Hi, thanks for the suggestions; unfortunately none of them helped, there is still an empty output file... I'm working with germline DNA. Meanwhile, I've noticed that during model training all samples had a depth CV > 1, so I increased --max-cv to retain all samples. I've checked the model info with desc, and it looks comprehensive. The sample of interest also has a high sd (sd: 757.54), could that be an issue? (All samples are from the same run.)
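For reference, the retraining step looked roughly like this (only the --max-cv flag is certain from the above; the train subcommand name, its -d/-o arguments, and the threshold value are reconstructed assumptions):

    cobalt train -d TRAINING_COVERAGE.bed -o TS_CANCER_MODEL.model --max-cv 2.0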

brendanofallon commented 5 years ago

Hmm, interesting. Another possibility is that the region with the CNV is being 'masked' because of excessively high, low, or variable depth; cobalt detects and removes such regions prior to generating the model (I'm actually not sure how great an idea this is). You can try generating the model with the '--no-mask' option to disable all region masking. To find out whether the region is getting masked, you can run cobalt desc -m [your model] --emit-bed. This emits a BED file of all the targets, along with a value indicating whether each target was masked and, if not, the 'resolution' of the target (> 10 is good, < 5 is pretty bad). If the CNV of interest overlaps a region with poor target resolution, it's going to be pretty hard for cobalt to detect it.

How did the log2s and copy number estimates from the --target-info output look? Also, how big is the CNV you're expecting, in terms of the number of targets it overlaps? Sensitivity can be poor, especially for small (1-2 target) CNVs, and duplications are harder to detect than deletions.
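Concretely, the masking check could look like this (the desc command is quoted above; writing stdout to a file is an assumption about how --emit-bed behaves):

    cobalt desc -m TS_CANCER_MODEL.model --emit-bed > MODEL_TARGETS.bed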

ramujana commented 5 years ago

Hi, I had the same idea and already tested it by generating an unmasked model. The regions of interest are quite well represented (in this sample I expect to see a heterozygous deletion of these regions):

    chr3 10188197 10188321 43.05
    chr3 10191470 10191650 23.97

By the way, the target info file came out empty as well, with only the headers written.

brendanofallon commented 5 years ago

The target info file should never be empty, so there's a technical problem of some kind. Would you be willing to share a minimal set of data that reproduces the problem? Also, can you clarify what you mean by "ubuntu on windows"? Are you running in WSL (Windows Subsystem for Linux)? To be honest, I haven't done any testing in that environment, so there could be an issue there.

ramujana commented 5 years ago

Thanks for the pointer. I've just tried running cobalt on a Mac with Python 3.7, and the end result was the same... Previously, I was using Ubuntu in WSL. As requested, for testing the problem I'm attaching a txt file of background-sample read depths for chr3 (COVERAGE_3CHR.txt) and a txt file for the sample of interest (SAMPLE_COVERAGE3.txt), since the bed format is not supported here. You should be able to convert them to bed and test. Looking forward to seeing how it works for you!

COVERAGE_3CHR.txt SAMPLE_COVERAGE3.txt
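Assuming the attachments are tab-delimited BED content that was only renamed to .txt for upload (an assumption about the file contents), converting back should just be a rename:

    cp COVERAGE_3CHR.txt COVERAGE_3CHR.bed
    cp SAMPLE_COVERAGE3.txt SAMPLE_COVERAGE3.bed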


brendanofallon commented 5 years ago

OK, we identified an (embarrassing) issue relating to chromosome naming. I just pushed a new release, version 0.7.3, with a fix; could you try that? Running your test data seemed to work fine (and there was a 2-target duplication in the area you indicated).
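For anyone hitting something similar, one quick sanity check is to compare the contig names between the sample depth file and the model targets. This is generic shell, not part of cobalt, and the specific cause here (a '3' vs 'chr3' prefix mismatch) is an assumption:

    cut -f1 SAMPLE_COVERAGE.bed | sort -u     # contig names in the sample depths
    cut -f1 MODEL_TARGETS.bed | sort -u       # contig names in the model targets (from desc --emit-bed)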

ramujana commented 5 years ago

Thank you, Brendan! Now it worked perfectly: an excellent and smooth process! (In fact, it's a deletion of the VHL gene spanning all 3 exons. I was unsure about the coverage quality of the first exon, but it picked up all of them!) I have one more question regarding model training: can I use the same model for other batches/runs, or do I need to use only samples from the same run (to avoid variability)? Is it advisable to update the model by adding more samples from various runs? Thanks again for your time and for solving the issue.

brendanofallon commented 5 years ago

Great! I'm glad that worked. Generally it's best to use samples from multiple runs, especially if you're hoping to call CNVs on batches from those runs or on additional future runs. Regarding model building, there are two rules of thumb: First, more is typically better, and models with hundreds of background samples from many sequencing batches/runs are ideal. Second, the more similar the background samples are to the query sample, the better. If you want a single model that will work for any future sample, build a big model with many samples from lots of runs. If you just have a couple of samples that you're interested in, then a smaller model with samples from the same batch is probably OK (it seems to have worked in this case, at least). Best of luck!
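As a sketch, the 'one big model' route might look like this, assuming the training input is a single multi-sample depth file and using the same assumed train subcommand and flags as before (filenames are illustrative):

    cobalt train -d POOLED_BACKGROUND_DEPTHS.bed -o GENERAL_MODEL.model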