brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0

smoove genotype step taking a very long time for some samples #164

Open robertwhbaldwin opened 2 years ago

robertwhbaldwin commented 2 years ago

Hi,

I'm doing population calling and made it to the genotype step. There are 25 samples, all at similar coverage (20X). The first handful of samples finished quickly (1 hr). This is the command I used:

smoove genotype -d -x -p 1 --name ${i}-joint --outdir ./ --fasta /assembly/GCF_014851395.1_ASM1485139v1_genomic.fa --vcf merged.sites.vcf.gz /bams/${i}.bam

But then I hit a sample that took a very long time (see log file below). After ~20 hrs the sample was still running so I just stopped it thinking that there must be a problem.

I then noticed that the sample that was taking a long time had no ...smoove.genotyped.vcf.gz file. The earlier population calling step, which produces the ...smoove.genotyped.vcf.gz files, only wrote these VCF files for 13/25 samples. For the rest I got the EOF warning (can't read from standard input), which was the topic of another ticket. So it seems to me that the long time the genotyping step is taking may be related to the fact that these samples had no ...smoove.genotyped.vcf.gz file. Can someone explain this? Is joint genotyping only meant for samples with the intermediate ...smoove.genotyped.vcf.gz files?

Thank You - Robert

2021/07/20 16:35:04 [W::hts_idx_load3] The index file is older than the data file: /bams/G0620_M02.bam.bai
[smoove] 2021/07/20 16:35:36
[smoove] 2021/07/20 16:35:36 starting with version 0.2.6
[smoove] 2021/07/20 16:35:36
[smoove] 2021/07/20 16:35:36 running duphold on 1 files in 16 processes
[smoove] 2021/07/20 16:35:36
[smoove] 2021/07/20 16:35:36 [W::hts_idx_load2] The index file is older than the data file: /bams/G0620_M02.bam.bai
[smoove] 2021/07/20 16:37:29
[smoove] 2021/07/20 16:37:29 [duphold] finished
[smoove] 2021/07/20 16:37:29
[smoove] 2021/07/20 16:37:29 finished duphold
[smoove] 2021/07/20 16:37:29 wrote sorted, indexed file to G0620_M02-joint-smoove.genotyped.vcf.gz
RHF05301
[smoove] 2021/07/20 16:37:29 starting with version 0.2.6
[smoove] 2021/07/20 16:37:29 writing sorted, indexed file to RHF05301-joint-smoove.genotyped.vcf.gz
[smoove] 2021/07/20 16:37:29 > gsort version 0.0.6
[smoove] 2021/07/20 16:37:29 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 16:54:24 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 17:12:19 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 17:29:36 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 17:47:50 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 18:04:28 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 18:22:17 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 18:39:21 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 18:55:59 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 19:14:42 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 19:30:15 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 19:48:43 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 20:06:24 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 20:22:22 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 20:40:15 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 20:56:57 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 21:13:31 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 21:33:23 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 21:48:45 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 22:06:37 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 22:23:04 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 22:40:23 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 22:59:03 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 23:14:21 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 23:33:33 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/20 23:49:42 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 00:07:04 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 00:24:53 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 00:41:29 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 01:00:39 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 01:21:06 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 01:37:23 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 01:53:16 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 02:09:04 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 02:25:21 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 02:41:30 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 02:57:28 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 03:13:23 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 03:29:13 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 03:45:08 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 04:00:52 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 04:16:32 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 04:32:14 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 04:47:49 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 05:03:39 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 05:19:19 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 05:35:15 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 05:50:55 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 06:06:37 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 06:22:41 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 06:38:33 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 06:54:05 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 07:09:55 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 07:25:34 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 07:41:21 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 07:56:49 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 08:12:47 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 08:28:31 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 08:44:18 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 09:00:12 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 09:16:03 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 09:32:20 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 09:48:09 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 10:03:54 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 10:19:42 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 10:35:54 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 10:51:35 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 11:07:18 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai
[smoove] 2021/07/21 11:23:00 [W::hts_idx_load3] The index file is older than the data file: /bams/RHF05301.bam.bai

brentp commented 2 years ago

Hi Robert, I'm a bit confused about the series of events. It always helps if you list all the commands that you ran.

If the genotype.vcf.gz files don't exist for some samples, you'd need to return to that step and make sure that genotyping finishes for those samples. I'd also make sure that your bam index is up-to-date.
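Both checks suggested above can be scripted before re-running anything. The following is only a rough sketch, not part of smoove: `index_is_stale` and `missing_vcfs` are hypothetical helper names, and the `-smoove.genotyped.vcf.gz` suffix is assumed from the file names in the log above.

```shell
# Hypothetical helpers (not part of smoove); plain POSIX-style shell.

# Succeeds (exit 0) when the BAM index is missing or older than the BAM.
# usage: index_is_stale SAMPLE.bam [SAMPLE.bam.bai]
index_is_stale() {
    local bam="$1"
    local bai="${2:-$bam.bai}"
    [ ! -e "$bai" ] || [ "$bam" -nt "$bai" ]
}

# Prints the name of every sample whose expected VCF is missing or empty.
# usage: missing_vcfs OUTDIR SUFFIX SAMPLE...
missing_vcfs() {
    local outdir="$1"
    local suffix="$2"
    local sample
    shift 2
    for sample in "$@"; do
        [ -s "$outdir/$sample$suffix" ] || printf '%s\n' "$sample"
    done
}
```

For example, `missing_vcfs results-smoove -smoove.genotyped.vcf.gz sampleA sampleB` would list the samples to go back to, and for any BAM where `index_is_stale` succeeds, re-running `samtools index` on it should make the `hts_idx_load3` warning go away.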

robertwhbaldwin commented 2 years ago

For the first step in population calling I used this:

smoove call --outdir results-smoove/ --name $sample --fasta $reference_fasta -p 1 --genotype /path/to/$sample.bam

Basically, what you stated in the docs, but I did not include an --exclude BED file.

As I said, only 13/25 of my samples got the genotype.vcf.gz file as output. From what I understand this is not because genotyping did not finish, but because it finished and there was nothing called. Instead of a vcf file I got an EOF warning message. There was a ticket on this forum about this and that was the conclusion. No calls. Seems odd to me that half my samples would have calls and the other half would not.

I believe the warning message looked like this:
[smoove] 2020/04/28 09:39:43 2020/04/28 09:39:43 EOF
[smoove] 2020/04/28 09:39:43 Failed to open -: unknown file type
panic: exit status 255

brentp commented 2 years ago

You should have calls unless you are doing targeted sequencing or have extremely low coverage.

robertwhbaldwin commented 2 years ago

Yes, there's a problem with the samples that got no VCF from the call step: a lot of overlapping reads, for example. I ran Picard's CollectWgsMetrics, and after excluding overlapping reads, PCR duplicates, reads with mapping quality < 20, etc., the mean coverage is 7-9X. It should be at least 20X, so a lot of reads were discarded by these filters. But would that explain why there's no VCF for these samples when I run smoove call? Even after the Picard filtering it's not extremely low coverage. But there's obviously a problem with the data and I don't know the extent of it yet.

brentp commented 2 years ago

It would help to see the full output of a job that failed on the call step. But yeah, it sounds like a problem with your data. Maybe the job ran out of memory or time.

robertwhbaldwin commented 2 years ago

It turns out some of the BAM files were generated using R1 and R1 instead of the R1 and R2 files, so we got half the coverage. And no SVs, obviously, since the data is not actually paired-end.
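
A mix-up like this can be caught before alignment by checking that the two mate files really are different. A minimal sketch, assuming the FASTQ files are still on disk; `mates_are_distinct` is a hypothetical helper name and the file names in the usage note are placeholders:

```shell
# Hypothetical helper: fails if the two mate files cannot be read (exit 2),
# are the same path, or are byte-for-byte identical (exit 1).
# usage: mates_are_distinct R1.fastq.gz R2.fastq.gz
mates_are_distinct() {
    local r1="$1"
    local r2="$2"
    if [ ! -r "$r1" ] || [ ! -r "$r2" ]; then
        echo "cannot read one of: $r1 $r2" >&2
        return 2
    fi
    if [ "$r1" = "$r2" ]; then
        echo "same path given for both mates: $r1" >&2
        return 1
    fi
    # cmp -s is silent and exits 0 only when the contents are identical.
    if cmp -s "$r1" "$r2"; then
        echo "mate files are byte-identical: $r1 $r2" >&2
        return 1
    fi
    return 0
}
```

Something like `mates_are_distinct sample_R1.fastq.gz sample_R2.fastq.gz || echo "check inputs"` in front of the alignment command would flag the R1/R1 case. It only compares bytes, so it will not notice an R1 that was recompressed and renamed to R2.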