NBISweden / GenErode

GitHub repository for GenErode, a Snakemake pipeline for the analysis of whole-genome sequencing data from historical and modern samples to study patterns of genome erosion.
GNU General Public License v3.0
21 stars 7 forks source link

individual vcfs merging failing #39

Closed ndussex closed 1 year ago

ndussex commented 1 year ago

Hi,

I am trying to run step 9 ( 9_merge_vcfs.smk) with the latest version of the pipeline but the first job of this rule (i.e. bcftools merge) always fails without any indication of the possible error in the slurm.out or log files. The first merged.bcf is then empty, only containing the header. It could be a memory issue (68 mammalian genomes), but I am not sure. If I run the rule manually outside the pipeline, the same first step fails as well.

Previously, I would run this rule manually with a slightly different script, but since I want to estimate load with GERP scores, I want to merge the vcfs inside the pipeline.

My guess is that the '-m snps' option is causing this problem in the bcftools merge step:

bcftools merge -m snps -O b -o {output.merged} {input.bcf} 2> {log}

I didn't use it in my manual merging script since the selection of snps only is used in the following step:

bcftools view -m2 -M2 -v snps -Ob -o {output.bcf} {input.bcf} 2> {log} &&

But maybe I am wrong.

Thanks for your help!

verku commented 1 year ago

Hi Nic! Thanks for reporting this bug. For me this step works but I've only run the pipeline with a few samples at a time. I'll create a new branch to try out your suggestion, could you test it with your dataset?

verku commented 1 year ago

One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).

ndussex commented 1 year ago

Hi Verena, I understand the reasoning behind the "-m snps" flag now.

Yes, I tried modifying the cluster file as follows:

. . partition: node time: 09:00:00 . .

merge and process VCF files

merge_all_vcfs: time: 2-23:40:00 cpus-per-task: 1

But it failed as described above. It generates the first merged vcf but then fails when filtering for biallelic sites. It also deletes the merged.bcf file.

Cheers, Nic

On Mon, 20 Feb 2023 at 09:41, Verena Kutschera @.***> wrote:

One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1436553515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVI22CK67LIWZDRC2GLWYMU25ANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

ndussex commented 1 year ago

Hi again,

I think I found the issue by running each command manually. The problem doesn't seem to have to do with memory, but happens when I filter my vcf for missing data:

rule filter_vcf_missing: bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf}

I am setting my missing data threshold to '0' and for some reason all my SNPs are filtered out.

However, if I use plink as below, I get a filtered file as expected: plink --vcf myvcf.gz --geno 0 --allow-extra-chr --keep 77_genomes.txt --recode --real-ref-alleles --out myvcf_zero_missing --make-bed

So, I am a bit confused as to why.

Cheers, Nic

On Fri, 24 Feb 2023 at 08:45, Nicolas Dussex @.***> wrote:

Hi Verena, I understand the reasoning behind the "-m snps" flag now.

Yes, I tried modifying the cluster file as follows:

. . partition: node time: 09:00:00 . .

merge and process VCF files

merge_all_vcfs: time: 2-23:40:00 cpus-per-task: 1

But it failed as described above. It generates the first merged vcf but then fails when filtering for biallelic sites. It also deletes the merged.bcf file.

Cheers, Nic

On Mon, 20 Feb 2023 at 09:41, Verena Kutschera @.***> wrote:

One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1436553515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVI22CK67LIWZDRC2GLWYMU25ANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

verku commented 1 year ago

Hi Nic! Thanks for looking into that, I'll try to re-create the error with my testdata and will get back to you.

verku commented 1 year ago

Hi! I might have an idea: did you specify f_missing in the config.yaml file (or F_MISSING in your manual script) as 0 or as 0.0? In bcftools, the F_MISSING parameter has to be a floating point number as it specifies the fraction of allowed missing genotypes. I didn't clarify this in the config.yaml file properly, which I'll fix in the next version.

ndussex commented 1 year ago

Hi Verena, I remember trying both as the code suggested. I'm doing another try now with 0.0, should be finished tomorrow. So, I'll let you know.

Nic

On Mon, 27 Feb 2023 at 13:50, Verena Kutschera @.***> wrote:

Hi! I might have an idea: did you specify f_missing in the config.yaml file (or F_MISSING in your manual script) as 0 or as 0.0? In bcftools, the F_MISSING parameter has to be a floating point number as it specifies the fraction of allowed missing genotypes.

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1446270437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVM5VQBXLNRUHR4T22LWZSPJDANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

verku commented 1 year ago

Thank you for double checking!

verku commented 1 year ago

Hi! I know what went wrong, 'F_MISSING < 0.0' can't work - it must be 'F_MISSING = 0.0'. I'm working on a solution and will create a new pipeline version as soon as I've tested it thoroughly. I'll let you know when the pipeline is functioning and available!

verku commented 1 year ago

Hi Nic! There is a solution now available on the branch dev that I tested with the test dataset. If you have the time, would you like to test it on your data? Here is the code if you would like to run it in a separate script:

bcftools view -i 'F_MISSING = 0.0' -Oz -o {output.vcf} {input.bcf} 2> {log}

Otherwise, you can run git checkout dev to switch to the branch dev with the new solution and run it in the pipeline.

ndussex commented 1 year ago

Hi Verena,

I manually edited the rule yesterday and it is running now. I will let you know.

Thanks!

On Tue, 28 Feb 2023 at 10:41, Verena Kutschera @.***> wrote:

Hi Nic! There is a solution now available on the branch dev that I tested with the test dataset. If you have the time, would you like to test it on your data? Here is the code if you would like to run it in a separate script:

bcftools view -i 'F_MISSING = 0.0' -Oz -o {output.vcf} {input.bcf} 2> {log}

Otherwise, you can run git checkout dev to switch to the branch dev with the new solution and run it in the pipeline.

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1447864799, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVO4JQKAGG3JQ2JQ7ELWZXB4FANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

verku commented 1 year ago

Awesome, thanks!

To clarify, the updated code checks the value of f_missing and runs bcftools in different ways for f_missing = 0.0, f_missing = 1.0, or a value 0.0 < f_missing < 1.0:

https://github.com/NBISweden/GenErode/pull/42/files?file-filters%5B%5D=.smk&show-viewed-files=true

# only include sites with zero missing data
 if [[ `echo 0.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
  bcftools view -i 'F_MISSING = {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include all sites
elif [[ `echo 1.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
  bcftools view -i 'F_MISSING <= {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include sites with less than the fraction f_missing of missing data
elif [[ `echo 0.0 {params.fmiss} 1.0 | awk '{{print ($1 < $2 && $2 < $3)}}'` == 1 ]]
then 
  bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
fi

bcftools index -f {output.vcf} 2>> {log}
ndussex commented 1 year ago

Hi Verena,

It seems to be working fine now. Thanks for fixing it so quickly.

Nic

On Tue, 28 Feb 2023 at 11:31, Verena Kutschera @.***> wrote:

Awesome, thanks!

To clarify, the updated code checks the value of f_missing and runs bcftools in different ways for f_missing = 0.0, f_missing = 1.0, or a value 0.0 < f_missing < 1.0:

https://github.com/NBISweden/GenErode/pull/42/files?file-filters%5B%5D=.smk&show-viewed-files=true

only include sites with zero missing data

if [[ echo 0.0 {params.fmiss} | awk '{{print ($1 == $2)}}' == 1 ]] then bcftools view -i 'F_MISSING = {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}

include all sites

elif [[ echo 1.0 {params.fmiss} | awk '{{print ($1 == $2)}}' == 1 ]] then bcftools view -i 'F_MISSING <= {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}

include sites with less than the fraction f_missing of missing data

elif [[ echo 0.0 {params.fmiss} 1.0 | awk '{{print ($1 < $2 && $2 < $3)}}' == 1 ]] then bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log} fi

bcftools index -f {output.vcf} 2>> {log}

— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1447938230, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVL5D3WEIFTPTE64IPDWZXHYXANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Nic Dussex PhD

Postdoctoral Researcher Department of Natural History

NTNU University Museum

7012 Trondheim

verku commented 1 year ago

Thank you for testing the code! I'll just run some final quick checks and will release a new version with the corrected code.