Closed ndussex closed 1 year ago
Hi Nic! Thanks for reporting this bug. For me this step works but I've only run the pipeline with a few samples at a time. I'll create a new branch to try out your suggestion, could you test it with your dataset?
One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).
Hi Verena, I understand the reasoning behind the "-m snps" flag now.
Yes, I tried modifying the cluster file as follows:
. . partition: node time: 09:00:00 . .
merge_all_vcfs: time: 2-23:40:00 cpus-per-task: 1
But it failed as described above. It generates the first merged vcf but then fails when filtering for biallelic sites. It also deletes the merged.bcf file.
Cheers, Nic
On Mon, 20 Feb 2023 at 09:41, Verena Kutschera @.***> wrote:
One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1436553515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVI22CK67LIWZDRC2GLWYMU25ANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
Hi again,
I think I found the issue by running each command manually. The problem doesn't seem to have to do with memory, but happens when I filter my vcf for missing data:
rule filter_vcf_missing: bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf}
I am setting my missing data threshold to '0' and for some reason all my SNPs are filtered out.
However, if I use plink as below, I get a filtered file as expected: plink --vcf myvcf.gz --geno 0 --allow-extra-chr --keep 77_genomes.txt --recode --real-ref-alleles --out myvcf_zero_missing --make-bed
So, I am a bit confused as to why.
Cheers, Nic
On Fri, 24 Feb 2023 at 08:45, Nicolas Dussex @.***> wrote:
Hi Verena, I understand the reasoning behind the "-m snps" flag now.
Yes, I tried modifying the cluster file as follows:
. . partition: node time: 09:00:00 . .
merge and process VCF files
merge_all_vcfs: time: 2-23:40:00 cpus-per-task: 1
But it failed as described above. It generates the first merged vcf but then fails when filtering for biallelic sites. It also deletes the merged.bcf file.
Cheers, Nic
On Mon, 20 Feb 2023 at 09:41, Verena Kutschera @.***> wrote:
One more question before I change the code: I guess you have tried to increase the memory for this rule, ideally to the maximum? If I remove the "-m snps" flag, the output file will be very large so I'd like to avoid removing it (without it the file will contain all sites...).
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1436553515, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVI22CK67LIWZDRC2GLWYMU25ANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
Hi Nic! Thanks for looking into that, I'll try to re-create the error with my testdata and will get back to you.
Hi! I might have an idea: did you specify f_missing
in the config.yaml
file (or F_MISSING
in your manual script) as 0
or as 0.0
? In bcftools, the F_MISSING
parameter has to be a floating point number as it specifies the fraction of allowed missing genotypes. I didn't clarify this in the config.yaml
file properly, which I'll fix in the next version.
Hi Verena, I remember trying both as the code suggested. I'm doing another try now with 0.0, should be finished tomorrow. So, I'll let you know.
Nic
On Mon, 27 Feb 2023 at 13:50, Verena Kutschera @.***> wrote:
Hi! I might have an idea: did you specify f_missing in the config.yaml file (or F_MISSING in your manual script) as 0 or as 0.0? In bcftools, the F_MISSING parameter has to be a floating point number as it specifies the fraction of allowed missing genotypes.
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1446270437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVM5VQBXLNRUHR4T22LWZSPJDANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
Thank you for double checking!
Hi! I know what went wrong, 'F_MISSING < 0.0' can't work - it must be 'F_MISSING = 0.0'. I'm working on a solution and will create a new pipeline version as soon as I've tested it thoroughly. I'll let you know when the pipeline is functioning and available!
Hi Nic! There is a solution now available on the branch dev
that I tested with the test dataset. If you have the time, would you like to test it on your data? Here is the code if you would like to run it in a separate script:
bcftools view -i 'F_MISSING = 0.0' -Oz -o {output.vcf} {input.bcf} 2> {log}
Otherwise, you can run git checkout dev
to switch to the branch dev
with the new solution and run it in the pipeline.
Hi Verena,
I manually edited the rule yesterday and it is running now. I will let you know.
Thanks!
On Tue, 28 Feb 2023 at 10:41, Verena Kutschera @.***> wrote:
Hi Nic! There is a solution now available on the branch dev that I tested with the test dataset. If you have the time, would you like to test it on your data? Here is the code if you would like to run it in a separate script:
bcftools view -i 'F_MISSING = 0.0' -Oz -o {output.vcf} {input.bcf} 2> {log}
Otherwise, you can run git checkout dev to switch to the branch dev with the new solution and run it in the pipeline.
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1447864799, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVO4JQKAGG3JQ2JQ7ELWZXB4FANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
Awesome, thanks!
To clarify, the updated code checks the value of f_missing
and runs bcftools in different ways for f_missing = 0.0, f_missing = 1.0, or a value 0.0 < f_missing < 1.0:
https://github.com/NBISweden/GenErode/pull/42/files?file-filters%5B%5D=.smk&show-viewed-files=true
# only include sites with zero missing data
if [[ `echo 0.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
bcftools view -i 'F_MISSING = {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include all sites
elif [[ `echo 1.0 {params.fmiss} | awk '{{print ($1 == $2)}}'` == 1 ]]
then
bcftools view -i 'F_MISSING <= {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
# include sites with less than the fraction f_missing of missing data
elif [[ `echo 0.0 {params.fmiss} 1.0 | awk '{{print ($1 < $2 && $2 < $3)}}'` == 1 ]]
then
bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}
fi
bcftools index -f {output.vcf} 2>> {log}
Hi Verena,
It seems to be working fine now. Thanks for fixing it so quickly.
Nic
On Tue, 28 Feb 2023 at 11:31, Verena Kutschera @.***> wrote:
Awesome, thanks!
To clarify, the updated code checks the value of f_missing and runs bcftools in different ways for f_missing = 0.0, f_missing = 1.0, or a value 0.0 < f_missing < 1.0:
https://github.com/NBISweden/GenErode/pull/42/files?file-filters%5B%5D=.smk&show-viewed-files=true
only include sites with zero missing data
if [[
echo 0.0 {params.fmiss} | awk '{{print ($1 == $2)}}'
== 1 ]] then bcftools view -i 'F_MISSING = {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}include all sites
elif [[
echo 1.0 {params.fmiss} | awk '{{print ($1 == $2)}}'
== 1 ]] then bcftools view -i 'F_MISSING <= {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log}include sites with less than the fraction f_missing of missing data
elif [[
echo 0.0 {params.fmiss} 1.0 | awk '{{print ($1 < $2 && $2 < $3)}}'
== 1 ]] then bcftools view -i 'F_MISSING < {params.fmiss}' -Oz -o {output.vcf} {input.bcf} 2> {log} fibcftools index -f {output.vcf} 2>> {log}
— Reply to this email directly, view it on GitHub https://github.com/NBISweden/GenErode/issues/39#issuecomment-1447938230, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUDIVVL5D3WEIFTPTE64IPDWZXHYXANCNFSM6AAAAAAVBFLGJQ . You are receiving this because you authored the thread.Message ID: @.***>
--
Nic Dussex PhD
Postdoctoral Researcher Department of Natural History
NTNU University Museum
7012 Trondheim
Thank you for testing the code! I'll just run some final quick checks and will release a new version with the corrected code.
Hi,
I am trying to run step 9 ( 9_merge_vcfs.smk) with the latest version of the pipeline but the first job of this rule (i.e. bcftools merge) always fails without any indication of the possible error in the slurm.out or log files. The first merged.bcf is then empty, only containing the header. It could be a memory issue (68 mammalian genomes), but I am not sure. If I run the rule manually outside the pipeline, the same first step fails as well.
Previously, I would run this rule manually with a slightly different script, but since I want to estimate load with GERP scores, I want to merge the vcfs inside the pipeline.
My guess is that the '-m snps' option is causing this problem in the bcftools merge step:
bcftools merge -m snps -O b -o {output.merged} {input.bcf} 2> {log}
I didn't use it in my manual merging script since the selection of snps only is used in the following step:
bcftools view -m2 -M2 -v snps -Ob -o {output.bcf} {input.bcf} 2> {log} &&
But maybe I am wrong.
Thanks for your help!