freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
105 stars 8 forks source link

Queries over installation #18

Open jinghuazhao opened 3 months ago

jinghuazhao commented 3 months ago

Hi Giulio / All,

I tried to get +liftover going on our systems but found location of a couple of files needs to be detailed -- could anybody point to me?

They are 1kGP_high_coverage_Illumina.sites.vcf.gz,, hs1.fa -- I made attempt to replace them as in here, https://cambridge-ceu.github.io/csd3/applications/bcftools.html#120

As documented files are stored in $HOME/GCRh38 where our system has a limit on $HOME I tried to get them elswhere. In particular, the *.so files are with bin which appears to be available through LD_LIBRARY_PATH as well?

Your comments are greatly appreicated,

JIng Hua

freeseek commented 3 months ago

Thank you for the heads up. I need to fix the documentation but to generate the sites only file, this command will do (but it will take a long time):

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr{{1..22}.filtered.SNV_INDEL_SV_phased_panel,X.filtered.SNV_INDEL_SV_phased_panel.v2}.vcf.gz
for chr in {1..22} X; do
  if [ $chr == "X" ]; then sfx=".v2"; else sfx=""; fi
  bcftools view --no-version -Ou -c 2 1kGP_high_coverage_Illumina.chr$chr.filtered.SNV_INDEL_SV_phased_panel$sfx.vcf.gz | \
  bcftools annotate --no-version -Ou -x ID,QUAL,FILTER,^INFO/AC,^INFO/AN,^INFO/END,^FMT/GT | \
  bcftools sort -o 1kGP_high_coverage_Illumina.chr$chr.bcf -Ob -T ./bcftools. --write-index
done

bcftools concat --no-version -Ou 1kGP_high_coverage_Illumina.chr{{1..22},X}.bcf | \
  bcftools view --no-version -G -Ob -o 1kGP_high_coverage_Illumina.sites.bcf --write-index

You don't need to store files in the $HOME/GCRh38 directory as any directory of your choice will do. This directory is in the documentation but it is not in the plugins code

As for the *.so files, say you want to run score.so located in the directory /path/to/bcftools/plugins, you can run it as follows (four different options):

export BCFTOOLS_PLUGINS=/path/to/bcftools/plugins && bcftools +score
export BCFTOOLS_PLUGINS=/path/to/bcftools/plugins && bcftools plugin score
bcftools +$BCFTOOLS_PLUGINS/score.so
bcftools plugin $BCFTOOLS_PLUGINS/score.so

You can also compile BCFtools with a default plugin directory which will be hard coded in the binary, but this is not required

jinghuazhao commented 3 months ago

Thanks for the feedback -- indeed it appears possible to redinfe HOME to make install there. I did this in a rush without much thinking and I will try your script possibly in SLURM.

Also for information, I have my module file listed below -- it does appear to work.

> #%Module -*- tcl -*-
> ##
> ## modulefile
> ##
> proc ModulesHelp { } {
> 
>   puts stderr "\tbcftools - utilities for variant calling and manipulating VCFs and BCFs\n"
>   puts stderr "\tInstalled under: /usr/local/Cluster-Apps/ceuadmin/bcftools/1.20"
>   puts stderr "\tHomepage: https://www.htslib.org/download/"
> 
> }
> 
> module-whatis "utilities for variant calling and manipulating VCFs and BCFs."
> 
> set               root                  /usr/local/Cluster-Apps/ceuadmin/bcftools/1.20
> prepend-path      PATH                  $root/bin
> prepend-path      INCLUDE               $root/include
> prepend-path      MANPATH               $root/share/man
> prepend-path      LD_LIBRARY_PATH       $root/lib:$root/libexec/bcftools
> prepend-path      LIBRARY_PATH          $root/lib:$root/libexec/bcftools
> 
> setenv            BCFTOOLS_PLUGINS      $root/score
> 
jinghuazhao commented 3 months ago

Actually setenv in module definition is equivalent to your export under Bash, so it is as indicated. The script for 1kGP_high_coverage_Illumina.sites.bcf generation took ~6hr on our system., which is presumbly (?) used in

cftools norm --no-version -Ou -m+ 1kGP_high_coverage_Illumina.sites.bcf | \ bcftools +liftover --no-version -Ou -- \ -s $public_databases/dbsnp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \ -f $public_databases/GRCh37_reference_fasta/hs1.fa \ -c $public_databases/dbsnp/hg38ToHs1.over.chain.gz | \ bcftools sort -o 1kGP_high_coverage_Illumina.sites.hs1.bcf -Ob --write-index

but I got error message: on AC and again warning if with (I reckon bcftools norm not applied properly)

bcftools norm --no-version -Ou -m+ 1kGP_high_coverage_Illumina.sites.bcf | \ bcftools annotate -x INFO/AC -Ou | \ bcftools +fill-tags -Ou -- -t AC | \ bcftools +liftover --no-version -Ou - -- \ -s $public_databases/dbsnp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \ -f $public_databases/GRCh37_reference_fasta/hs1.fa \ -c $public_databases/dbsnp/hg38ToHs1.over.chain.gz | \ bcftools sort -o 1kGP_high_coverage_Illumina.sites.hs1.bcf -Ob -T ./bcftools --write-index

and for the module file I need to use setenv BCFTOOLS_PLUGINS $root/score:$root/libexec/bcftools

freeseek commented 3 months ago

If you used the command above to generate the 1kGP_high_coverage_Illumina.sites.bcf file, then this should be a file without genotypes. You cannot fill the AC tags without genotypes, but if you followed the instructions I sent you should not have to as you should already have that tag. Also notice that bcftools +fill-tags -- -t AC is terribly slow and bcftools +fill-AN-AC is recommended instead

jinghuazhao commented 3 months ago

Thanks very much for the comments. I attempted to walk through the last chunk of your "Liftover VCFs" section, which is

bcftools norm --no-version -Ou -m+ 1kGP_high_coverage_Illumina.sites.vcf.gz | \ bcftools +liftover --no-version -Ou -- \ -s $HOME/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \ -f $HOME/hs1/hs1.fa \ -c $HOME/hs1/hg38ToHs1.over.chain.gz \ bcftools sort -o 1kGP_high_coverage_Illumina.sites.hs1.bcf -Ob --write-index

but went astray. As you can see that I am far from knowing bcftools enough, and therefor would try your notes above as well. BTW, the code you posted yesterday (wrapped in here, https://cambridge-ceu.github.io/csd3/applications/files/site.sb) did go through smoothly. I am not sure this route is valid at all as I got a warning in both cases,

bcftools norm --no-version -Ou -m+ ../1000G/bcftools/1kGP_high_coverage_Illumina.sites.bcf | \
> bcftools annotate -x INFO/AC -Ou | \
> bcftools +fill-AN-AC -Ou -- | \
> bcftools +liftover --no-version -Ou - -- \
> -s $public_databases/dbsnp/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
> -f $public_databases/GRCh37_reference_fasta/hs1.fa \
> -c $public_databases/dbsnp/hg38ToHs1.over.chain.gz | \
> bcftools sort -o 1kGP_high_coverage_Illumina.sites.hs1.bcf -Ob -T ./bcftools --write-index
Writing to ./bcftoolsxUm4P1
INFO/AC is handled by AC rule
Warning: input VCF includes symbolic alleles that might not properly lift over
freeseek commented 3 months ago

Again, you don't need to use bcftools +fill-AN-AC as it will not do anything on a file without genotypes. You are getting the INFO/AC is handled by AC rule message only because the field is in the header, but it will never be present if you remove it first from a VCF without genotypes. As for the warning, that is due to DEL/DUP variants in the VCF which will not always liftover properly as BCFtools/liftover is designed for SNVs and indels. CNVs cannot be easily lifted over in all scenarios

jinghuazhao commented 3 months ago

It looks I need to slow down and digest what bcftools does in these cases. Thanks very much for your time..