durwa004 / genetic_burden_pipeline

Scripts and tools to estimate the genetic burden as part of my first aim of my thesis.
0 stars 1 forks source link

Figure out union/intersect for genetic burden pipeline #1

Closed durwa004 closed 4 years ago

durwa004 commented 5 years ago
durwa004 commented 5 years ago

Need to update README.md

Convert annovar/snpeff _coding variants to a more user friendly table, so that I can get true union and intersect.

Get_annovar_snpeff_coding_tidy.py

Then get union/intersect just based on whether annovar/snpeff called them coding

durwa004 commented 5 years ago

Old scripts

Get union between annovar and snpeff

Get chrom/pos/impact of all coding variants

durwa004 commented 5 years ago

Number of coding variants:

Overlap between coding variants

durwa004 commented 5 years ago

Next - no. of variants in intersect per individual and per breed group

durwa004 commented 5 years ago

Currently working

durwa004 commented 5 years ago

Move avinput files to s3

$ qsub /home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/s3_scripts/s3cmd_put_annovar.pbs 
durwa004 commented 5 years ago

Extract breeds from thesis intersect vcf

$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b QH 
$ python ../../variant_calling/python_generation_scripts/Generate_pbs_submission_shell.py -d ../split_by_breed/
$ sh /home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/genetic_burden_pipeline/split_by_breed/pbs_shell.sh 

Arabian (35) Belgian (20) Clydesdale(19) Morgan (20) STB (48) TB (53) WP (20) Icelandic (17) Shetland (55) QH (70)

durwa004 commented 5 years ago

Move joint intersect chromosome files to s3

$ qsub /home/mccuem/shared/Projects/HorseGenomeProject/scripts/EquCab3/s3_scripts/s3cmd_put_joint_intersect.pbs 
durwa004 commented 5 years ago
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Shetland -m 3 -c 11
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b TB -m 3 -c 10
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b STB -m 3 -c 10
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Icelandic -m 1 -c 3
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b WP -m 1 -c 4
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Morgan -m 1 -c 4
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Belgian -m 1 -c 4
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Clydesdale -m 1 -c 4
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b QH -m 4 -c 14
$ python Generate_bcftools_view_by_breed.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -i /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/horse_genomes_breeds_tidy.txt -b Arabian -m 2 -c 7

This creates quite a lot of files with a lot of memory. These are the file output options:

The command line was:   bcftools isec  -p Arabian_rare_Belgian_common thesis_intersect_Arabian_rare.vcf.gz thesis_intersect_Belgian_common.vcf.gz

Using the following file names:
Arabian_rare_Belgian_common/0000.vcf    for records private to  thesis_intersect_Arabian_rare.vcf.gz
Arabian_rare_Belgian_common/0001.vcf    for records private to  thesis_intersect_Belgian_common.vcf.gz
Arabian_rare_Belgian_common/0002.vcf    for records from thesis_intersect_Arabian_rare.vcf.gz shared by both    thesis_intersect_Arabian_rare.vcf.gz thesis_intersect_Belgian_common.vcf.gz
Arabian_rare_Belgian_common/0003.vcf    for records from thesis_intersect_Belgian_common.vcf.gz shared by both  thesis_intersect_Arabian_rare.vcf.gz thesis_intersect_Belgian_common.vcf.gz

Looks like I can delete 0000, and 0001

durwa004 commented 5 years ago

Can probably also find variants that have an AF <0.5% in the whole population but over 5% in a population. Can probably also find variants that have an AF <3% in the breed but over 5% in the general population. Arabian (35): AF < 3% = AC: 2, AF > 5% = 4 Belgian (20): AF < 3% = AC: 1, AF > 5% = 2 Clydesdale(19): AF < 3% = AC: 1, AF > 5% = 2 Morgan (20): AF < 3% = AC: 1, AF > 5% = 2 STB (48): AF < 3% = AC: 3, AF > 5% = 5 TB (53): AF < 3% = AC: 3, AF > 5% = 5 WP (20): AF < 3% = AC: 1, AF > 5% = 2 Icelandic (17): AF < 3% = AC: 1, AF > 5% = 2 Shetland (55): AF < 3% = AC: 3, AF > 5% = 6 QH (70): AF < 3% = AC: 4, AF > 5% = 7

$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b Arabian -m3 2 -m5 4
$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b Belgian/Clydesdale/Morgan/WP/Icelandic -m3 1 -m5 2
$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b STB/TN -m3 3 -m5 5
$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b QH -m3 4 -m5 7
$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b Shetland -m3 3 -m5 6

Analyze this data.

durwa004 commented 5 years ago

Had to redo this because it wouldn't pull out the common/rare population variants so will have to use gatk for that.

$ python ../../python_scripts/Generate_bcftools_view_rare_variants_breed_common_variants_pop.py -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/ -b TB -m5 5
$ Get_rare_breed_common_pop_info.py 

Transfer to my laptop

$ scp durwa004@login.msi.umn.edu:/home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/common_breed_rare_pop.txt /Users/durwa004/Documents/PhD/Projects/1000_genomes/GB_project/breed_pop_variants/
$ scp durwa004@login.msi.umn.edu:/home/mccuem/shared/Projects/HorseGenomeProject/Data/ibio_EquCab3/ibio_output_files/joint_gvcf/joint_intersect/rare_breed_common_pop.txt /Users/durwa004/Documents/PhD/Projects/1000_genomes/GB_project/breed_pop_variants/
durwa004 commented 5 years ago

Need to redo all of this! Without the PRZE horses!!!!!!!

durwa004 commented 5 years ago

ERR1527968 - horse with no variants = removed

durwa004 commented 5 years ago

Things running:

durwa004 commented 5 years ago

Details about impact annovar: http://annovar.openbioinformatics.org/en/latest/user-guide/gene/ SnpEff: http://snpeff.sourceforge.net/SnpEff_manual.html

durwa004 commented 4 years ago

To do for paper: