billingross / chop-variant-calling

MIT License
0 stars 0 forks source link

Create a brief variant calling workflow #9

Open billingross opened 2 months ago

billingross commented 2 months ago

Assignment instructions:

billingross commented 2 months ago

Workflow outline

billingross commented 2 months ago

BED format

From the CHoP BED:

Confident that these do not match.

billingross commented 2 months ago

Paired Fastqs to Unmapped Bam task: https://github.com/gatk-workflows/seq-format-conversion/blob/master/paired-fastq-to-unmapped-bam.wdl

billingross commented 2 months ago

From Picard FastqToSam documents:

java -jar picard.jar FastqToSam \
       F1=forward_reads.fastq \
       F2=reverse_reads.fastq \
       O=unaligned_read_pairs.bam \
       SM=sample001 \
       RG=rg0013
billingross commented 2 months ago

Instructions for running Cromwell using Batch: https://cromwell.readthedocs.io/en/develop/tutorials/Batch101/

billingross commented 1 month ago

Broad hg19 reference files on GCP: gs://gcp-public-data--broad-references/hg19/v0

billingross commented 1 month ago

Command to run workflow on GCP:

java -Dconfig.file=google.conf -jar cromwell-87.jar run fastq-to-ubam.wdl -i fastq-to-ubam-inputs.json
billingross commented 1 month ago

Do QC with samtools flagstat

billingross commented 1 month ago

Samtools course: https://davetang.org/wiki/tiki-index.php?page=SAMTools#Creating_a_BAM_index_file

billingross commented 1 month ago
billingross commented 1 month ago

Try using the -L argument to haplotype caller to only call regions from the provided BED file: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

billingross commented 1 month ago

Error when trying to run HaplotypeCaller from trellis-v2-cromwell/HaplotypeCaller/ab9c11c3-019c-4e1c-8512-8d8e7b7b7b72/call-GatkHaplotypeCaller/stderr:

##### ERROR MESSAGE: SAM/BAM/CRAM file /mnt/disks/cromwell_root/trellis-v2-cromwell/FastqToBam/c826a234-b89d-471f-a995-0be9988e590b/call-BwaMem/sample_pe.sorted.bam is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header.  The GATK no longer supports SAM files without read groups
##### ERROR ------------------------------------------------------------------------------------------

This thread indicates solution is to add the following argument when running BWA

-R @RG\tID:foo\tSM:bar
billingross commented 1 month ago

Guide to post-call filtering using bcftools: https://www.htslib.org/workflow/filter.html

billingross commented 1 month ago

GATK error reading bed file from stderr

##### ERROR MESSAGE: File associated with name /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed is malformed: Problem reading the interval file caused by Error parsing line at byte position: htsjdk.tribble.readers.LineIteratorImpl@61c4cebd, for input source: /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed
##### ERROR ------------------------------------------------------------------------------------------

I noticed that the BED file didn't match the standard format so I'm just going to try chopping the non-standard columns

billingross commented 1 month ago

Chopping file command:

awk '{print $1,$2,$3,$4,$5,$6}' file
billingross commented 1 month ago

Variants called just for BED region: trellis-v2-cromwell/HaplotypeCaller/1d082cd2-5e5a-4866-9dc4-8241b38ad079/call-GatkHaplotypeCaller

VCF size: 4.8kb

billingross commented 1 month ago

Variants called without BED: https://console.cloud.google.com/storage/browser/trellis-v2-cromwell/HaplotypeCaller/47238bad-c9fd-4e0b-9076-f21c4764f455/call-GatkHaplotypeCaller?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=trellis-v2

VCF size: 188.9 KB

billingross commented 1 month ago

Create tab-delimited bcftools annotation file with columns

Biostars reference: https://www.biostars.org/p/122690/ Bcftools annotate: https://samtools.github.io/bcftools/bcftools.html#annotate

awk '{print $1,$2,$3,$7}' roi.bed > gene_symbols.tab
bgzip gene_symbols.tab
tabix -s 1 -b 2 -e 3 gene_symbols.tab.gz
billingross commented 1 month ago

Instead of parsing gene symbols from BED I'm going to try getting GFF3 from Gencode: https://www.gencodegenes.org/human/release_19.html.

Download link: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz

billingross commented 1 month ago

I'm going to try using VEP to do the gene annotation instead, per Dave Tang's blog: https://davetang.org/muse/2018/01/05/annotating-variants-custom-file/

billingross commented 1 month ago

Alternately, what if I try adding a header to my tab file

billingross commented 1 month ago

OK the tabix issues seems to be that, despite my best efforts, the separators were space not tabs.

Also, the BED file is NOT sorted by initial position because there are multiple coding regions that overlap.

billingross commented 1 month ago

Extract just the gene entries from the gff3 file:

awk -F'\t' '$3 == "gene"' gencode.v19.annotation.gff3 > gencode.v19.annotation.genes.gff3

https://stackoverflow.com/questions/5374239/tab-separated-values-in-awk

billingross commented 1 month ago

Just use ENSEMBL genes:

awk -F'\t' '$2 == "ENSEMBL"' gencode.v19.annotation.genes.gff3 > gencode.v19.annotation.genes.ensembl.gff3
billingross commented 1 month ago

I have generated a tabix index for my GFF3 file with only ENSEMBL genes: gencode.v19.annotation.genes.ensembl.gff3.gz.tbi

billingross commented 1 month ago

OK here's how I think this should work using bcftools annotate based on this biostars thread:

Example command:

bcftools annotate \  
-a H37Rv-ref.tab.gz \  
-h header.hdr \  
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \   
variants.vcf > annotated.vcf

My command:

bcftools annotate \
-a gencode.v19.annotation.genes.ensembl.gff3.gz
-h header.hdr \
-c CHROM,-,-,FROM,TO,
billingross commented 1 month ago

Actually, nevermind. The GFF doesn't work; it doesn't even have the gene symbol and it's buried in a composite field.

Instead I'm trying using the feature table from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/

Limiting to protein coding genes on a genuine chromosome:

awk -F'\t' '$1 == "gene"' GCF_000001405.25_GRCh37.p13_feature_table.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes.txt
awk -F'\t' '$2 == "protein_coding"' GCF_000001405.25_GRCh37.p13_feature_table_genes.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt
awk -F'\t' '$5 == "chromosome"' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt
bgzip GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt -o GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt.gz

Generate tabix index:

tabix -s 6 -b 8 -e 9 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz
billingross commented 1 month ago

bcftools annotate command

Example:

bcftools annotate \  
-a H37Rv-ref.tab.gz \  
-h header.hdr \  
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \   
variants.vcf > annotated.vcf
./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz \
-h header.hdr \
-c -,-,-,-,-,CHROM,-FROM,TO,-,-,INFO/gene,-,- \
sample_pe.vcf > annotated_sample_pe.vcf

Example line from file:

gene    protein_coding  GCF_000001405.25        Primary Assembly        chromosome      1       NC_000001.10    65419   71585   +                               olfactory receptor family 4 subfamily F member 5     OR4F5   79501           6167

My header.hdr:

##INFO=<ID=gene,Number=1,Type=String,Description="Gene">
billingross commented 1 month ago

OK, this failed on the first line:

Could not parse tab line: gene  protein_coding  GCF_000001405.25        Primary Assembly        chromosome      1       NC_000001.10    65419   71585   +           olfactory receptor family 4 subfamily F member 5 OR4F5   79501           6167
Failed to parse: GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz

Maybe I need the header line in the annotation file?

billingross commented 1 month ago

Original header line:

# feature       class   assembly        assembly_unit   seq_type        chromosome      genomic_accession       start   end     strand  product_accession       non-redundant_refseq related_accession       name    symbol  GeneID  locus_tag       feature_interval_length product_length  attributes

My header


feature
billingross commented 1 month ago

Just get CHROM, FROM, TO, SYMBOL and read/write as TSV:

awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table.txt | head
awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt
billingross commented 1 month ago

Tabix command:

tabix -s 1 -b 2 -e 3 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz

Annotate command:

./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf > annotated_sample_pe.vcf

Annotate command:

./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf.gz > annotated_sample_pe.vcf
billingross commented 1 month ago

Quality thresholds:

Filter based on depth

./bcftools/bcftools view -e 'INFO/DP < 3 || FORMAT/GQ < 7' full_sample_pe.vcf.gz 
billingross commented 1 month ago

Stats

./bcftools/bcftools stats filtered_annotated_sample_pe.vcf