Create a brief variant calling workflow

billingross commented 2 months ago

Assignment instructions:

[x] Align the reads to hg19 build 37 genome
[x] Generate quality statistics from the fastq or bam file using your tool of choice.
[x] Call variants for the regions in test.bed (roi.bed) using your tool of choice
[x] Annotate the variants with gene symbol (preferable)
[ ] Apply some filtrations based on your knowledge (preferable)

billingross commented 2 months ago

Workflow outline

bwa-mem to align read to hg37
FastQC to generate quality metrics from the BAM
Use samtools to extract reads that are mapped to the BED region
Use standard GATK tools to perform standard BAM QC
Use haplotype caller to call variants
Annotate variants with gene symbol (https://www.biostars.org/p/122690/)
Apply some filtrations (probably bcftools)

billingross commented 2 months ago

BED format

chrom
chromStart
chromEnd
name
score
strand
thickStart
thickEnd
itemRgb
blockCount
blockSizes
blockStarts

From the CHoP BED:

chrom: 1
chromStart: 160259931
chromEnd: 160260021
name: NM_004371.3_cds_33
score: 0
strand: -
thickStart: COPA
thickEnd: chr1:160258377-160313354
itemRgb: 2
blockCount: 90
blockSizes: 90
blockStarts: 1.0000000

Confident that these do not match.

billingross commented 2 months ago

Paired Fastqs to Unmapped Bam task: https://github.com/gatk-workflows/seq-format-conversion/blob/master/paired-fastq-to-unmapped-bam.wdl

billingross commented 2 months ago

From Picard FastqToSam documents:

java -jar picard.jar FastqToSam \
       F1=forward_reads.fastq \
       F2=reverse_reads.fastq \
       O=unaligned_read_pairs.bam \
       SM=sample001 \
       RG=rg0013

billingross commented 2 months ago

Instructions for running Cromwell using Batch: https://cromwell.readthedocs.io/en/develop/tutorials/Batch101/

billingross commented 1 month ago

Broad hg19 reference files on GCP: gs://gcp-public-data--broad-references/hg19/v0

billingross commented 1 month ago

Command to run workflow on GCP:

java -Dconfig.file=google.conf -jar cromwell-87.jar run fastq-to-ubam.wdl -i fastq-to-ubam-inputs.json

billingross commented 1 month ago

Do QC with samtools flagstat

billingross commented 1 month ago

Samtools course: https://davetang.org/wiki/tiki-index.php?page=SAMTools#Creating_a_BAM_index_file

billingross commented 1 month ago

[x] Generate hg19 bwa index
[x] Run bwa-mem
[x] Run samtools
[x] Run samtools flagstat
[x] Perform variant calling using HaplotypeCaller
[x] Perform variant calling on BED region
[x] Created tab-delimited bcftools annotation file
[x] Annotate variants with gene symbol
[ ] Filter VCF based on quality metrics

billingross commented 1 month ago

Try using the -L argument to haplotype caller to only call regions from the provided BED file: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists

billingross commented 1 month ago

Error when trying to run HaplotypeCaller from trellis-v2-cromwell/HaplotypeCaller/ab9c11c3-019c-4e1c-8512-8d8e7b7b7b72/call-GatkHaplotypeCaller/stderr:

##### ERROR MESSAGE: SAM/BAM/CRAM file /mnt/disks/cromwell_root/trellis-v2-cromwell/FastqToBam/c826a234-b89d-471f-a995-0be9988e590b/call-BwaMem/sample_pe.sorted.bam is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header.  The GATK no longer supports SAM files without read groups
##### ERROR ------------------------------------------------------------------------------------------

This thread indicates solution is to add the following argument when running BWA

-R @RG\tID:foo\tSM:bar

billingross commented 1 month ago

Guide to post-call filtering using bcftools: https://www.htslib.org/workflow/filter.html

billingross commented 1 month ago

GATK error reading bed file from stderr

##### ERROR MESSAGE: File associated with name /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed is malformed: Problem reading the interval file caused by Error parsing line at byte position: htsjdk.tribble.readers.LineIteratorImpl@61c4cebd, for input source: /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed
##### ERROR ------------------------------------------------------------------------------------------

I noticed that the BED file didn't match the standard format so I'm just going to try chopping the non-standard columns

billingross commented 1 month ago

Chopping file command:

awk '{print $1,$2,$3,$4,$5,$6}' file

billingross commented 1 month ago

Variants called just for BED region: trellis-v2-cromwell/HaplotypeCaller/1d082cd2-5e5a-4866-9dc4-8241b38ad079/call-GatkHaplotypeCaller

VCF size: 4.8kb

billingross commented 1 month ago

Variants called without BED: https://console.cloud.google.com/storage/browser/trellis-v2-cromwell/HaplotypeCaller/47238bad-c9fd-4e0b-9076-f21c4764f455/call-GatkHaplotypeCaller?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&project=trellis-v2

VCF size: 188.9 KB

billingross commented 1 month ago

Create tab-delimited bcftools annotation file with columns

CHROM
FROM
TO
GENE

Biostars reference: https://www.biostars.org/p/122690/ Bcftools annotate: https://samtools.github.io/bcftools/bcftools.html#annotate

awk '{print $1,$2,$3,$7}' roi.bed > gene_symbols.tab
bgzip gene_symbols.tab
tabix -s 1 -b 2 -e 3 gene_symbols.tab.gz

billingross commented 1 month ago

Instead of parsing gene symbols from BED I'm going to try getting GFF3 from Gencode: https://www.gencodegenes.org/human/release_19.html.

Download link: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz

billingross commented 1 month ago

I'm going to try using VEP to do the gene annotation instead, per Dave Tang's blog: https://davetang.org/muse/2018/01/05/annotating-variants-custom-file/

billingross commented 1 month ago

Alternately, what if I try adding a header to my tab file

billingross commented 1 month ago

OK the tabix issues seems to be that, despite my best efforts, the separators were space not tabs.

Also, the BED file is NOT sorted by initial position because there are multiple coding regions that overlap.

billingross commented 1 month ago

Extract just the gene entries from the gff3 file:

awk -F'\t' '$3 == "gene"' gencode.v19.annotation.gff3 > gencode.v19.annotation.genes.gff3

https://stackoverflow.com/questions/5374239/tab-separated-values-in-awk

billingross commented 1 month ago

Just use ENSEMBL genes:

awk -F'\t' '$2 == "ENSEMBL"' gencode.v19.annotation.genes.gff3 > gencode.v19.annotation.genes.ensembl.gff3

billingross commented 1 month ago

I have generated a tabix index for my GFF3 file with only ENSEMBL genes: gencode.v19.annotation.genes.ensembl.gff3.gz.tbi

billingross commented 1 month ago

OK here's how I think this should work using bcftools annotate based on this biostars thread:

Example command:

bcftools annotate \  
-a H37Rv-ref.tab.gz \  
-h header.hdr \  
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \   
variants.vcf > annotated.vcf

My command:

bcftools annotate \
-a gencode.v19.annotation.genes.ensembl.gff3.gz
-h header.hdr \
-c CHROM,-,-,FROM,TO,

billingross commented 1 month ago

Actually, nevermind. The GFF doesn't work; it doesn't even have the gene symbol and it's buried in a composite field.

Instead I'm trying using the feature table from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/

Limiting to protein coding genes on a genuine chromosome:

awk -F'\t' '$1 == "gene"' GCF_000001405.25_GRCh37.p13_feature_table.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes.txt
awk -F'\t' '$2 == "protein_coding"' GCF_000001405.25_GRCh37.p13_feature_table_genes.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt
awk -F'\t' '$5 == "chromosome"' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt

bgzip GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt -o GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt.gz

Generate tabix index:

tabix -s 6 -b 8 -e 9 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz

billingross commented 1 month ago

bcftools annotate command

Example:

bcftools annotate \  
-a H37Rv-ref.tab.gz \  
-h header.hdr \  
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \   
variants.vcf > annotated.vcf

./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz \
-h header.hdr \
-c -,-,-,-,-,CHROM,-FROM,TO,-,-,INFO/gene,-,- \
sample_pe.vcf > annotated_sample_pe.vcf

Example line from file:

gene    protein_coding  GCF_000001405.25        Primary Assembly        chromosome      1       NC_000001.10    65419   71585   +                               olfactory receptor family 4 subfamily F member 5     OR4F5   79501           6167

My header.hdr:

##INFO=<ID=gene,Number=1,Type=String,Description="Gene">

billingross commented 1 month ago

OK, this failed on the first line:

Could not parse tab line: gene  protein_coding  GCF_000001405.25        Primary Assembly        chromosome      1       NC_000001.10    65419   71585   +           olfactory receptor family 4 subfamily F member 5 OR4F5   79501           6167
Failed to parse: GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz

Maybe I need the header line in the annotation file?

billingross commented 1 month ago

Original header line:

# feature       class   assembly        assembly_unit   seq_type        chromosome      genomic_accession       start   end     strand  product_accession       non-redundant_refseq related_accession       name    symbol  GeneID  locus_tag       feature_interval_length product_length  attributes

My header


feature

billingross commented 1 month ago

Just get CHROM, FROM, TO, SYMBOL and read/write as TSV:

awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table.txt | head

awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt

billingross commented 1 month ago

Tabix command:

tabix -s 1 -b 2 -e 3 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz

Annotate command:

./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf > annotated_sample_pe.vcf

Annotate command:

./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf.gz > annotated_sample_pe.vcf

billingross commented 1 month ago

Quality thresholds:

QUAL score of 30
DP < 3

Filter based on depth

./bcftools/bcftools view -e 'INFO/DP < 3 || FORMAT/GQ < 7' full_sample_pe.vcf.gz

billingross commented 1 month ago

Stats

./bcftools/bcftools stats filtered_annotated_sample_pe.vcf

billingross / chop-variant-calling

Create a brief variant calling workflow #9