Open billingross opened 2 months ago
Workflow outline
BED format
From the CHoP BED:
Confident that these do not match.
Paired Fastqs to Unmapped Bam task: https://github.com/gatk-workflows/seq-format-conversion/blob/master/paired-fastq-to-unmapped-bam.wdl
From the Picard FastqToSam documentation:
java -jar picard.jar FastqToSam \
F1=forward_reads.fastq \
F2=reverse_reads.fastq \
O=unaligned_read_pairs.bam \
SM=sample001 \
RG=rg0013
Instructions for running Cromwell using Batch: https://cromwell.readthedocs.io/en/develop/tutorials/Batch101/
Broad hg19 reference files on GCP: gs://gcp-public-data--broad-references/hg19/v0
Command to run workflow on GCP:
java -Dconfig.file=google.conf -jar cromwell-87.jar run fastq-to-ubam.wdl -i fastq-to-ubam-inputs.json
Do QC with samtools flagstat
Try using the -L argument to HaplotypeCaller to only call regions from the provided BED file: https://gatk.broadinstitute.org/hc/en-us/articles/360035531852-Intervals-and-interval-lists
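For reference when comparing BED regions against GATK-style interval strings: BED coordinates are 0-based and half-open, while chr:start-end intervals are 1-based and inclusive. The sketch below shows the conversion on a toy file; the file contents are hypothetical, and it assumes a tab-delimited BED.

```shell
# Toy BED file (hypothetical contents): tab-delimited, 0-based, half-open.
bed=$(mktemp)
printf '1\t99\t200\tregionA\n1\t499\t600\tregionB\n' > "$bed"

# BED start is 0-based, so add 1. BED end is exclusive, so it already equals
# the 1-based inclusive end. Header-style lines are skipped.
intervals=$(awk -F'\t' '!/^(#|track|browser)/ {print $1 ":" ($2 + 1) "-" $3}' "$bed")
printf '%s\n' "$intervals"
rm -f "$bed"
```

GATK accepts the BED file directly via -L, so this conversion is only needed when checking coordinates by eye.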
Error when trying to run HaplotypeCaller from trellis-v2-cromwell/HaplotypeCaller/ab9c11c3-019c-4e1c-8512-8d8e7b7b7b72/call-GatkHaplotypeCaller/stderr:
##### ERROR MESSAGE: SAM/BAM/CRAM file /mnt/disks/cromwell_root/trellis-v2-cromwell/FastqToBam/c826a234-b89d-471f-a995-0be9988e590b/call-BwaMem/sample_pe.sorted.bam is malformed. Please see http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-input-files-for-sequence-read-data-bam-cramfor more information. Error details: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
##### ERROR ------------------------------------------------------------------------------------------
This thread indicates the solution is to add the following argument when running BWA (quoted so the shell leaves the \t escapes intact):
-R '@RG\tID:foo\tSM:bar'
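A minimal sketch of building the read group string in a shell variable so the quoting stays correct. The sample name, read group ID, PL value, and file names are placeholders, not values taken from the actual run.

```shell
# Hypothetical sample and read group names; substitute real ones.
sample="sample001"
rg_id="rg0013"

# Keep \t as a literal backslash-t: bwa mem converts it to a tab itself.
# PL:ILLUMINA is an assumption about the sequencing platform.
read_group="@RG\tID:${rg_id}\tSM:${sample}\tPL:ILLUMINA"
printf '%s\n' "$read_group"

# Intended use (not run here; assumes bwa and samtools are installed and the
# reference and FASTQ names below are placeholders):
# bwa mem -R "$read_group" ref.fasta forward_reads.fastq reverse_reads.fastq \
#   | samtools sort -o sample_pe.sorted.bam -
```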
Guide to post-call filtering using bcftools: https://www.htslib.org/workflow/filter.html
GATK error reading the BED file, from stderr:
##### ERROR MESSAGE: File associated with name /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed is malformed: Problem reading the interval file caused by Error parsing line at byte position: htsjdk.tribble.readers.LineIteratorImpl@61c4cebd, for input source: /mnt/disks/cromwell_root/trellis-v2-chop/roi.bed
##### ERROR ------------------------------------------------------------------------------------------
I noticed that the BED file didn't match the standard format, so I'm going to try dropping the non-standard columns (everything after column 6).
Command to chop the file:
awk '{print $1,$2,$3,$4,$5,$6}' file
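Note that awk joins printed fields with spaces by default. A variant that keeps the output tab-delimited might look like the sketch below; the toy input is hypothetical and assumes the original file is tab-separated with seven columns.

```shell
# Toy input (hypothetical): 7 tab-separated columns, the 7th is non-standard.
bed=$(mktemp)
printf '1\t99\t200\tregionA\t0\t+\tGENE1\n1\t499\t600\tregionB\t0\t-\tGENE2\n' > "$bed"

# -F'\t' splits on tabs only; OFS='\t' makes print join fields with tabs.
trimmed=$(awk -F'\t' -v OFS='\t' '{print $1, $2, $3, $4, $5, $6}' "$bed")
printf '%s\n' "$trimmed"
rm -f "$bed"
```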
Variants called just for BED region: trellis-v2-cromwell/HaplotypeCaller/1d082cd2-5e5a-4866-9dc4-8241b38ad079/call-GatkHaplotypeCaller
VCF size: 4.8kb
Create tab-delimited bcftools annotation file with columns
Biostars reference: https://www.biostars.org/p/122690/
Bcftools annotate: https://samtools.github.io/bcftools/bcftools.html#annotate
awk '{print $1,$2,$3,$7}' roi.bed > gene_symbols.tab
bgzip gene_symbols.tab
tabix -s 1 -b 2 -e 3 gene_symbols.tab.gz
Instead of parsing gene symbols from BED I'm going to try getting GFF3 from Gencode: https://www.gencodegenes.org/human/release_19.html.
Download link: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz
I'm going to try using VEP to do the gene annotation instead, per Dave Tang's blog: https://davetang.org/muse/2018/01/05/annotating-variants-custom-file/
Alternatively, what if I try adding a header to my tab file?
OK, the tabix issue seems to be that, despite my best efforts, the separators were spaces, not tabs.
Also, the BED file is NOT sorted by initial position because there are multiple coding regions that overlap.
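Both problems (space separators and unsorted positions) matter because tabix expects a position-sorted, tab-delimited file. A sketch of one way to fix both before running bgzip, using hypothetical toy rows and assuming no field contains internal spaces:

```shell
# Toy rows (hypothetical): space-separated and out of order.
tab=$(mktemp)
printf '2 500 600 GENE3\n1 499 600 GENE2\n1 99 200 GENE1\n' > "$tab"

# $1=$1 forces awk to rebuild each record using OFS (a tab).
# sort then orders by chromosome (column 1) and numeric start (column 2).
sorted=$(awk -v OFS='\t' '{$1 = $1; print}' "$tab" \
  | sort -t "$(printf '\t')" -k1,1 -k2,2n)
printf '%s\n' "$sorted"
rm -f "$tab"
```

The sorted output would then be compressed with bgzip and indexed with tabix -s 1 -b 2 -e 3 as before.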
Extract just the gene entries from the gff3 file:
awk -F'\t' '$3 == "gene"' gencode.v19.annotation.gff3 > gencode.v19.annotation.genes.gff3
https://stackoverflow.com/questions/5374239/tab-separated-values-in-awk
Just use ENSEMBL genes:
awk -F'\t' '$2 == "ENSEMBL"' gencode.v19.annotation.genes.gff3 > gencode.v19.annotation.genes.ensembl.gff3
I have generated a tabix index for my GFF3 file with only ENSEMBL genes: gencode.v19.annotation.genes.ensembl.gff3.gz.tbi
OK, here's how I think this should work using bcftools annotate, based on this Biostars thread:
Example command:
bcftools annotate \
-a H37Rv-ref.tab.gz \
-h header.hdr \
-c CHROM,FROM,TO,-,INFO/locus_tag,-,-,INFO/gene,INFO/product \
variants.vcf > annotated.vcf
My command:
bcftools annotate \
-a gencode.v19.annotation.genes.ensembl.gff3.gz \
-h header.hdr \
-c CHROM,-,-,FROM,TO,
Actually, never mind. The GFF doesn't work: it has no dedicated gene symbol column, and the symbol is buried in the composite attributes field (column 9).
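For the record, if the GFF3 route were revisited, the symbol could probably be pulled out of the attributes column with awk. This is a sketch only: it assumes GENCODE-style gene_name=... attributes, and the toy line below is illustrative rather than copied from the real file.

```shell
# Toy GENCODE-style GFF3 gene line (values are illustrative).
gff=$(mktemp)
printf 'chr1\tHAVANA\tgene\t69091\t70008\t.\t+\t.\tID=ENSG00000186092.4;gene_id=ENSG00000186092.4;gene_type=protein_coding;gene_name=OR4F5\n' > "$gff"

# Split column 9 on ";" and print chrom, start, end, symbol for gene rows.
# "gene_name=" is 10 characters long, so the value starts at position 11.
symbols=$(awk -F'\t' -v OFS='\t' '
  $3 == "gene" {
    n = split($9, attr, ";")
    for (i = 1; i <= n; i++)
      if (attr[i] ~ /^gene_name=/) print $1, $4, $5, substr(attr[i], 11)
  }' "$gff")
printf '%s\n' "$symbols"
rm -f "$gff"
```

The chromosome names in the output would still need to match the names used in the VCF.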
Instead I'm trying using the feature table from NCBI: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/
Limiting to protein coding genes on a genuine chromosome:
awk -F'\t' '$1 == "gene"' GCF_000001405.25_GRCh37.p13_feature_table.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes.txt
awk -F'\t' '$2 == "protein_coding"' GCF_000001405.25_GRCh37.p13_feature_table_genes.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt
awk -F'\t' '$5 == "chromosome"' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt
bgzip GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt
Generate tabix index:
tabix -s 6 -b 8 -e 9 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz
bcftools annotate command
./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz \
-h header.hdr \
-c -,-,-,-,-,CHROM,-FROM,TO,-,-,INFO/gene,-,- \
sample_pe.vcf > annotated_sample_pe.vcf
Example line from file:
gene protein_coding GCF_000001405.25 Primary Assembly chromosome 1 NC_000001.10 65419 71585 + olfactory receptor family 4 subfamily F member 5 OR4F5 79501 6167
My header.hdr:
##INFO=<ID=gene,Number=1,Type=String,Description="Gene">
OK, this failed on the first line:
Could not parse tab line: gene protein_coding GCF_000001405.25 Primary Assembly chromosome 1 NC_000001.10 65419 71585 + olfactory receptor family 4 subfamily F member 5 OR4F5 79501 6167
Failed to parse: GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt.gz
Maybe I need the header line in the annotation file?
Original header line:
# feature class assembly assembly_unit seq_type chromosome genomic_accession start end strand product_accession non-redundant_refseq related_accession name symbol GeneID locus_tag feature_interval_length product_length attributes
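A quick way to confirm which column numbers hold the fields needed for annotation is to number the header fields with awk. The sketch below recreates the header line above in a temp file, assuming the real header is tab-delimited.

```shell
# Header line from the feature table, recreated with tabs (assumed delimiter).
hdr=$(mktemp)
printf '# feature\tclass\tassembly\tassembly_unit\tseq_type\tchromosome\tgenomic_accession\tstart\tend\tstrand\tproduct_accession\tnon-redundant_refseq\trelated_accession\tname\tsymbol\tGeneID\tlocus_tag\tfeature_interval_length\tproduct_length\tattributes\n' > "$hdr"

# Print the 1-based index of each column needed for CHROM, FROM, TO, and gene.
cols=$(awk -F'\t' 'NR == 1 {
  for (i = 1; i <= NF; i++)
    if ($i == "chromosome" || $i == "start" || $i == "end" || $i == "symbol")
      print i, $i
}' "$hdr")
printf '%s\n' "$cols"
rm -f "$hdr"
```

This reports columns 6, 8, 9, and 15, out of 20 columns in total, so a -c list for the full table would need 20 entries.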
My header
feature
Just get CHROM, FROM, TO, SYMBOL and read/write as TSV:
awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table.txt | head
awk -v FS='\t' -v OFS='\t' '{print $6,$8,$9,$15}' GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr.txt > GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt
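Compress with bgzip, then index with tabix:
bgzip GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt
tabix -s 1 -b 2 -e 3 GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz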
Annotate command:
./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf > annotated_sample_pe.vcf
Annotate command, this time with the bgzipped VCF as input:
./bcftools/bcftools annotate \
-a GCF_000001405.25_GRCh37.p13_feature_table_genes_coding_chr_trunc.txt.gz \
-h header.hdr \
-c CHROM,FROM,TO,INFO/gene \
sample_pe.vcf.gz > annotated_sample_pe.vcf
Quality thresholds:
Filter based on depth and genotype quality (exclude records with DP < 3 or GQ < 7):
./bcftools/bcftools view -e 'INFO/DP < 3 || FORMAT/GQ < 7' full_sample_pe.vcf.gz
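To make the effect of the exclusion expression concrete, here is a plain-awk approximation run on a toy single-sample VCF. It is not a replacement for bcftools: it assumes exactly one sample column, treats missing DP or GQ as passing, and the records are made up.

```shell
# Toy single-sample VCF (hypothetical records).
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample\n'
  printf '1\t100\t.\tA\tG\t50\tPASS\tDP=10\tGT:GQ\t0/1:30\n'
  printf '1\t200\t.\tC\tT\t50\tPASS\tDP=2\tGT:GQ\t0/1:30\n'
  printf '1\t300\t.\tG\tA\t50\tPASS\tDP=10\tGT:GQ\t0/1:5\n'
} > "$vcf"

# Print the positions that survive: drop a record if INFO/DP < 3 or GQ < 7.
kept=$(awk -F'\t' '
  /^#/ { next }
  {
    dp = ""; gq = ""
    n = split($8, info, ";")
    for (i = 1; i <= n; i++) if (info[i] ~ /^DP=/) dp = substr(info[i], 4)
    m = split($9, fmt, ":"); split($10, val, ":")
    for (i = 1; i <= m; i++) if (fmt[i] == "GQ") gq = val[i]
    if ((dp != "" && dp + 0 < 3) || (gq != "" && gq != "." && gq + 0 < 7)) next
    print $2
  }' "$vcf")
printf '%s\n' "$kept"
rm -f "$vcf"
```

Only the record at position 100 survives: position 200 fails on depth and position 300 fails on genotype quality.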
Stats
./bcftools/bcftools stats filtered_annotated_sample_pe.vcf
Assignment instructions:
[x] Align the reads to hg19 build 37 genome
[x] Generate quality statistics from the fastq or bam file using your tool of choice.
[x] Call variants for the regions in test.bed (roi.bed) using your tool of choice
[x] Annotate the variants with gene symbol (preferable)
[ ] Apply some filtrations based on your knowledge (preferable)