Long read sequencing alignment and (joint) variant calling

WimSpee commented 4 years ago

Dear bcbio developers,

For some of the species that we work on we are now starting to receive significant numbers of long read sequenced samples. The samples are sequenced using nanopore, PacBio CLR or PacBio HiFi.

I noticed that you made a list of priorities, which totally makes sense. Long read sequencing is not yet on that list (as far as I can tell) as a priority or a (bottom) of backlog item.

We are happy users of bcbio for analyzing short read sequencing data. It would be very cool if bcbio also would be able to process long read sequencing data.

Specifically it would be nice if there was a best practice pipeline in bcbio for alignment and variant calling of long read sequencing data. For example using NGMLR or Minimap2 for alignment, Longshot for variant calling and Sniffles for detecting structural variants.

A very nice, but probably very difficult feature, would be joint analysis creating squared off short variant and SV/CNV tables based on Illumina and PacBio/Nanopore data. Unless significant support for that would be added in e.g. a new GATK version (which you maybe know is or isn't to be expected soon).

I can totally understand if support for long read sequencing data is not (yet) a priority. Anyhow I would be interested in your thoughts about support in bcbio and best practices for long read sequencing data analysis in general.

Thank you for your thoughts on this.

naumenko-sa commented 4 years ago

Thanks for starting the discussion!

We also receive some long-read sequencing projects from time to time, so I think it is a totally valid discussion. Let us keep it open for some time to gather more opinions.

Attn @ZhuZhuoHSPH

Sergey

vdejager commented 3 years ago

Wim has a good point here.

I would like to throw in the long-read RNA-seq discussion as well. We get more and more requests for differential analysis between tissues with respect to both expression as well as isoform usage. At the moment we use SQANTI to analyse such data and the bcbio infrastructure could be very useful and would amend the RNA-seq features already in place in bcbio

Vic

WimSpee commented 3 years ago

PacBio HiFi BAM files seem to already be accepted by GATK4. https://www.pacb.com/wp-content/uploads/Application-Brief-Variant-detection-using-whole-genome-sequencing-with-HiFi-reads-Best-Practices.pdf

I prefer GATK4 over GoogleDeepVariant since it is easier to run on local infrastructure and is more of a standard tool already used widely for Illumina reads. GoogleDeepVariants is currently better at calling indels on PacBio HiFi reads.

I think minimap2 already supports PacBio HiFi reads (=PacBio CCS) ./minimap2 -ax asm20 ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio CCS genomic reads

PacBio HiFi reads might be easiest to implement in bcbio. Since they are long and correct and therefore can just be treated as long Illumina reads, using most of the existing Illumina bio-informatics tools in bcbio? sequencing_type_comparison https://www.nature.com/articles/s41576-020-0236-x

Only issue is the remaining homopolymer sequencing error profile in PacBio HiFi. Hopefully these get modeled in GATK4?

If GATK4 just takes in these PacBio HiFi BAM files as 'long read Illumina BAM files' then it is also easy to run a joint analysis using real short read Illumina BAM files?

I hope this information is useful.

WimSpee commented 3 years ago

@naumenko-sa I thought I just give it a try. There are public PacBio HiFI datasets available:

https://www.pacb.com/smrt-science/smrt-resources/datasets/ https://www.pacb.com/blog/hifi-data-release/

From these I chose the public HiFi reads for species Oryza sativa sample Minghui 63: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA558396

I also downloaded some random public Oryza sativa Illumina samples: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA658863

Only later I found there is public Illumina data also for Minghui 63. I did not yet include these. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA276972

I could start a joint alignment and variant calling analysis from fastq in bcbio with the combined set of Illumina and PacBio HiFi reads. This used the aligner bwa-mem and GATK4 for creating the GVCF files and the multi-sample VCF.

The resulting BAM files look good to me. Top track HiFi, bottom track Illumina. By accident the HiFi and Illumina samples by accident are similar guess. IGV_HiFi_and_short_Illumina

I did not use auto-consensus mode in IGV. HiFi reads still look accurate, mostly noise free, against the reference.
As expected a long 156 inertion can be seen in the HiFi reads. Which is probably also in the Illumina sample but can't be found using 150bp reads.

The resulting multi-sample short variant table also look as expected to me. HiFi_Illumina_snp_table

As expected:

There are false positive SNPs and Indels in the PacBio HiFi samples. Not all of the first 5 HiFi samples (which are all Minghui 63 ), always have the same (reference) genotype, sometimes 1 or more of the HiFi samples is HET (probably caused by a homopolymer sequencing mistake).
Sometimes there are genotypes for the HiFi samples but not for the Illumina samples. Probably regions were no Illumina reads can be aligned with confidence.
Sometimes there are Illumina genotypes but no HiFi genotypes. While there is data in the BAM files of the HiFi samples for these regions. Maybe a bug in the bcbio per sample determination of callable regions? Or in maybe in GATK variant calling?

I commented on a ticket at GATK to ask them about support and best practices for variant calling HiFi BAM files: https://gatk.broadinstitute.org/hc/en-us/community/posts/360072716972-Variant-calling-with-PacBio-HiFi-reads

So the good news is that HiFi reads are already compatible with bcbio and joint analysis with Illumina reads.

Performance can probably be improved by using minimap2 instead of bwa-mem for alignment. I did not try this because I don't know if the default minimap2 settings in bcbio are correct for HiFi, or how to change these.

And genotype concordance analysis probably still needs to be done against a truth set. This to iteratively improve the HiFi variant calling in bcbio/GATK.
This probably makes most sense against a human truth variant set?

Which there maybe is for Homo sapiens – GIAB sample HG002: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA586863

I think you already do genotype concordance testing in bcbio to make sure that the quality of the bcbio pipeline is good. Maybe you could include PacBio HiFi reads in the bcbio quality assurance / testing framework? If there are HiFi reads for your truth data. Or the other way around, if there is truth data for for example GIAB sample HG002?

I hope this information is useful.

naumenko-sa commented 3 years ago

HI @WimSpee !

Thanks for facilitating the discussion and providing so many interesting details on the topic! Yes, there is truth variants for Ashkenazim Son - HG002:

Sergey

bcbio / bcbio-nextgen

Long read sequencing alignment and (joint) variant calling #3282