bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
984 stars 353 forks source link

Support GRCh38 #817

Closed schelhorn closed 6 years ago

schelhorn commented 9 years ago

I was wondering if there are any plans to support GRCh38 as an additional assembly. The alternate loci and decoy sequences are likely to improve both variant calling and expression studies. I suppose that the availability of genomes and annotations are not much of a problem anymore, but not all tools may be compatible. Could someone detail the major roadblocks concerning this issue? Thanks.

chapmanb commented 9 years ago

Sven-Eric; We'd love to support GRCh38 and I agree that it's a much improved build. The only major blocker is that we don't currently have any reference standards to validate against so any update would be blind to the performance of the callers or our implementation. The Genome in a Bottle consortium is currently working on this so hopefully this is only a short term issue.

Beyond that, it's a matter of putting together the time and resources to work on this. The major steps we need to do are:

We're hopeful to get a project that needs/asks for GRCh38 we could use to fund the work on this, or happy to work along slowly as we have time to add it in. Thanks for the discussion.

schelhorn commented 9 years ago

Great, thanks for the comprehensive answer. It's good to be conservative about references and I for one am happy to wait for validation standards and GEMINI support if helps us to secure the validity of the results. Please feel free to close this issue. Perhaps it could make sense to copy the answer to the documentation in case anyone else is looking for it?

chapmanb commented 9 years ago

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

arq5x commented 9 years ago

We haven’t yet started coordinating annotations for GRCh38, but this is a very useful collection. My big concern is how to port datasets such as ESP and ExAC, as they will almost certainly not be recalled on GRCh38!

Aaron

On Apr 14, 2015, at 2:12 PM, Brad Chapman notifications@github.com wrote:

Sven-Eric, definitely want to keep this moving even if slowly. To get the process going here are some notes on GRCh38 resources. We can continue to collect them here and merge into the automated data installation. ccing @arq5x and @brentp in case you've started working on resource collection/installation for GEMINI already as I'd love to coordinate with you:

FASTA file with decoy, ready for bwa-mem, including bwa and bowtie indices: ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

With alternatives: GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna No alternatives alleles: GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

dbSNP for GRCh38: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

Full: 00-All.vcf.gz

Annotations from Ensembl: ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/

release79: Homo_sapiens.GRCh38.79.gtf.gz Needs to be converted to use chr style prefixes instead of bare to maintain consistency with NCBI and UCSC.

COSMIC (licensing restrictions with v72): sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz sftp://sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/VCF/CosmicNonCodingVariants.vcf.gz

ClinVar: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

— Reply to this email directly or view it on GitHub.

chapmanb commented 9 years ago

Thanks Aaron -- we're going to have to hold our noses and get familiar with liftOver and friends, at least in the medium term. It looks like there are 3 different tools:

and some relevant discussion:

https://www.biostars.org/p/65558/ https://www.biostars.org/p/113184/

Does anyone have good/bad experience with any of these?

chapmanb commented 9 years ago

From @druvus -- notes from SciLife's work on building a GATK bundle for 38:

https://wabi-wiki.scilifelab.se/display/SHGATG/gatk+bundle+in+hg38

mw55309 commented 9 years ago

Hi Guys

Bringing in Alison Meynert (@ameynert) who is one of our medical genomics experts in Edinburgh. She has been working on lift over of GRCh37 GATK resources to GRCh38.

Cheers Mick

ameynert commented 9 years ago

I've been running CrossMap, and see similar results to what the SciLife post reported. Nearly all sites can be mapped to GRCh38 for all the GATK resources with the Ensembl chain file:

0.9983 1000G_omni2.5.b37.vcf 0.9974 1000G_phase1.indels.b37.vcf 0.9986 1000G_phase1.snps.high_confidence.b37.vcf 0.9961 dbsnp142.b37.vcf 0.9986 hapmap_3.3.b37.vcf 0.9976 Mills_and_1000G_gold_standard.indels.b37.vcf

I'll take a look at the breakdown of missing vs. incorrectly mapped for the dbSNP benchmark, and run dbSNP138 as well as 142.

arq5x commented 9 years ago

Very interesting, thanks! I expect that this approach will work equally as well for ESP, Clinvar, and the ExAC dataset. However, I wonder how appropriate lift over is for annotations such as CADD and fitCons where local genomic context factors into the scores assigned. In other words, the new location is likely not in doubt, but the question is whether the score calculated for b37 is appropriate for b38. I suspect tht there are many cases where this is not the case.

chapmanb commented 9 years ago

Alison and Mick -- thanks so much for investigating the CrossMap approach. Those numbers looks great, I'd be happy to include the GATK supporting files into bcbio so we have them available. We can always iterate and update as we continue to evaluate these.

Practically, I added in downloads of hg38/GRCh38 (without alternative sequences) to bcbio. If you update to the latest version of the development code you can do:

bcbio_nextgen.py upgrade -u development
bcbio_nextgen.py upgrade --data --genomes hg38-noalt

and it will install the reference genome, bwa/bowtie2 indices, and dbSNP 142. The references and indices all come from NCBI:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS

I haven't done any runs with the hg38 genome but having the basic data installation in place is step one.

Aaron, on the logistics side I used your GGD approach to manage the installation. The separation between the download/prep code and the machinery to actually do the installations is nice -- it makes the data provenance clear. I needed to stray from your initial GGD implementation of streaming the results of one download to one file since these preps involved picking and prepping multiple files from the initial tarballs. I implemented a take on GGD which uses a temporary directory, runs all of the file prep commands as a bash script, then pulls out the final files into the output directory when updated:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/ggd.py

I'd love to try and coordinate this with what you want to do for GEMINI so we can have a single approach to get one set of downloaded data.

Thanks all for the discussion and ideas.

druvus commented 9 years ago

Hi, just want to bring in Pall Olason (@pallolason) who did the liftover work at SciLifeLabs into the discussion.

cheers, Andreas

ameynert commented 9 years ago

The GATK bundle file of dbSNP138 excluding 1000 Genomes samples should also get lifted over for use with VQSR. There are issues with tranche calculations caused by hard-coded expectations of pre-1KG data. See the last comment on the GATK forum here: http://gatkforums.broadinstitute.org/discussion/5221/tranches-plot-issue

The presenters at this week's GATK roadshow said they were hoping to have this fixed in release 3.4.

pallolason commented 9 years ago

Thanks for the heads up @druvus. @ameynert , did the Ensembl chain files work for you out of the box with CrossMap? When I use them to map vcf files, I have to edit the "target" genome chromosome names to "chrN" style. Otherwise CrossMap will not fetch the REF allele from the fasta file. I reproduced that with the CrossMap tool on Ensembls assembly converter page: http://www.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core

ameynert commented 9 years ago

@pallolason The chain files worked out of the box for coordinate mapping only, but not for the reference bases, as you found. I'm trying the UCSC hg19 to hg38 chain files to see how those compare to hacking the Ensembl chain files to use 'chrN' style coordinates. It's strange that CrossMap has this limitation since the VCF files I'm converting are all in 'N' chromosome style.

chapmanb commented 9 years ago

Alison -- for GATK VQSR plotting compatible dbSNPs, we could subset the existing NCBI GRCh38 dbSNP VCF. It has dbSNPBuildID for the introduced version, plus KGPhase1 and KGPhase3 to indicate 1000genomes versions. That'll help us create a subset file directly from the already lifted over version instead of doing it twice. Since it's just a plotting issue it might be worth waiting until 3.4 and seeing what the GATK folks recommend before doing this.

When you think the lifted over supplemental files are ready I'll be happy to include them in the bcbio distribution. Just let me know.

For the full GRCh38 build plus alts, we'll plan to use the distribution from 1000 genomes:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

This is the NCBI full build plus additional HLAs from @lh3 which will hopefully get us HLA typing:

https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing

pallolason commented 9 years ago

Thanks @ameynert. The hg19tohg38 chain file from UCSC works fine for me, comparable numbers as with the Ensembl chain file after realizing that my initial benchmarks had a sorting glitch (intersectBed -sorted expects lexically sorted chromosomes and I used vcf-sort -c which numerically sorts the chromosomes). Note that once lifted over, the vcf files may contain entries where the ref and (one of the) alt alleles are the same and GATK does not like those: http://gatkforums.broadinstitute.org/discussion/4489/when-the-bundle-of-supportive-files-are-going-to-be-available-in-ftp-site-for-hg38-release.

chapmanb commented 9 years ago

Hi all; We pushed a new release (0.8.9) with initial support for build 38 for variant calling. There are two builds hg38 (the 1000 genomes distribution with all the NCBI alts + HLAs, prepared by Heng Li) and hg38-noalt (like hg37, but with the new build coordinates). Both will run a standard variant calling pipeline with annotations. The additional work we still need to do includes:

Please let us know if you run into any issues with 38 as is and thanks for all the help and support on getting it this far.

ameynert commented 9 years ago

I've been testing a standard BWA+GATK alignment pipeline with GRCh38 resources on exome samples using:

Test runs on truncated FASTQ files are fine. I'm running 55 high depth samples through over the weekend and will post Monday on the results. The same set of samples were run on GRCh37 using otherwise identical parameters earlier in the week, so I should be able to make some useful comparisons.

@chapmanb Filtering down the existing dbSNP142 to exclude the post-129/1KG sites is a good plan. GATK 3.4 is out today. Their blog says "Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv)." (https://www.broadinstitute.org/gatk/blog?id=5562), which is a little unclear, but that could mean the issue has been addressed. Unfortunately I don't have time to test this myself just now - I'm going to be mostly offline for the next 4 weeks - but I will certainly have a look when I get back.

chapmanb commented 9 years ago

Alison; Brilliant, thank you for testing and preparing the VQSR supplemental files. Are those publicly available so we can incorporate them into bcbio? I'm happy to host on S3. With those, I'll test VQSR and the tranche plots in 3.4 to see if we can avoid needing to support multiple dbSNPs. Thanks again for all the work and hope you have a relaxing time off.

ameynert commented 9 years ago

I'll post the files once I've confirmed the high depth exome runs are ok, and put a link in this thread.

I used this file for dbSNP142: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b142_GRCh38/VCF/All_20150416.vcf.gz

ameynert commented 9 years ago

I've done a test of SNP and indel VQSRs with my GRCh38 GATK bundle files on four sets of lifted over VCFs, each with a few hundred exomes. No crashes, the results look fine, and the tranche stats are very close to those from the original GRCh37 VCFs. I've uploaded the files to my university's FTP server and sent the links to Brad so he can put them up and everyone else can have a crack at testing.

mw55309 commented 9 years ago

Hi

I want to bring in @lauraclarke who has been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on liftover from GRCh37

Cheers Mick

lauraclarke commented 9 years ago

Hi Mick, Alison

I am more than happy to chat about what we are doing to remap the 1000 Genomes data.

Do let me know what issues you want to talk about.

thanks

Laura

On 19 May 2015 at 08:55, mw55309 notifications@github.com wrote:

Hi

I want to bring in @lauraclarke https://github.com/lauraclarke who has been gathering GRCh38 resources for the 1KG project at EBI. Hopefully we can work together!

Laura: Alison Meynert from IGMM here in Edinburgh has been working on liftover from GRCh37

Cheers Mick

Reply to this email directly or view it on GitHub https://github.com/chapmanb/bcbio-nextgen/issues/817#issuecomment-103387916 .

chapmanb commented 9 years ago

Alison -- thanks so much for preparing the Broad bundle VQSR resource files. I uploaded these to an S3 bucket and added downloads for bcbio, so doing a bcbio_nextgen.py upgrade --data will prepare these to be ready to go.

Laura, it would be great to have you involved. Our goal has been to collect and create resources for 38 so we can start having the annotations and analyses we're used to on 37. We have a simple recipe based approach taken from what Aaron Quinlan put together with the current resources we've collected:

https://github.com/chapmanb/cloudbiolinux/tree/master/ggd-recipes/hg38

From my perspective some useful points of collaboration would be:

Thank you all again for the work so far. Great to see this moving along.

lauraclarke commented 9 years ago

Hi Brad

This looks great, I would like to involve Holly from my team in this conversation but I don't seem to be able to tag her. Not sure if I am doing something wrong (her git user profile is https://github.com/HollyZB)

So we are in the process of remapping 1000Genomes data to GRCh38.

Our plan is to use the newest version of bwa with alt aware mapping. I see you are already pointing to our reference files

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/

There is also a version of dbSNP 142 on GRCh38 coords and the phase 3 indels (much better than the phase 1 indels). We don't have the devine/mills variants, We might have to pinch them from you as we haven't started quite yet. Are there instructions on how we might use those yaml files to actually download the data anywhere, we haven't used ggv yet

One open question is how different mapping to just the primary ref is vesus primary plus alts, decoy and hla. We are starting to build data for this. We have the first few alignments of the high coverage PCR free 2*250 samples that we have aligned both ways.

e.g

laura@pg-trace-001[ftp]:grep GRCh38 current.tree | grep cram | grep HG00096 | cut -f1 ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram.crai ftp/technical/working/20150511_IGSR_highcov_cram_no_alt/HG00096/high_cov_alignment/HG00096.bwamem_GRCh38primary.20150424.high_coverage.bam.cram ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram.crai ftp/data/HG00096/high_cov_alignment/HG00096.alt_bwamem_GRCh38DH.20150424.high_coverage.bam.cram

We want to compare between the two mapping sets. If you already have any tools to do this or just suggestions on what the best metrics to compare the files on, we would be happy to hear them

thanks

Laura

lauraclarke commented 9 years ago

I also realised I should say, if there is a good set of GRCh38 mapping resources, we are more than happy to host it on the 1000G ftp site, it is a natural place people come to look for such things anyway

chapmanb commented 9 years ago

Laura -- thanks so much. You can notify/subscribe Holly to the discussion with an @ in front of her username (@HollyZB), and she'll get all future messages in the thread.

I appreciate all the help coordinating these files. It's great to be working on this as a community. Here's a pointer to all the files Alison prepared for direct downloads:

http://biodata.s3-website-us-east-1.amazonaws.com/hg38_bundle/

It would be brilliant to host these on the 1000G ftp site, if you're willing. One practical thing we've been trying to do is keep the chromosome naming to one consistent approach so we don't have the continued naming issues we have with GRCh37/hg19 (1 versus chr1). We're using the chr-prefixed style you also have in your reference distribution, so hopefully all is compatible.

For the VQSR resources you've prepared, do you have experience on how these truth inputs compare to the sets Broad make available that Alison lifted over? We're open to use anything that is tested and validated, so would definitely lean on your expertise.

For validation in general, you've probably already seen Heng's hapdip validation:

https://github.com/lh3/bwa/blob/master/README-alt.md#preliminary-evaluation

We're planning to also do a full vs noalt comparison using NA12878, Platinum Genomes fastqs, and the Genome in a Bottle reference. Deanna Church is using remap to convert the 37 Genome in a Bottle reference calls to 38 and we'll run validations against those (in the style of http://bcb.io/2014/10/07/joint-calling/). Happy to coordinate on any and all of this.

Thanks again for all the discussion and work on 38 support.

HollyZB commented 9 years ago

Thanks Brad.

I downloaded Mills' indels lifted by Alison. It passed our standard validation and yes indeed it needs "chr" in the start of all data lines. I will put it up at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources

schelhorn commented 9 years ago

I guess this thread shows why bcbio is a community-developed pipeline -- awesome! I have two tidbits to add, both of which may be resources that are well known but I still wanted to mention them here for reference:

The Seattle Seq Annotation Server has lifted over to hg38 PolyPhen, RepeatMasker/TandemRepeat, KEGG, UCSC CpG Islands, ESP, ExAC, GERP, CADD, and three HapMap populations (African, Asian, European).

Annovar has lifted hg38 resources for ljb26 (whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, PhyloP and SiPhy scores from dbNSFP), cosmic v70, esp6500, ExAC, 1kG (October 2014), NCI60 and ClinVar.

In both cases I do not have more information than the statements referring to hg38 on the linked websites, but I am sure the developers of Seattle Seq and ANNOVAR would be open to collaboration if someone would contact them. In that manner development of high-quality canonical resources could be furthered by reconciling duplicated work earlier rather than later.

lauraclarke commented 9 years ago

On existing snp resources, the VEP has been available on GRCh38 since last year and Ensembl v80 which was released yesterday has all the 1000G data but chrX, Y and a small piece of chr12 as they werent in dbSNP 142.

schelhorn commented 9 years ago

So I gather that most resources for GEMINI-based hg38 variant annotation should be available already, with possible exclusions of OMIM and HPRD (or are they available via UCSC tracks)? For CADD and FitCons (the latter is included in CADD 1.2) the authors may have to re-score on the new assembly due to the genomic context issue @arq5x mentioned. I can ask Martin Kircher of CADD to chime in and give an opinion, we used to be in the same group.

schelhorn commented 9 years ago

Martin kindly answered right away and I am posting his position concerning CADD liftover here:

CADD depends on several tracks that are not yet released for GRCh38 or which we would need to recalculate (e.g. conservation metrics excluding the human reference but on human coordinates [PhastCons, PhyloP], GERP and several ENCODE super tracks). Especially for the ENCODE information, we are considering to switch to the Ensembl Regulatory Build which has been build for both genome releases now. Overall, Ensembl seems to be a great resource for GRCh38 annotations. While we would like to release a GRCh38 version of CADD in 2015, we have not actually started this effort yet. At the moment, lifting scores is the only option, but it does not give you any benefit from using the new genome build (i.e. additional >3MB of genome sequence, fixed reference errors, removed minority alleles, alignment errors in annotations derived from GRCh37). If you liftOver a score set, I would recommend the last major release (1.0) rather than the developmental ones that are more experimental. We are also aware of a problem with overlapping gene annotation in 1.1/1.2. We are still optimizing training parameters and hope to have 1.3 out in the not so far future. Regarding fitCons, I would not recommend using the version that is available in the CADD annotation files. This is not the latest available version. Also note that CADD is not using fitCons, we only included it to make comparisons easier for us.

chapmanb commented 9 years ago

Nice Reddit discussion about pros and cons of moving to build 38:

http://www.reddit.com/r/genome/comments/3b3s3t/switch_from_hg19build37_to_hg20build38/

with a pointer to this effort by Shuoguo Wang to remap the Broad bundle:

https://github.com/iiiir/GRCH38_gatk_bundle

schelhorn commented 8 years ago

So the impossible just happened:

@dgmacarthur: OK, @KMS_Meltzy has convinced me. Firing up the terminal to remap ExAC onto GRCh38 as we speak. #gi2015

mw55309 commented 8 years ago

I think he may have been being sarcastic ;-)

dgmacarthur commented 8 years ago

Yes, that was sarcasm. One day we'll switch. Not today.


Sent from my phone.

On Oct 29, 2015, at 6:55, mw55309 notifications@github.com wrote:

I think he may have been being sarcastic ;-)

— Reply to this email directly or view it on GitHub.

schelhorn commented 8 years ago

Sad panda 🐼. Well, perhaps next year then. @dgmacarthur should start a kickstarter for financing the CPU cycles. I'd pledge the re-analysis of 100 exomes...

roryk commented 8 years ago

I've added hisat2 support so we can support this for RNA-seq but we've been having some issues with it so it isn't live yet. Transcriptome-only methods like Sailfish shouldn't have these same issues so we should be able to support this for RNA-seq officially pretty soon.

roryk commented 8 years ago

hisat2 seems to work well for handling RNA-seq alignments, and I ran the SEQC dataset through hg38 and hg38-noalt and it seems to do well quantitating hg38 with the alts, so we're good to go for that.

schelhorn commented 6 years ago

Thanks for all the insightful comments; since hg38 support in bcbio has since been completed for both DNA-Seq variant calling/annotation and RNA-Seq expression quantitation I am now closing this issue.

Please feel free to open a new bcbio issue in case you have an interest in robust and fully-featured pipelines on genome graphs, which represent one of the important next steps in high-volume NGS processing that we should consider taking.

In particular, the bcbio devs would be interested in obtaining publicly available test datasets (for instance for NA12878) to showcase the advantages of/benchmark genome graphs compared to flat references with the currently available toolchains.