bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

wish: HLA typing #178

Closed tanglingfung closed 7 years ago

tanglingfung commented 10 years ago

it may be a good addition http://www.bcgsc.ca/platform/bioinfo/software/hlaminer

chapmanb commented 10 years ago

Paul; Good timing. @heuermh just asked about approaches for this in an e-mail. I'd definitely like to include something to cleanly handle HLA regions and am open to anything folks have good experience with.

tanglingfung commented 10 years ago

@heuermh has had a need for this for a while

tanglingfung commented 10 years ago

here is another caller; but the software is only available on request and for academic use http://nar.oxfordjournals.org/content/41/14/e142.long

another paper from omixon describes their method and provides a dataset from 1KG for testing http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0078410 https://s3.amazonaws.com/omixon-publication/hapmap_hla/HapMap_1KG_HLA_suppl_filtered_reads.tgz

tanglingfung commented 10 years ago

after some exploration, it seems that HLAminer is not straightforward to install or get running, and ATHLATES' licensing makes it a non-starter

tanglingfung commented 10 years ago

I just heard some good feedback on omixon's results today (I didn't try it myself)

chapmanb commented 9 years ago

Paul and all; We're revisiting this thanks to GRCh38 (#817), the 1000 genomes distribution and Heng Li's HLA typing work:

https://github.com/lh3/bwa/blob/master/README-alt.md#hla-typing

@bioinfo identified some great testing datasets to use for validation:

http://www.omixon.com/hla-typing-example-data/

with some backup papers to explore as well:

http://nar.oxfordjournals.org/content/early/2015/03/09/nar.gkv184.full http://genomemedicine.com/content/4/12/102

After GRCh38 support is in place we plan to evaluate and test the HLA support.

heuermh commented 9 years ago

Thanks for the update, @chapmanb. We've also seen Heng's bwakit work but he hasn't been receptive to pull requests thus far (two open since Jan 5 2015). We have evaluated the HLA calls against our own well-characterized QC samples and there is room for improvement; not sure what to do at this point.

chapmanb commented 9 years ago

Michael -- we'd love to work with you on the validation and I'd fully lean on your experience here. Do you think the Omixon data is a good test set to start with? Are your QC samples publicly available to iterate on, or would you be able to test a bcbio/bwakit pipeline internally?

I think if we can put together a validation set and show improvements we could get these rolled into the bwakit approach. I'm sure the pull requests probably slipped -- I can't even imagine what Heng's inbox looks like.

Thanks again for helping with this. Excited to finally have a path forward for this.

ohofmann commented 8 years ago

Another framework to consider for cross-validation: http://www.biomedcentral.com/qc/1471-2164/16/S2/S7

roryk commented 8 years ago

Thanks everyone, we'll be starting work on doing this soon so any updated experiences would be great.

heuermh commented 8 years ago

@roryk let me know if you'd like to set up a chat or call or shared doc on this, I could invite some of my former colleagues

schelhorn commented 8 years ago

We'd also love to see this implemented, +1.

roryk commented 8 years ago

Hi @heuermh,

That would be awesome. Let me know what I can do to help make that happen.

chapmanb commented 8 years ago

As a way to get started, I put together a validation of bwakit hg38 HLA calls for the Omixon test data. bwakit does a good job on ~40x exome data from 1000 genomes, but fails entirely on the higher depth (2000x) targeted data:

https://gist.github.com/chapmanb/8e2a18c7bbbee3167395

It's a promising starting point and we can iterate on this. If you want to test on any local data, you'd need to align against hg38 and use the undocumented parameter: hlacaller: bwakit. Happy for any feedback and suggestions to help improve the calls. Thanks much.

chapmanb commented 8 years ago

I added support in the latest bcbio development for HLA calling with OptiType (https://github.com/FRED-2/OptiType), using hlacaller: optitype with hg38 alignment. OptiType does a great job on both the exome and high depth targeted data. It only differs from the truth set on NA18964 calls but otherwise is spot on with expected:

https://gist.github.com/chapmanb/8f994618a7fc5e88f893

I'm excited to have this in place and will use OptiType going forward and will work on documenting it for the next release.

roryk commented 8 years ago

This is awesome.

schelhorn commented 8 years ago

Great job, Brad! Which aligner is used for OptiType then - does it do an extra RazerS3 alignment on its own or does it use bwakit?

chapmanb commented 8 years ago

Rory and Sven-Eric; Thanks much. This does a two step approach:

The first selection pass helps avoid a memory intensive RazerS3 step on the second pass, since we've already isolated to only HLA reads. It's nice to see this working cleanly with the very high depth targeted data as well.

schelhorn commented 8 years ago

Sweet; does using the full hg38 with alts work cleanly in the variant2+HLA typing+SV calling workflow then, including the Gemini annotation? Some annotation will be based on lift-over data, of course, but I wasn't sure if this is fully supported by both Bcbio and Gemini yet. Is it?

chapmanb commented 8 years ago

Sven-Eric; The full variant pipeline with hg38 is almost completely there. You can do HLA, variant calling and SV calling. The few things that are missing:

So we're nearly there with 38 support and will be continuing to try and fill in the missing gaps.

heuermh commented 8 years ago

@chapmanb were you able to work around the non-commercial license for CPLEX, one of the OptiType dependencies? Looks like other ILP solvers would work. In any case, the installation process is longer than I have patience for, so I look forward to taking advantage of what you've done. :)

The results are pretty good for two field resolution. Changing the reported and expected alleles to use GL String format may help clarify the ambiguity; for example I don't believe Omixon is reporting that sample NA18526 has three copies of HLA-B, rather they were unable to resolve the allelic ambiguity between HLA-B*58:01 and HLA-B*58:02 for one of the two copies of HLA-B. This row could be represented by HLA-B*40:01+HLA-B*58:01 and HLA-B*40:01+HLA-B*58:01/HLA-B*58:02, and as such, I would say the OptiType typing validates.
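To make the NA18526 reasoning concrete, here is a minimal Python sketch of how a call could be checked against an expected genotype written as a single-locus GL String, where `+` separates the two gene copies and `/` separates ambiguous alleles for one copy. The function names (`parse_genotype`, `two_field`, `validates`) are hypothetical and not part of bcbio, OptiType or any GL String library; the sketch ignores the other GL String operators (`^`, `|`, `~`) and compares at two field resolution only.

```python
from itertools import permutations


def parse_genotype(gl_string):
    """Split a single-locus GL String into one allele set per gene copy.

    '+' separates gene copies; '/' separates alleles that could not be
    distinguished for a given copy.
    """
    return [frozenset(copy.split("/")) for copy in gl_string.split("+")]


def two_field(allele):
    """Truncate an allele name to two fields, e.g. HLA-B*58:01:01 -> HLA-B*58:01."""
    gene, fields = allele.split("*")
    return gene + "*" + ":".join(fields.split(":")[:2])


def validates(called, expected):
    """Return True if every called copy matches a distinct expected copy.

    Two copies match when their allele sets share at least one allele at
    two field resolution. All assignments of called copies to expected
    copies are tried, so the order of copies does not matter.
    """
    called_copies = [frozenset(two_field(a) for a in copy)
                     for copy in parse_genotype(called)]
    expected_copies = [frozenset(two_field(a) for a in copy)
                       for copy in parse_genotype(expected)]
    if len(called_copies) != len(expected_copies):
        return False
    return any(
        all(c & e for c, e in zip(called_copies, perm))
        for perm in permutations(expected_copies)
    )


# NA18526 HLA-B row from the discussion above: the called genotype is
# compatible with the ambiguous expected genotype, so it validates.
called = "HLA-B*40:01+HLA-B*58:01"
expected = "HLA-B*40:01+HLA-B*58:01/HLA-B*58:02"
print(validates(called, expected))
```

With this representation the expected row is read as two copies, one of them ambiguous, rather than three separate alleles, which is why the OptiType call counts as a match.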

chapmanb commented 8 years ago

Michael; Thanks much. This uses GLPK so we don't need any commercial-restricted components. All of the installation is now packaged using conda so you can do:

conda install -c bioconda optitype

to get it installed along with the external dependencies like GLPK and RazerS3.

For typing, I agree on NA18526 and think OptiType is fine here. I could adjust the truth set from Omixon to reflect this. The only calls that do not appear to validate are A and C for NA18964.

lpantano commented 7 years ago

Hi

I am closing this because it seems to be an old issue and HLA typing now appears to be part of the pipeline if needed. Come back if you find other issues or want to continue with this one.

cheers