chapmanb / cloudbiolinux

CloudBioLinux: configure virtual (or real) machines with tools for biological analyses
http://cloudbiolinux.org
MIT License
257 stars 158 forks source link

Updates to hg38 reference #258

Closed matthdsm closed 7 years ago

matthdsm commented 7 years ago

Hi Brad,

What's your vision on updating the hg38 reference fasta to include the latest patches? Lately, we've run into some alt coordinates that aren't included in the current available reference, but are included in the latest refseq releases.

Cheers M

chapmanb commented 7 years ago

Matthias; Thanks for bringing this up. Updating the reference genome is always tricky but we can try to do if there is a compelling case to make it worthwhile and the process can be relatively painless for users. I'd prefer to rely on distributions like the 1000 genomes or Broad bundles that are widely used, since those often handle the edge cases and make it easier to communicate the exact distribution we're using. Do you know if any major projects have incorporated the patches we could work from?

matthdsm commented 7 years ago

Hi Brad,

The latest patch can be downloaded from the FTP of NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.26_GRCh38.p11) with a new patch scheduled for winter this year. Is this something you can work from? Indexes can be generated during installation is suppose. This will create some overhead for the users during installation, but it's not something that will need to be done frequently. Alternatively, indices can be hosted on your biodata S3 bucket of you think that to be a better option.

Cheers M

matthdsm commented 7 years ago

Of course, letting the users generate their own indices does have some benefits

A quick skim of the ggd recipes show that it's only the bwa and fasta indices that should be generated, so I don't think it's really all that bad.

M

chapmanb commented 7 years ago

Matthias; Thanks for the details. The major challenge with updating reference builds aren't preparing the resources on our side, it's that it can cause incompatibilities with previous pipeline runs. Some tools are careful about sequence dictionaries in BAM and VCF files and have issues when introducing additional contigs/haplotypes. To avoid causing burden on users, we'd update infrequently and hopefully only when a build provider like Broad or 1000 genomes incorporates the new contigs.

What is your motivation for including the patches? Are they tackling some problems you're finding with the current build? It would help to understand the cost/benefit tradeoffs as we debate about upgrading. Thanks again for the discussion.

matthdsm commented 7 years ago

Hi Brad,

I understand updating the reference could/will cause some problems for older runs. Frankly, we'd have the same issues. The case for an update is that the 1KG and Broad packages are already quite old, and the latest patches include some some clinically significant updates for several genes (cfr. http://genomeref.blogspot.be/2017/09/grch38p11-clinically-relevant-updates.html). Since we're working in a clinical environment here, this may become important in the future. Currently we're already running into some limitations during some downstream coverage analysis where we use the latest refseq regions (which do included the new patches). We have a workaround in place, and haven't really spotted any major issues, but it'd be nice to know we're working with up to date references.

Cheers M

chapmanb commented 7 years ago

Matthias; Thanks for the additional details. It does sound useful to include these and think of an approach for doing so. In your last paragraph it sounds like you're running into problems with use them though? Or am I misreading?

The biggest worry I have here is that I don't think bcbio is the right place to make human genome analysis distributions. I would want to explore this more if we could use a pre-prepared distribution somewhere. Ensembl and UCSC have some of this but it doesn't seem like they have anything you could download directly that includes things like the HLA contigs. Do you know of any distributions that are being updated with patches?

matthdsm commented 7 years ago

Hi Brad,

There aren't any issues with the latest patch (as far as I know anyway). The problem we're having is that we're using the variant_regions option with a bed file based on the latest refseq release. This causes some incompatibilities, because bcbio thinks we're asking for coordinates that are outside of their chromosome.

For example:

[2017-09-29T03:20Z] pbsnode-2: Unexpected error
Traceback (most recent call last):
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 50, in _setup_logging
    yield config
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipythontasks.py", line 235, in prep_samples
    return ipython.zip_args(apply(sample.prep_samples, *args))
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 185, in prep_samples
    data = clean_inputs(data)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 196, in clean_inputs
    clean_vr = clean_file(dd.get_variant_regions(data), data)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/bedutils.py", line 93, in clean_file
    check_bed_coords(in_file, data)
  File "/home/galaxy/bcbio/anaconda/lib/python2.7/site-packages/bcbio/variation/bedutils.py", line 74, in check_bed_coords
    (line, in_file))
ValueError: Found BED coordinate off the end of the chromosome:
chr12_GL383553v2_alt    152742  152894  NM_001286615.1_cds_4_0_chr12_GL383553v2_alt_152763_r|NM_001286616.1_cds_4_0_chr12_GL383553v2_alt_152763_r|NM_178826.3_cds_4_0_chr12_GL383553v2_alt_152763_r     0       -,-,-
/home/projects/bcbio_annotation/ref/variant_regions/RefSeqExomeAndPanels_20170927.bed
Is the input BED from the right genome build?

As for your worries about making bcbio a human genome distro, I understand. No reason for us to do something others already do for us. What are your requirements for a pre-prepared distribution? Do the resources provided by NCBI lack something?

Cheers M

chapmanb commented 7 years ago

Matthias; Thanks for the details. Are the coordinates of the alt contigs changing on patch releases? That makes this a lot harder. Apologies, I thought it was just adding new alternative contigs. As an example, different chrMs continue to be a problem for hg19 even now. There is really a lot of value in "hg38" meaning something specific and right that both the Broad and 1000 genomes distributions create that coordinate space. I wish there was more specificity around the exact build and patch to make this more clear.

The NCBI resource is something could be used as a base but doesn't HLAs so we'd need to merge and massage names to have the chr* style names we're trying to use consistently for hg38.

So nothing is impossible but there are a lot of technical and community challenges we should think carefully about. Thanks again for this discussion, there is a lot to digest and think about how to approach.

matthdsm commented 7 years ago

Hi Brad, It does seem some coordinates have changed. I propose we put this idea in the fridge for a while and perhaps reevaluate in six months or something. Hopefully Broad will have updated their resource bundle by then to go with GATK4. I think neither of us currently have the room to spend a significant amount of time on this.

Thanks a lot for the discussion. Cheers M

chapmanb commented 7 years ago

Matthias -- thanks again for the helpful discussion. It makes sense to keep looking at this but agree we should wait a bit to see how the resource bundles evolve over time and handle the coordinate changes. I'll close this for now and we can re-open or propose specific tasks when we want to tackle it.