berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License

Add Assembly 22 as a reference genome #7

Open vladimirg opened 9 years ago

vladimirg commented 9 years ago

We should also consider a general mechanism for uploading assemblies with their hapmaps (which is the way that Assembly 22 is delivered, and presumably every future assembly as well).

darrenabbey commented 9 years ago

Some specific issues involved in this are:

1) User specification or automatic determination of homolog pairs.
2) Automatic alignment of homolog pair sequences.
   * There are significant sequence differences between homologs, sufficient that a simple per-base comparison would introduce lots of errors.
3) Automatic inference of haplotype map based on differences between aligned homolog sequences.
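As a rough illustration of point 3, here is a minimal sketch of reading hapmap entries off an already-aligned homolog pair. It assumes the alignment from point 2 is available as two gapped strings of equal length (with `-` for gaps); the function name `hapmap_from_alignment` and the tuple layout are hypothetical and not part of YMAP.

```python
def hapmap_from_alignment(aligned_a, aligned_b):
    """Return (position_in_a, base_a, base_b) for each mismatch column.

    Both inputs are gapped alignment rows of equal length ('-' marks a gap).
    Positions are 1-based coordinates on homolog A. Columns with a gap in
    either row are skipped, so indels do not shift later comparisons.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    entries = []
    pos_a = 0  # number of homolog A bases consumed so far
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a != "-":
            pos_a += 1
        if base_a == "-" or base_b == "-":
            continue  # indel column: not a SNP
        if base_a != base_b:
            entries.append((pos_a, base_a, base_b))
    return entries
```

For example, `hapmap_from_alignment("ACG-TAC", "ACGGTTC")` reports a single difference at position 5 of homolog A, whereas comparing the ungapped sequences base by base would flag every position after the insertion.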

vladimirg commented 8 years ago

@darrenabbey, now that I've learned a little more about sequence alignments, this does seem like both a pain and not very useful.

Technically, one idea to generate the automatic hapmaps is to use haplotype A as the reference, simulate a FASTQ made up from haplotype B, SNP call that, and use that as a hapmap. This should be reasonable in high-complexity regions, right?
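To make the first step concrete, here is a minimal sketch of cutting haplotype B into overlapping, error-free reads in FASTQ format. The function name, the read length, the step size, and the constant quality character are all illustrative assumptions; a real read simulator would also model sequencing errors and paired ends, and the alignment and SNP-calling steps are not shown.

```python
def simulate_fastq_records(sequence, read_length=100, step=10, prefix="hapB"):
    """Tile `sequence` with overlapping reads and return FASTQ records.

    Reads are error-free copies of the sequence, each given a constant
    high quality ('I' is Phred 40 in the Phred+33 encoding). Read names
    carry the 1-based start position on the source sequence.
    """
    records = []
    # Last start position that still yields a full-length read
    # (0 if the sequence is shorter than one read).
    last_start = max(len(sequence) - read_length, 0)
    for start in range(0, last_start + 1, step):
        read = sequence[start:start + read_length]
        quality = "I" * len(read)
        records.append(f"@{prefix}_{start + 1}\n{read}\n+\n{quality}\n")
    return records
```

Joining the records with `"".join(...)` gives the text of a FASTQ file that could then be aligned against haplotype A. Note that with an arbitrary `step` the last few bases of the sequence may not be covered by any read.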

But even if this works, I'm wondering now if it's really useful. Since YMAP isn't expected to generate accurate SNP calls, but only to show major genomic events, any nucleotide-level inaccuracies in the reference are moot as long as the alignments are OK and the hapmap is correct, and both have withstood the test of time for A21.

So basically it seems like adding A22 won't be of much benefit. And I'm not sure that generating hapmaps from two given haplotypes is useful, at least at the moment (we've never had such requests before). So I'm considering closing this issue. What do you think?

darrenabbey commented 8 years ago

The main reason I see that it might be useful to allow input of diploid reference genomes is that such genomes are likely to become more common as time goes by. Within YMAP, for the reasons you describe, importing A22 would be more useful as an example than as an upgrade to A21. This feature would save users from having to manually reconstruct a haploid reference from any diploid reference they may be working with.

Your hypothetical approach to generate the hapmap from a diploid reference sounds quite reasonable and should be relatively simple to implement because of a feature of YMAP's hapmap-generation tool that you might not have used. If you have sequence datasets from two haploid strains, YMAP can construct a hapmap in one step by identifying the differences between the datasets. I implemented this feature after a conversation with someone at a conference (whose name I have forgotten) who had sequenced haploid progeny after mating two unrelated haploid parents. She was trying to figure out the contribution of each parental strain to each of the progeny strains.
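The idea can be sketched roughly as follows. This is not YMAP's actual implementation; it assumes each haploid dataset has already been reduced to a dictionary mapping `(chromosome, position)` to a single base call, and the function names are hypothetical.

```python
def hapmap_from_haploid_calls(calls_a, calls_b):
    """Keep loci called in both parents where the two base calls differ.

    Each input maps (chromosome, position) -> base. The result maps the
    same kind of key to a (base_in_parent_a, base_in_parent_b) pair.
    """
    hapmap = {}
    for locus in sorted(calls_a.keys() & calls_b.keys()):
        if calls_a[locus] != calls_b[locus]:
            hapmap[locus] = (calls_a[locus], calls_b[locus])
    return hapmap


def assign_parent(hapmap, progeny_calls):
    """Label each hapmap locus in a haploid progeny as 'a' or 'b'.

    Loci that are missing in the progeny, or that match neither parent,
    are left out of the result.
    """
    origin = {}
    for locus, (allele_a, allele_b) in hapmap.items():
        call = progeny_calls.get(locus)
        if call == allele_a:
            origin[locus] = "a"
        elif call == allele_b:
            origin[locus] = "b"
    return origin
```

Runs of consecutive loci with the same label along a chromosome would then indicate which parent contributed that region of the progeny genome.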

vladimirg commented 8 years ago

Alright, keeping it open for future reference.

That someone wouldn't be Jane Usher, perchance? We did use this feature for her upcoming paper. It's pretty cool.

darrenabbey commented 8 years ago

Yes, it was! That feature was written precisely because of that project! (Though, it wasn't something we had agreed I would do as a collaboration.) I was certain it would come in handy to have once the idea came up. I had wanted to continue the conversation with her regarding the analysis, but other things (thesis, life, etc.) ended up taking my time.