Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style
https://bioconductor.org/packages/GenomeInfoDb
31 stars 13 forks source link

Proposed contribution task for Outreachy applicants: Link canFam6 (UCSC genome) to Dog10K_Boxer_Tasha (NCBI assembly) #45

Closed hpages closed 2 years ago

hpages commented 2 years ago

This task depends on issues #43 and #44 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the map.NCBI argument of the getChromInfoFromUCSC() function. Try getChromInfoFromUCSC("hg19", map.NCBI=TRUE). See what happens? Now try getChromInfoFromUCSC("canFam6", map.NCBI=TRUE). See what the problem is? Check the documentation of the map.NCBI argument in ?getChromInfoFromUCSC to learn more about what this argument does.

Linking a UCSC genome to its NCBI assembly is done by defining an NCBI_LINKER object in the registration file for the UCSC genome (canFam6.R in this case). There's some very succinct information about what NCBI_LINKER should look like in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/. Don't hesitate to look at other registration files to see examples of how NCBI_LINKER is defined.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Priceless-P commented 2 years ago

@hpages I’d love to work on this issue as well. Please assign it to me.

I thought I commented on the wrong issue earlier. That’s why I deleted my first comment.

hpages commented 2 years ago

Done.

Again, very little information is provided in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/ about how to define NCBI_LINKER, I'm sorry. I'm going to try to improve that.

Priceless-P commented 2 years ago

It's no problem. I think I figured it out. Please take a look at my PR and tell me if I did it correctly.

hpages commented 2 years ago

Excellent. You nailed it again! PR #54 merged.

In this case it looks like there's a clean mapping between the sequences in canFam6 and those in Dog10K_Boxer_Tasha. This keeps NCBI_LINKER relatively simple. Some mappings are more tricky and can require a lot of try/fail/fix cycles before getting NCBI_LINKER right. Look for example at NCBI_LINKER in hg18.R. The mapping between the sequences in hg18 and those in the associated NCBI assembly NCBI36 is tricky. Some sequences in the former are not even mapped to sequences in the latter!

Next task in your group is issue #55. Whenever you are ready, go there and ask me to assign you.

Also don't forget to record your contributions on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/

Priceless-P commented 2 years ago

Now that you mentioned it, I checked it out. And it’s definitely more complex than compared to canFam6

I see that chr22_h2_hap1 and chr6_cox_hap2.

Perhaps we could fix that 🤔

hpages commented 2 years ago

I don't know. Last time I checked (this was a long time ago), it didn't seem that the chr5_h2_hap1 and chr6_qbl_hap2 sequences in hg18 could be mapped to sequences in NCBI36. In this context, "mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.

The lengths of those sequences are:

library(GenomeInfoDb)
hg18_chrominfo <- getChromInfoFromUCSC("hg18")
colnames(hg18_chrominfo)
# [1] "chrom"     "size"      "assembled" "circular" 
subset(hg18_chrominfo, chrom %in% c("chr5_h2_hap1", "chr6_qbl_hap2"))
#            chrom    size assembled circular
# 26  chr5_h2_hap1 1794870     FALSE    FALSE
# 28 chr6_qbl_hap2 4565931     FALSE    FALSE

But no sequences in NCBI36 have these lengths:

NCBI36_chrominfo <- getChromInfoFromNCBI("NCBI36")
colnames(NCBI36_chrominfo)
#  [1] "SequenceName"     "SequenceRole"     "AssignedMolecule" "GenBankAccn"     
#  [5] "Relationship"     "RefSeqAccn"       "AssemblyUnit"     "SequenceLength"  
#  [9] "UCSCStyleName"    "circular"        
c(1794870, 4565931) %in% NCBI36_chrominfo$SequenceLength
# [1] FALSE FALSE

So here you go: there's no sequence in the NCBI36 assembly that corresponds to the chr5_h2_hap1 or chr6_qbl_hap2 sequence in hg18. It seems to me that the UCSC people based hg18 on NCBI36, but decided to add those 2 sequences to it (which they took from somewhere else).

All this to say that I don't think there's much we can do about it. The mapping of hg18 to NCBI36 is what it is, ... messy! :man_shrugging:

Priceless-P commented 2 years ago

"Mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.

Oh. I misunderstood that.

All this to say that I don't think there's much we can do about it. The mapping of hg18 to NCBI36 is what it is, ... messy! 🤷‍♂️

Okay😊

Priceless-P commented 2 years ago

I also wanted to add that I’m more than willing to work on any issue at all. Just assign me and I’ll get to work immediately. 😊

hpages commented 2 years ago

Thanks for the offer! I will think about finding real issues for you, that is, issues opened by Bioconductor users, in addition to the scripted issues that I specifically prepared for the Outreachy contribution period.