Proposed contribution task for Outreachy applicants: Link canFam6 (UCSC genome) to Dog10K_Boxer_Tasha (NCBI assembly)

hpages commented 2 years ago

This task depends on issues #43 and #44 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the map.NCBI argument of the getChromInfoFromUCSC() function. Try getChromInfoFromUCSC("hg19", map.NCBI=TRUE). See what happens? Now try getChromInfoFromUCSC("canFam6", map.NCBI=TRUE). See what the problem is? Check the documentation of the map.NCBI argument in ?getChromInfoFromUCSC to learn more about what this argument does.
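
A minimal sketch of the comparison suggested above. The helper name try_map_ncbi is made up for illustration; only getChromInfoFromUCSC() and its map.NCBI argument come from the package. It needs internet access, and the canFam6 call is only expected to fail for as long as canFam6.R has no NCBI_LINKER.

```r
library(GenomeInfoDb)

# Hypothetical helper (not part of GenomeInfoDb): try the NCBI mapping for
# one UCSC genome and report whether it worked instead of stopping on error.
try_map_ncbi <- function(genome) {
    tryCatch({
        chrominfo <- getChromInfoFromUCSC(genome, map.NCBI=TRUE)
        message(genome, ": OK, columns are ",
                paste(colnames(chrominfo), collapse=", "))
        invisible(chrominfo)
    }, error=function(e) {
        message(genome, ": failed with: ", conditionMessage(e))
        invisible(NULL)
    })
}

try_map_ncbi("hg19")     # hg19 is already linked to its NCBI assembly
try_map_ncbi("canFam6")  # the genome this task is about
```

Comparing the two messages should make it clear what a linked genome returns that an unlinked one does not.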

Linking a UCSC genome to its NCBI assembly is done by defining an NCBI_LINKER object in the registration file for the UCSC genome (canFam6.R in this case). There's some very succinct information about what NCBI_LINKER should look like in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/. Don't hesitate to look at other registration files to see examples of how NCBI_LINKER is defined.
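
One way to browse the registration files without leaving R, assuming GenomeInfoDb is installed and that the contents of inst/registered/UCSC_genomes/ end up under registered/UCSC_genomes/ in the installed package (the usual behavior for files under inst/). Whether canFam6.R is present depends on the installed version including the work from the earlier issues.

```r
# Locate the folder of registered UCSC genomes in the installed package.
reg_dir <- system.file("registered", "UCSC_genomes", package="GenomeInfoDb")

# List a few registration files, then print the README and one example file.
head(list.files(reg_dir))
readme <- file.path(reg_dir, "README.TXT")
if (file.exists(readme))
    writeLines(readLines(readme))
canFam6_file <- file.path(reg_dir, "canFam6.R")
if (file.exists(canFam6_file))
    writeLines(readLines(canFam6_file))
```

Swapping "canFam6.R" for another file name is a quick way to see how other genomes define NCBI_LINKER.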

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
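
For the ad hoc manual testing step described above, here is one possible check, sketched under assumptions: the helper count_unmapped is hypothetical, and it assumes that the columns added by map.NCBI=TRUE are the ones whose names start with "NCBI." and that they are all NA for a sequence with no NCBI counterpart. The toy data frame only stands in for a real result.

```r
# Hypothetical helper (not part of GenomeInfoDb): count the rows where every
# NCBI-derived column is NA, i.e. sequences with no NCBI counterpart.
# Assumption: NCBI-derived columns are those named "NCBI.<something>".
count_unmapped <- function(chrominfo) {
    ncbi_cols <- grep("^NCBI\\.", colnames(chrominfo), value=TRUE)
    if (length(ncbi_cols) == 0L)
        stop("no NCBI.* columns found; was map.NCBI=TRUE used?")
    all_na <- rowSums(!is.na(chrominfo[ , ncbi_cols, drop=FALSE])) == 0L
    sum(all_na)
}

# Toy input (made-up values) with one mapped and one unmapped sequence.
toy <- data.frame(chrom=c("chr1", "chrUn_x"),
                  size=c(100L, 50L),
                  NCBI.SequenceName=c("1", NA),
                  NCBI.SequenceLength=c(100L, NA))
count_unmapped(toy)
```

After installing the modified package from your fork, the same helper could be applied to getChromInfoFromUCSC("canFam6", map.NCBI=TRUE) to see how many sequences were left unmapped.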

Priceless-P commented 2 years ago

@hpages I’d love to work on this issue as well. Please assign it to me.

I thought I commented on the wrong issue earlier. That’s why I deleted my first comment.

hpages commented 2 years ago

Done.

Again, I'm sorry that very little information is provided in the README.TXT file located in GenomeInfoDb/inst/registered/UCSC_genomes/ about how to define NCBI_LINKER. I'm going to try to improve that.

Priceless-P commented 2 years ago

It's no problem. I think I figured it out. Please take a look at my PR and tell me if I did it correctly.

hpages commented 2 years ago

Excellent. You nailed it again! PR #54 merged.

In this case it looks like there's a clean mapping between the sequences in canFam6 and those in Dog10K_Boxer_Tasha. This keeps NCBI_LINKER relatively simple. Some mappings are more tricky and can require a lot of try/fail/fix cycles before getting NCBI_LINKER right. Look for example at NCBI_LINKER in hg18.R. The mapping between the sequences in hg18 and those in the associated NCBI assembly NCBI36 is tricky. Some sequences in the former are not even mapped to sequences in the latter!
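
To compare the two definitions side by side, something like the following might work. It is only a sketch: read_ncbi_linker is a hypothetical helper, and it assumes that a registration file can be evaluated in an environment inheriting from the GenomeInfoDb namespace and that it defines NCBI_LINKER at top level.

```r
# Hypothetical helper (not part of GenomeInfoDb): evaluate the registration
# file of a UCSC genome and return the NCBI_LINKER it defines (NULL if none).
read_ncbi_linker <- function(genome) {
    path <- system.file("registered", "UCSC_genomes", paste0(genome, ".R"),
                        package="GenomeInfoDb")
    if (!nzchar(path))
        stop("no registration file found for ", genome)
    # Inherit from the package namespace in case the file uses internals.
    env <- new.env(parent=asNamespace("GenomeInfoDb"))
    sys.source(path, envir=env)
    get0("NCBI_LINKER", envir=env, inherits=FALSE)
}

if (requireNamespace("GenomeInfoDb", quietly=TRUE)) {
    str(read_ncbi_linker("canFam6"))  # the clean mapping
    str(read_ncbi_linker("hg18"))     # the tricky one
}
```

The canFam6 call assumes the installed version of the package already contains canFam6.R with the merged change.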

Next task in your group is issue #55. Whenever you are ready, go there and ask me to assign you.

Also don't forget to record your contributions on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/

Priceless-P commented 2 years ago

Now that you mention it, I checked it out, and it's definitely more complex than canFam6.

I see that chr22_h2_hap1 and chr6_cox_hap2 are not mapped.

Perhaps we could fix that 🤔

hpages commented 2 years ago

I don't know. Last time I checked (this was a long time ago), it didn't seem that the chr5_h2_hap1 and chr6_qbl_hap2 sequences in hg18 could be mapped to sequences in NCBI36. In this context, "mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.

The lengths of those sequences are:

library(GenomeInfoDb)
hg18_chrominfo <- getChromInfoFromUCSC("hg18")
colnames(hg18_chrominfo)
# [1] "chrom"     "size"      "assembled" "circular" 
subset(hg18_chrominfo, chrom %in% c("chr5_h2_hap1", "chr6_qbl_hap2"))
#            chrom    size assembled circular
# 26  chr5_h2_hap1 1794870     FALSE    FALSE
# 28 chr6_qbl_hap2 4565931     FALSE    FALSE

But no sequences in NCBI36 have these lengths:

NCBI36_chrominfo <- getChromInfoFromNCBI("NCBI36")
colnames(NCBI36_chrominfo)
#  [1] "SequenceName"     "SequenceRole"     "AssignedMolecule" "GenBankAccn"     
#  [5] "Relationship"     "RefSeqAccn"       "AssemblyUnit"     "SequenceLength"  
#  [9] "UCSCStyleName"    "circular"        
c(1794870, 4565931) %in% NCBI36_chrominfo$SequenceLength
# [1] FALSE FALSE

So here you go: there's no sequence in the NCBI36 assembly that corresponds to the chr5_h2_hap1 or chr6_qbl_hap2 sequence in hg18. It seems to me that the UCSC people based hg18 on NCBI36, but decided to add those 2 sequences to it (which they took from somewhere else).
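
The same length check can be generalized to every sequence in a UCSC genome. This is a rough sketch: find_length_orphans is a hypothetical helper, and the toy data frames below reuse the two lengths shown above plus one made-up sequence. Keep in mind that a matching length is necessary but not sufficient for two sequences to be the same.

```r
# Hypothetical helper (not part of GenomeInfoDb): return the UCSC sequences
# whose length is not found among the lengths of the NCBI sequences.
# Expects the column names shown above ("chrom"/"size", "SequenceLength").
find_length_orphans <- function(ucsc_chrominfo, ncbi_chrominfo) {
    has_match <- ucsc_chrominfo$size %in% ncbi_chrominfo$SequenceLength
    ucsc_chrominfo[!has_match, c("chrom", "size"), drop=FALSE]
}

# Toy inputs: "chrToy"/"toy" are made up, the other two rows come from the
# hg18 output above.
toy_ucsc <- data.frame(chrom=c("chrToy", "chr5_h2_hap1", "chr6_qbl_hap2"),
                       size=c(1000, 1794870, 4565931))
toy_ncbi <- data.frame(SequenceName="toy", SequenceLength=1000)
orphans <- find_length_orphans(toy_ucsc, toy_ncbi)
orphans
```

With the real data, the call would be find_length_orphans(hg18_chrominfo, NCBI36_chrominfo).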

All this to say that I don't think there's much we can do about it. The mapping of hg18 to NCBI36 is what it is, ... messy! 🤷‍♂️

Priceless-P commented 2 years ago

> "Mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.

Oh. I misunderstood that.

> All this to say that I don't think there's much we can do about it. The mapping of hg18 to NCBI36 is what it is, ... messy! 🤷‍♂️

Okay😊

Priceless-P commented 2 years ago

I also wanted to add that I’m more than willing to work on any issue at all. Just assign me and I’ll get to work immediately. 😊

hpages commented 2 years ago

Thanks for the offer! I will think about finding real issues for you, that is, issues opened by Bioconductor users, in addition to the scripted issues that I specifically prepared for the Outreachy contribution period.
