Closed hpages closed 2 years ago
@hpages I’d love to work on this issue as well. Please assign it to me.
I thought I commented on the wrong issue earlier. That’s why I deleted my first comment.
Done.
Again, very little information is provided in the README.TXT
file located in GenomeInfoDb/inst/registered/UCSC_genomes/
about how to define NCBI_LINKER
, I'm sorry. I'm going to try to improve that.
It's no problem. I think I figured it out. Please take a look at my PR and tell me if I did it correctly.
Excellent. You nailed it again! PR #54 merged.
In this case it looks like there's a clean mapping between the sequences in canFam6
and those in Dog10K_Boxer_Tasha
. This keeps NCBI_LINKER
relatively simple. Some mappings are more tricky and can require a lot of try/fail/fix cycles before getting NCBI_LINKER
right. Look for example at NCBI_LINKER
in hg18.R
. The mapping between the sequences in hg18
and those in the associated NCBI assembly NCBI36
is tricky. Some sequences in the former are not even mapped to sequences in the latter!
Next task in your group is issue #55. Whenever you are ready, go there and ask me to assign you.
Also don't forget to record your contributions on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/
Now that you mentioned it, I checked it out. And it’s definitely more complex than compared to canFam6
I see that chr22_h2_hap1
and chr6_cox_hap2
.
Perhaps we could fix that 🤔
I don't know. Last time I checked (this was a long time ago), it didn't seem that the chr5_h2_hap1
and chr6_qbl_hap2
sequences in hg18
could be mapped to sequences in NCBI36
. In this context, "mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.
The lengths of those sequences are:
library(GenomeInfoDb)
hg18_chrominfo <- getChromInfoFromUCSC("hg18")
colnames(hg18_chrominfo)
# [1] "chrom" "size" "assembled" "circular"
subset(hg18_chrominfo, chrom %in% c("chr5_h2_hap1", "chr6_qbl_hap2"))
# chrom size assembled circular
# 26 chr5_h2_hap1 1794870 FALSE FALSE
# 28 chr6_qbl_hap2 4565931 FALSE FALSE
But no sequences in NCBI36
have these lengths:
NCBI36_chrominfo <- getChromInfoFromNCBI("NCBI36")
colnames(NCBI36_chrominfo)
# [1] "SequenceName" "SequenceRole" "AssignedMolecule" "GenBankAccn"
# [5] "Relationship" "RefSeqAccn" "AssemblyUnit" "SequenceLength"
# [9] "UCSCStyleName" "circular"
c(1794870, 4565931) %in% NCBI36_chrominfo$SequenceLength
# [1] FALSE FALSE
So here you go: there's no sequence in the NCBI36
assembly that corresponds to the chr5_h2_hap1
or chr6_qbl_hap2
sequence in hg18
. It seems to me that the UCSC people based hg18
on NCBI36
, but decided to add those 2 sequences to it (which they took from somewhere else).
All this to say that I don't think there's much we can do about it. The mapping of hg18
to NCBI36
is what it is, ... messy! :man_shrugging:
"Mapped" means that the DNA sequences are the same, only their names differ, so this implies that the sequence lengths are also the same.
Oh. I misunderstood that.
All this to say that I don't think there's much we can do about it. The mapping of hg18 to NCBI36 is what it is, ... messy! 🤷♂️
Okay😊
I also wanted to add that I’m more than willing to work on any issue at all. Just assign me and I’ll get to work immediately. 😊
Thanks for the offer! I will think about finding real issues for you, that is, issues opened by Bioconductor users, in addition to the scripted issues that I specifically prepared for the Outreachy contribution period.
This task depends on issues #43 and #44 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.
The purpose of "linking" a UCSC genome to the NCBI assembly that it is based on, is to support the
map.NCBI
argument of thegetChromInfoFromUCSC()
function. TrygetChromInfoFromUCSC("hg19", map.NCBI=TRUE)
. See what happens? Now trygetChromInfoFromUCSC("canFam6", map.NCBI=TRUE)
. See what the problem is? Check the documentation of themap.NCBI
argument in?getChromInfoFromUCSC
to learn more about what this argument does.Linking a UCSC genome to its NCBI assembly is done by defining an
NCBI_LINKER
object in the registration file for the UCSC genome (canFam6.R
in this case). There's some very succinct information about whatNCBI_LINKER
should look like in theREADME.TXT
file located inGenomeInfoDb/inst/registered/UCSC_genomes/
. Don't hesitate to look at other registration files to see examples of howNCBI_LINKER
is defined.IMPORTANT NOTES TO OUTREACHY APPLICANTS:
R CMD build
andR CMD check
on the package. Note thatR CMD check
should always be run on the source tarball produced byR CMD build
.R CMD check
might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!