Closed hpages closed 2 years ago
Hello, I would like to work on this task
Hi @Linjelium11, these tasks are for Outreachy applicants. Are you an Outreachy applicant? If so, please contact me using my contact information provided on the Outreachy site and introduce yourself. Looking forward to hear from you. Thank you!
Hello, this is Iman again. I would like to be assigned to this task.
@Iman-L This task is part of the "Cat" group. You'll be working on the "Gorilla" group. But first see the email I'm about to send you. Thanks
Hello @hpages could I please be assigned this task?
Done. I'm ready to answer your questions, don't hesitate!
Awesome!
Hi @hpages. I am (tentatively) done with this task. I ran into a few issues, at least regarding unexpected output, but thankfully, I was able to refer to your conversation with @Priceless-P regarding issue #44. Turns out, I was facing the same exact problem and it helped. (Prisca has also assisted me tremendously via Slack). Everything prior to testing the package is as expected (via command line). Running R CMD build
worked fine, but R CMD check GenomeInfoDb_1.33.13.tar.gz --no-tests
brings about this output
* checking examples ... ERROR
Running examples in ‘GenomeInfoDb-Ex.R’ failed
The error most likely occurred in:
> ### Name: seqlevelsStyle
> ### Title: Conveniently rename the seqlevels of an object according to a
> ### given style
> ### Aliases: seqlevelsStyle seqlevelsStyle<- seqlevelsStyle,ANY-method
> ### seqlevelsStyle<-,ANY-method seqlevelsStyle,Seqinfo-method
> ### seqlevelsStyle<-,Seqinfo-method seqlevelsStyle,character-method
> ### seqlevelsStyle<-,character-method genomeStyles extractSeqlevels
> ### extractSeqlevelsByGroup mapSeqlevels seqlevelsInGroup
>
> ### ** Examples
>
> ## ---------------------------------------------------------------------
> ## seqlevelsStyle() getter and setter
> ## ---------------------------------------------------------------------
>
> ## On a character vector:
> x <- paste0("chr", 1:5)
> seqlevelsStyle(x)
[1] "UCSC"
> seqlevelsStyle(x) <- "NCBI"
> x
[1] "1" "2" "3" "4" "5"
>
> ## On a GRanges object:
> library(GenomicRanges)
> gr <- GRanges(rep(c("chr2", "chr3", "chrM"), 2), IRanges(1:6, 10))
>
> seqlevelsStyle(gr)
[1] "UCSC"
> seqlevelsStyle(gr) <- "NCBI"
> gr
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 2 1-10 *
[2] 3 2-10 *
[3] MT 3-10 *
[4] 2 4-10 *
[5] 3 5-10 *
[6] MT 6-10 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
>
> seqlevelsStyle(gr)
[1] "NCBI" "Ensembl"
> seqlevelsStyle(gr) <- "dbSNP"
> gr
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] ch2 1-10 *
[2] ch3 2-10 *
[3] chMT 3-10 *
[4] ch2 4-10 *
[5] ch3 5-10 *
[6] chMT 6-10 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
>
> seqlevelsStyle(gr)
[1] "dbSNP"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 1-10 *
[2] chr3 2-10 *
[3] chrM 3-10 *
[4] chr2 4-10 *
[5] chr3 5-10 *
[6] chrM 6-10 *
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
>
> ## In general the seqlevelsStyle() setter doesn't know how to rename
> ## scaffolds. However, if the genome is specified, it's very likely
> ## that seqlevelsStyle() will be able to take advantage of that:
> gr <- GRanges(rep(c("2", "Y", "Hs6_111610_36"), 2), IRanges(1:6, 10))
> genome(gr) <- "NCBI36"
> seqlevelsStyle(gr) <- "UCSC"
> gr
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 1-10 *
[2] chrY 2-10 *
[3] chr6_cox_hap1 3-10 *
[4] chr2 4-10 *
[5] chrY 5-10 *
[6] chr6_cox_hap1 6-10 *
-------
seqinfo: 3 sequences from hg18 genome; no seqlengths
>
> ## On a Seqinfo object:
> si <- si0 <- Seqinfo(genome="apiMel2")
> si
Seqinfo object with 17 sequences from apiMel2 genome:
seqnames seqlengths isCircular genome
Group1 25433307 FALSE apiMel2
Group2 14233499 FALSE apiMel2
Group3 11790680 FALSE apiMel2
Group4 10665743 FALSE apiMel2
Group5 13312070 FALSE apiMel2
... ... ... ...
Group13 8846036 FALSE apiMel2
Group14 8017760 FALSE apiMel2
Group15 7326077 FALSE apiMel2
Group16 5608962 FALSE apiMel2
GroupUn 321143811 FALSE apiMel2
> seqlevelsStyle(si) <- "NCBI"
Warning in (function (seqlevels, genome, new_style) :
cannot switch some of apiMel2's seqlevels from UCSC to NCBI style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
seqnames seqlengths isCircular genome
LG1 25433307 FALSE Amel_2.0
LG2 14233499 FALSE Amel_2.0
LG3 11790680 FALSE Amel_2.0
LG4 10665743 FALSE Amel_2.0
LG5 13312070 FALSE Amel_2.0
... ... ... ...
LG13 8846036 FALSE Amel_2.0
LG14 8017760 FALSE Amel_2.0
LG15 7326077 FALSE Amel_2.0
LG16 5608962 FALSE Amel_2.0
GroupUn 321143811 FALSE apiMel2
> seqlevelsStyle(si) <- "RefSeq"
Warning in (function (seqlevels, genome, new_style) :
cannot switch apiMel2's seqlevels from UCSC to RefSeq style
> si
Seqinfo object with 17 sequences from 2 genomes (Amel_2.0, apiMel2):
seqnames seqlengths isCircular genome
NC_007070.1 25433307 FALSE Amel_2.0
NC_007071.1 14233499 FALSE Amel_2.0
NC_007072.1 11790680 FALSE Amel_2.0
NC_007073.1 10665743 FALSE Amel_2.0
NC_007074.1 13312070 FALSE Amel_2.0
... ... ... ...
NC_007082.1 8846036 FALSE Amel_2.0
NC_007083.1 8017760 FALSE Amel_2.0
NC_007084.1 7326077 FALSE Amel_2.0
NC_007085.1 5608962 FALSE Amel_2.0
GroupUn 321143811 FALSE apiMel2
> seqlevelsStyle(si) <- "UCSC"
> stopifnot(identical(si0, si))
>
> si <- si0 <- Seqinfo(genome="WBcel235")
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
seqnames seqlengths isCircular genome
I 15072434 FALSE WBcel235
II 15279421 FALSE WBcel235
III 13783801 FALSE WBcel235
IV 17493829 FALSE WBcel235
V 20924180 FALSE WBcel235
X 17718942 FALSE WBcel235
MT 13794 TRUE WBcel235
> seqlevelsStyle(si) <- "UCSC"
> si
Seqinfo object with 7 sequences (1 circular) from ce11 genome:
seqnames seqlengths isCircular genome
chrI 15072434 FALSE ce11
chrII 15279421 FALSE ce11
chrIII 13783801 FALSE ce11
chrIV 17493829 FALSE ce11
chrV 20924180 FALSE ce11
chrX 17718942 FALSE ce11
chrM 13794 TRUE ce11
> seqlevelsStyle(si) <- "RefSeq"
> si
Seqinfo object with 7 sequences (1 circular) from WBcel235 genome:
seqnames seqlengths isCircular genome
NC_003279.8 15072434 FALSE WBcel235
NC_003280.10 15279421 FALSE WBcel235
NC_003281.10 13783801 FALSE WBcel235
NC_003282.8 17493829 FALSE WBcel235
NC_003283.11 20924180 FALSE WBcel235
NC_003284.9 17718942 FALSE WBcel235
NC_001328.1 13794 TRUE WBcel235
> seqlevelsStyle(si) <- "NCBI"
> stopifnot(identical(si0, si))
>
> si <- Seqinfo(genome="macFas5")
> si
Seqinfo object with 7601 sequences (1 circular) from macFas5 genome:
seqnames seqlengths isCircular genome
chr1 227556264 FALSE macFas5
chr2 192460366 FALSE macFas5
chr3 192294377 FALSE macFas5
chr4 170955103 FALSE macFas5
chr5 189454096 FALSE macFas5
... ... ... ...
chrUn_KE147191 5581 FALSE macFas5
chrUn_KE147192 21004 FALSE macFas5
chrUn_KE147193 5184 FALSE macFas5
chrUn_KE147194 8082 FALSE macFas5
chrUn_KE147195 7181 FALSE macFas5
> seqlevelsStyle(si) <- "NCBI"
Error in download.file(url, destfile, quiet = TRUE) :
cannot open URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/364/345/GCA_000364345.1_Macaca_fascicularis_5.0/GCA_000364345.1_Macaca_fascicularis_5.0_assembly_report.txt'
Calls: seqlevelsStyle<- ... .fetch_assembly_report_from_url -> fetch_table_from_url
Execution halted
This is what my script currently looks like
ORGANISM <- "Felis catus"
### List of assemblies by date.
ASSEMBLIES <- list(
list(assembly="Felis_catus_9.0",
date="2017/11/20",
extra_info=c(breed="Abyssinian"),
assembly_accession="GCF_000181335.3", # felCat9
circ_seqs=c("MT"))
)
Thank you
@kakopo Great to hear that the discussion in issue #44 helped you, and that Prisca was able to assist you via Slack!
To me, your Felis_catus.R
file looks good. I don't know why you're getting this error in the examples of the seqlevelsStyle
man page. Looks like some connectivity issue with the NCBI FTP site (ftp.ncbi.nlm.nih.gov). Probably a temporary thing.
Try to run the examples in an interactive R session by loading the GenomeInfoDb package and doing:
example(seqlevelsStyle)
This will run all the examples in the seqlevelsStyle
man page (?seqlevelsStyle
). Hopefully that works.
Then try R CMD check --no-tests
again and maybe this time you'll have better luck. Yeah, examples and unit tests that access the internet can fail sometimes, in unpredictable ways. :disappointed:
If after a couple of attempts, you still run into this problem, then ignore.
I would just avoid wrapping "MT"
inside c()
, because c("MT")
is the same as "M"
(try identical(c("MT"), "M")
), so you don't need it (it's actually not considered good R programming style).
Then add the file to git, commit, push, and submit a PR.
Thanks!
example(seqlevelsStyle)
worked instantenously. Thank you. I hadn't realised I'd left the c("MT") part of the code - I blindly referenced another script and had no idea I had kept it in. Definitely a lesson in being more intentional!
Otherwise, my curiosity lies in the fact that I noticed the fact that NCBI has 2 versions of the Felis catus assembly, with the more recent version containing a slightly updated GenBank assembly accession (but no RefSeq assembly accession). I used the older version, but what difference would using the newer one make, given that it seems incomplete? I've also been meaning to ask what exactly circ_seqs
means within the code.
I've just submitted the PR. Thanks for everything!
Hi @kakopo
example(seqlevelsStyle) worked instantenously.
Excellent!
I've also been meaning to ask what exactly circ_seqs means within the code.
circ_seqs
stands for "circular sequences". All the sequences in a genome assembly are DNA sequences. Most of the time, DNA molecules are linear, but sometimes they are circular. For example, the autosomes in Human (that is, chromosomes 1-22, X, and Y) are linear DNA molecules. But the mitochondrial chromosome, which is a very small DNA molecule compared to the autosomes, is circular. Scaffolds are always considered linear.
For Cat, it's going to be the same: all the sequences in the assembly are linear, except for MT
. We can check that MT
is circular as follow:
GCF_000181335.3
and follow the link to the "Full sequence report" on the right of the page. See my comment here about what the "Full sequence report" is about.MT
in the report (tip: it's usually at the very bottom of the file).RefSeq-Accn
column).LOCUS
line. It says the following:
LOCUS NC_001700 17009 bp DNA circular MAM 07-JUN-2021
This tells us that this sequence is a DNA sequence of length 17009 bp ("bp" stands for base pairs), and that it's a circular sequence.
ORIGIN
section shows the sequence itself:
ORIGIN
1 ggactaatga ctaatcagcc catgatcaca cataactgtg gtgtcatgca tttggtattt
61 tttattttta gggggtcgaa cttgctatga ctcagctatg acctaaaggt cctgactcag
121 tcaaatatat tgttgctggg cttattctct atgcggggtc tccacacgca cagacagtca
...
...
...
16861 cttgctcgaa cgtgtacctc ttctcgctcc gggcccattt caacgtgggg gtttctataa
16921 cggaactata cctggcatct ggttcttact tcagggccat aaaatccttg aaaccaatcc
16981 ttcagttctc tcaaatggga catctcgat
So now we have confirmation that the MT
sequence in the Felis_catus_9.0
assembly is circular. Note that we've also marked the mitochondrial chromosome in UCSC genome felCat9
as circular (we'd better be consistent! :wink: ), by putting the following line in felCat9.R
:
CIRC_SEQS <- "chrM"
chrM
in felCat9
is the same sequence as MT
in Felis_catus_9.0
, only the names are different.
See https://en.wikipedia.org/wiki/Mitochondrial_DNA if you are interested to learn more about Mitochondrial DNA.
The two most important things to remember are:
Felis_catus_9.0
!).Felis_catus_9.0
), and it's circular. I noticed the fact that NCBI has 2 versions of the Felis catus assembly
I suppose you mean they have 2 assemblies named Felis_catus_9.0
. Good catch!
The fact that they have several assemblies for the same organism (Felis catus in this case) is not a problem and is pretty common. The NCBI assembly page for GCF_000181335.3
has this blue box that says that there are 5 assemblies for this organism:
If you click on the link in that box, you will see the list of all NCBI assemblies for Felis catus.
But what's unusual and confusing here is that 2 of those assemblies are named Felis_catus_9.0
. I honestly don't know why. Let's remember that NCBI is just a huge centralized repository where people can submit assemblies (these are usually the people working on the assemblies). I guess people are free to name the assembly as they want. However, a given submission will always have a unique GenBank or RefSeq assembly accession.
The lesson here is that assembly names are unreliable and messy. The only thing that we should rely on to identify an NCBI assembly unambiguously is a GenBank or RefSeq assembly accession.
In this particular case, we deliberately choose to register the Felis_catus_9.0
assembly associated with GCF_000181335.3
, and not the other one, because this is the assembly that UCSC genome felCat9
is based on.
All is good with your PR. Just a very minor cosmetic detail. If you look at the Files changed, it shows a little red circle below the last line of the file. If you move the mouse over it, a little popup will say what the problem is. Can you please address that?
Thanks
Quite interesting! I've just done a bit of light reading on the difference between circular and linear chromosomes and everything is definitely making much more sense. This entire field is quite complex, and quite interesting to dive into. Thank you for answering my questions, and for the resources, always. I had made changes to get rid of the red circle. I am unsure as to why it is not reflecting as such, especially since I added the new line locally and pushed once more. My most recent commit is from 2 hours ago. Is this the current file on your end?
Looks good on my end. See https://github.com/Bioconductor/GenomeInfoDb/pull/64/files
PR #64 merged, thanks @kakopo
Next task in your group is issue #51. Whenever you are ready, go there and ask to be assigned.
Awesome, thank you as well
Felis_catus_9.0
is a Cat assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000181335.3/Note that
Felis_catus_9.0
is the assembly thatfelCat9
, the latest UCSC genome for Cat, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Cat in the UCSC species tree on the left, click on it, then make sure to select the latest Cat Assembly (
felCat9
). This will display a bunch of additional information about thefelCat9
assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*
) or RefSeq (GCF_000*.*
) accession number.Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The
registered_NCBI_assemblies()
function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is thatgetChromInfoFromNCBI()
still works on an unregistered assembly, but in "degraded" mode, that is:NA
s instead ofFALSE
s in thecircular
column of the returned data.frame.Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb,
getChromInfoFromNCBI()
will recognize its name and return accurate circularity flags.See
?getChromInfoFromNCBI
(after loading GenomeInfoDb) for more information.Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in
GenomeInfoDb/inst/registered/NCBI_assemblies/
. If this is a new organism, then we need to start a new file. See the other files for the naming scheme: the name of the file must be the full scientific name of the organism, with the underscore used as separator, and with the first letter capitalized. File extension must be.R
.IMPORTANT NOTES TO OUTREACHY APPLICANTS:
R CMD build
andR CMD check
on the package. Note thatR CMD check
should always be run on the source tarball produced byR CMD build
.R CMD check
might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!