Closed hpages closed 1 year ago
Hi @hpages I'd like to work on this issue please
All right, I just assigned you.
Hi @hpages. I have been able to run R CMD build
and R CMD check
on the package with no errors (& 3 NOTES). The resources you shared within the issue were enough to get me through this task. I also followed your conversation with Prisca in pull #46 and took a TONNE of pointers, including adding PkgExamples: genome$chrA1 # same as genome[["chrA1"]]
to my seed file (the edit was done for the sake of the differing chromosomes, but as she mentioned, it is a bit difficult to understand, in spite of the shared literature. I'd also love to know what it does). I retained provider_version
over genome
, because of your mention of backwards compatibility, although the system returned a warning in this regard.
On my end, running
identical(seqinfo(Fcatus), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE))
returned FALSE
, as getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE)
and seqinfo(Fcatus)
produced different results. If this isn't an error on my part, I'm definitely curious as to why this is the case. I am currently following the rest of the conversation regarding the forgeBSgenomeDataPkg()
package (which was a pretty great question Prisca asked, and the follow up response just as useful). The intro to OO programming was definitely new, I didn't even know R had support for OOP. I'll need to refresh my knowledge!
As always, I am incredibly grateful for the information and advice you share with us. I have submitted a PR for my seed file, which I hope is satisfactory.
Hi @kakopo
As promised to Presca here, I'll provide some clarifications about the PkgExamples
field soon.
I retained _providerversion over genome, because of your mention of backwards compatibility, although the system returned a warning in this regard.
You got a warning because provider_version
is deprecated in favor of genome
. Yes provider_version
is still supported for backward compatibility but all newly created BSgenome data packages should stop using it and use genome
instead.
On my end, running
identical(seqinfo(Fcatus), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE))
returned
FALSE
Interesting! We need to understand why. Can you please share the output produced by your call to forgeBSgenomeDataPkg()
? Just copy-paste it here, and put it between two triple-backtick lines so it displays nicely. This means you'll need to forge BSgenome.Fcatus.UCSC.felCat9 again but that should be a relatively fast operation.
Thanks!
Hi @hpages. I'm actually quite embarrassed, it was a mistake on my part. I forged the package once more and realized the first time I ran R CMD INSTALL
with the package name rather than the tarball. All runs fine now and the output is identical. Still, here is the output of my call to forgeBSgenomeDataPkg()
. I've replaced provider_version
with genome
as well.
Creating package in ./BSgenome.Fcatus.UCSC.felCat9
Copying '/home/dell/seqdatafile/felCat9.2bit' to './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE
Getting chrom info from UCSC with 'getChromInfoFromUCSC("felCat9")' ... DONE
Loading './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE
Sorting sequences as in 'getChromInfoFromUCSC("felCat9")' ... DONE
Checking the sequence lengths ... OK
Writing sequences to './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE
I was trying to do the same task, this time using the FASTA file instead of a 2bit file. I came across a tool called exonerate that splits the FASTA file using the command fastaexplode felCat9.fasta
. Is this an oft used tool for the role?
Hi @kakopo ,
Thanks for replacing provider_version
with genome
in your seed file.
The output of your forgeBSgenomeDataPkg()
call looks as expected. Glad to hear that seqinfo(Fcatus)
and getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE))
are now identical, as they should.
Please read thru this long comment I posted here, where I explain what the use cases are for switching between UCSC and NCBI sequence names. As you will see, it's all about uniformizing the sequence names between two datasets. This is a very common operation. The code I show in the comment is for Dog10K_Boxer_Tasha
/canFam6
, but it should be easy to adapt for Felis_catus_9.0
/felCat9
if you want to give it a try. Just to be clear, I'm talking about the code that starts with:
library(GenomicFeatures)
and that ends with (in your case it will be Fcatus
instead of Cfamiliaris
):
seqlevelsStyle(genes_ranges) <- seqlevelsStyle(Cfamiliaris) # uniformize the sequence names
Let me know if you have questions or if you run into trouble when trying to run this code.
Other than that, your seed file looks good. I still need to explain the PkgExamples
thing though, which I will do in a separate comment.
I was trying to do the same task, this time using the FASTA file instead of a 2bit file. I came across a tool called exonerate that splits the FASTA file using the command fastaexplode felCat9.fasta. Is this an oft used tool for the role?
Never heard of this tool @kakopo. It might do the job. It's just that I find the split to be easy enough to do in R so I've always used that.
Hi @hpages. I was able to run the code in the example you shared. It was a useful lesson in the why exactly one would switch between UCSC and NCBI names. The only issue I ran into was running getSeq()
gave the following error
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘getSeq’ for signature ‘"missing"’
which was not what I expected, following the code. Still,
> seqlevelsStyle(genes_ranges) <- seqlevelsStyle(Fcatus) # uniformize the sequence names
returned the correct output. The explanation for the need to switch the sequence names was very helpful, and helped me understand a lot.
I'm curious as to why running >genome(txdb) <- "Fcatus"
and > seqlevelsStyle(genes_ranges) <- seqlevelsStyle("Felis_catus_9.0")
returns the following error
Error in seqlevelsStyle("Felis_catus_9.0") :
The style does not have a compatible entry for the species supported by Seqname. Please
see genomeStyles() for supported species/style
I've been wondering why in spite of the fact that cats have both X and Y molecules in reality, a lot of the work done so far has excluded the Y chromosome. Some of the supported species in genomeStyles()
have TxDb.Mmusculus.UCSC.mm10.knownGene
which gave me hell and refused to install (it finally did when I installed the Latex dependencies). I didn't realize till now that its different from the packages we're making now. At the beginning I thought a lot of these tasks were automatic, but with the information shared, I'm really learning a lot, and I am thankful.
Also, as Prisca mentioned, using 2bit files is much, much faster than parsing through the fasta files with the forgeBSgenomeDataPkg
. I spent quite some time, as the felCat9 file has 4508 sequences. Using 2bit is definitely more effective, but are there any scenarios in which one would use the fasta files over the 2bit ones, save for preference?
Thank you!
Hi @kakopo
The only issue I ran into was running getSeq() gave the following error
Did you call getSeq()
with no arguments? You need to pass arguments to it. In our case, a BSgenome object and a GRanges object. Look how I called getSeq()
on Cfamiliaris
and genes_ranges
in the code I showed in my long comment to Presca here. Should be easy to adapt to make it work on Fcatus
.
I'm curious as to why running >genome(txdb) <- "Fcatus" and > seqlevelsStyle(genes_ranges) <- seqlevelsStyle("Felis_catus_9.0") returns the following error
What GFF file did you use to make your TxDb object? I used a GFF file located in the "FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha
in Presca's example. Oh! but unfortunately there's no link to the "FTP directory for RefSeq assembly" on the NCBI page for the Felis_catus_9.0 assembly. Ha! That's a problem :thinking:
In any case, doing genome(txdb) <- "Fcatus"
is not correct because you need to use the exact name of the NCBI assembly here, and Fcatus
is not.
Also seqlevelsStyle()
must be called on an object that contains a Seqinfo object i.e. that supports the seqinfo()
getter, like a TxDb, BSgenome, or GRanges object. Calling it on an assembly name like this:
seqlevelsStyle("Felis_catus_9.0")
is not supported.
Anyways, maybe this discussion about harmonizing sequence names is becoming too much of a distraction at this point. We'll come back to it latter.
I've been wondering why in spite of the fact that cats have both X and Y molecules in reality, a lot of the work done so far has excluded the Y chromosome.
Well, only cat males have the Y chromosome. As you can see on the NCBI page for the Felis_catus_9.0
assembly, this assembly is for the genome of a cat female.
in the earlier tasks I came across a package TxDb.Mmusculus.UCSC.mm10.knownGene which gave me hell and refused to install (it finally did when I installed the Latex dependencies)
That's weird. I've no idea why you had such a hard time installing those packages. You should have asked. AFAIK TxDb packages have nothing special and require nothing special. They certainly don't require any LaTeX dependencies to install. I'm not aware of any CRAN or Bioconductor package that depends on LaTeX to install.
are there any scenarios in which one would use the fasta files over the 2bit ones?
Yes, in some very rare situations we might want to choose FASTA over 2bit. That's because of some restrictions of the 2bit format. For example IIRC the cumulative length of all the sequences in a 2bit file cannot be more than a certain limit (2^32 I think). So if the total length of an assembly is more than that limit, then the 2bit format cannot be used. A few months ago someone was trying to forge a BSgenome data package for a Mexican axolotl assembly, and they ran into this problem. See issue #28 for the details.
Note that Mexican axolotl was the largest known animal genome until very recently. Now it seems that it's the Australian lungfish genome (14x larger than the Human genome!). See https://www.uni-wuerzburg.de/en/news-and-events/news/detail/news/the-worlds-largest-animal-genome
Oh, yes. I actually ran getSeq()
twice, once without arguments and once without, just to check what the response would be. The second time I ran getSeq(Fcatus, genes_ranges)
I got the expected output! I've seen now that its just because of the missing arguments. I try to run the code with different scenarios to see what happens. Sometimes I find missing packages or in this case, the genomeStyles()
function!
In any case, doing genome(txdb) <- "Fcatus" is not correct because you need to use the exact name of the NCBI assembly here, and Fcatus is not.
Also seqlevelsStyle() must be called on an object that contains a Seqinfo object i.e. that supports the seqinfo() getter, like a TxDb, BSgenome, or GRanges object.
I think I get it now! All along I assumed they could be interchangeable, but this is not the case. This has been quite the learning experience.
What GFF file did you use to make your TxDb object? I used a GFF file located in the "FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha in Presca's example. Oh! but unfortunately there's no link to the "FTP directory for RefSeq assembly" on the NCBI page for the Felis_catus_9.0 assembly. Ha! That's a problem
With the GFF file, I simply edited changed the link information for GFF file located in the"FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha
to match Felis_catus_9.0
so I typed this instead which seemed to do the job https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/181/335/GCF_000181335.3_Felis_catus_9.0/GCF_000181335.3_Felis_catus_9.0_genomic.gff.gz.
That's weird. I've no idea why you had such a hard time installing those packages. You should have asked
I often think that problems such as packages are too simple to bother you with! I'll definitely bring up such issues from now on. Actually, I've been trying to install the RMariaDB
package for a few days now, but I keep running into this error:
Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/RMariaDB_1.2.2.tar.gz'
Content type 'application/x-gzip' length 905003 bytes (883 KB)
==================================================
downloaded 883 KB
* installing *source* package ‘RMariaDB’ ...
** package ‘RMariaDB’ successfully unpacked and MD5 sums checked
** using staged installation
Using PKG_CFLAGS=
Using PKG_LIBS=-lmysqlclient
Using PKG_PLOGR=
-----------------------------[ ANTICONF ]-----------------------------
Configure could not find suitable mysql/mariadb client library. Try installing:
* deb: libmariadb-dev (Debian, Ubuntu)
* rpm: mariadb-connector-c-devel | mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
* csw: mysql56_dev (Solaris)
* brew: mariadb-connector-c (OSX)
If you already have a mysql client library installed, verify that either
mariadb_config or mysql_config is on your PATH. If these are unavailable
you can also set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------[ ERROR MESSAGE ]----------------------------
<stdin>:1:10: fatal error: mysql.h: No such file or directory
compilation terminated.
-----------------------------------------------------------------------
ERROR: configuration failed for package ‘RMariaDB’
* removing ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2/RMariaDB’
Warning in install.packages :
installation of package ‘RMariaDB’ had non-zero exit status
and trying to install libmariadb-dev
gave:
Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘libmariadb-dev’ is not available for this version of R
Maybe I had tried something else, but the moment I installed all the missing LaTeX packages in issue #49, TxDb.Mmusculus.UCSC.mm10.knownGene
installed afterwards.
Otherwise, I can definitely see the rationale to use FASTA files while forging BSgenome packages. Reading of a 9GB tarball file is wild! And I thought the tarball I generated, that clocked to almost 600 mbs was heavy! That 2bit indexes must be 4GB and less is definitely noted. This article was as enlightening as it was interesting. The bit comparing lungfish genetic elements to a computer virus "They multiply on their own but don't have a function. As a scientist you wonder why the 'genetic hard drive' of the lungfish has not crashed long ago given the high number of transposable elements" was amusing, and also made me realize that I had not considered the fact that all (or most) of these chromosomes had specific importance. I'm reading a bit more on this. Thank you for sharing the article, and for answering my questions!
and trying to install
libmariadb-dev
gave:Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’ (as ‘lib’ is unspecified) Warning in install.packages : package ‘libmariadb-dev’ is not available for this version of R
libmariadb-dev
is not an R package. It's a system library available as a Debian package! This error message:Configure could not find suitable mysql/mariadb client library. Try installing: * deb: libmariadb-dev (Debian, Ubuntu) * rpm: mariadb-connector-c-devel | mariadb-devel | mysql-devel (Fedora, CentOS, RHEL) * csw: mysql56_dev (Solaris) * brew: mariadb-connector-c (OSX)
shows the name of the system component that needs to be installed for various operating systems. For yours (Ubuntu), it says
* deb: libmariadb-dev (Debian, Ubuntu)
.deb
stands for Debian package. Like all Debian packages, you need to install it at the Unix command line withsudo apt-get install <package-name>
.
It's great that you were able to find a GFF file for Felis_catus_9.0
on the NCBI FTP site!
Since I thought it would be too complicated to find this file, I made up the little story below. Now that I'm finished writing it, I still share it with you, even if you don't really need this anymore (because you found the GFF file). Feel free to ignore.
Let's say that you work for a team of scientists that study the Cat genome. One day, one coworker announces that she's identified 6 new genes on this genome. She sends you an email with the following information:
Dear colleague,
Last week I identified 6 new genes on the Cat genome. I'm very excited! I used the Felis_catus_9.0 assembly
(GCF_000181335.3) for my work. Here are the genes I identified, with their positions w.r.t. the
Felis_catus_9.0 assembly:
gene name sequence start position end position
--------- -------- -------------- ------------
gene1 chrC1 300501 360880
gene2 chrC2 1420991 1485760
gene3 chrC2 1985444 2075620
gene4 chrC1_NW_019365375v1_random 935 15777
gene5 chrC2_NW_019365398v1_random 4466 22888
gene6 chrC2_NW_019365399v1_random 2801 9469
I would need the DNA sequences for those genes, which I think you should be able to obtain using
Bioconductor. Unfortunately I don't know how to use Bioconductor. Do you think you could help me
with that?
Kind regards
To help her with that, we're going to put the gene positions in a GRanges object:
library(GenomicRanges)
## We construct the GRanges manually. Note that our coworker would normally send the
## gene positions in a text file, typically a tabulated file, that we would import in R with
## read.table() and then convert to a GRanges object.
genes_ranges <- c(gene1="chrC2:300501-360880",
gene2="chrC2:1420991-1485760",
gene3="chrC2:1985444-2075620",
gene4="chrC1_NW_019365375v1_random:935-15777",
gene5="chrC2_NW_019365398v1_random:4466-22888",
gene6="chrC2_NW_019365399v1_random:2801-9469")
genes_ranges <- GRanges(genes_ranges)
genes_ranges
# GRanges object with 6 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# gene1 chrC2 300501-360880 *
# gene2 chrC2 1420991-1485760 *
# gene3 chrC2 1985444-2075620 *
# gene4 chrC1_NW_019365375v1.. 935-15777 *
# gene5 chrC2_NW_019365398v1.. 4466-22888 *
# gene6 chrC2_NW_019365399v1.. 2801-9469 *
# -------
# seqinfo: 4 sequences from an unspecified genome; no seqlengths
seqinfo(genes_ranges)
# Seqinfo object with 4 sequences from an unspecified genome; no seqlengths:
# seqnames seqlengths isCircular genome
# chrC2 NA NA <NA>
# chrC1_NW_019365375v1_random NA NA <NA>
# chrC2_NW_019365398v1_random NA NA <NA>
# chrC2_NW_019365399v1_random NA NA <NA>
We know that the gene positions are w.r.t. the Felis_catus_9.0
so let's fix the genome field above:
## Same as doing genome(seqinfo(genes_ranges)) <- "Felis_catus_9.0"
genome(genes_ranges) <- "Felis_catus_9.0"
seqinfo(genes_ranges)
# Seqinfo object with 4 sequences from Felis_catus_9.0 genome; no seqlengths:
# seqnames seqlengths isCircular genome
# chrC2 NA NA Felis_catus_9.0
# chrC1_NW_019365375v1_random NA NA Felis_catus_9.0
# chrC2_NW_019365398v1_random NA NA Felis_catus_9.0
# chrC2_NW_019365399v1_random NA NA Felis_catus_9.0
Then load BSgenome.Fcatus.UCSC.felCat9 and use getSeq()
to try to extract the gene sequences. See how I call getSeq()
in Presca's example and adapt to do the same here. Assuming you're calling getSeq()
correctly, you'll still get an error like I got the first time I called getSeq()
in Presca's example. Something about a sequence not found. That's because the sequence names in your genes_ranges
are not found in Fcatus
. You'll need to fix this by uniformizing the sequence names.
H.
See https://github.com/Bioconductor/BSgenome/pull/46#issuecomment-1291424086 for the long due explanation about PkgExamples
. Let me know if you have any questions.
Hi @kakopo ,
I just merged PR #47.
Don't miss my long due explanation about PkgExamples
: https://github.com/Bioconductor/BSgenome/pull/46#issuecomment-1291424086 And do not hesitate to ask if you have any questions.
Next task in your group is #43. It's still about Cat! :cat2: Whenever you are ready, go there and ask to be assigned.
Don't forget to record your contributions on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
Thank you.
I was able to successfully install RMariaDB! The story was much appreciated too- now I know that the GRanges objects are done manually. I have also gone through the explanation about PkgExamples
in #46 (comment) which was very insightful and helpful. I take it that this means that simply adding explanatory comments into the PkgExamples
field would still show up as valid examples in the package.Rd
file.
I take it that this means that simply adding explanatory comments into the
PkgExamples
field would still show up as valid examples in thepackage.Rd
file.
Yes. If you check the man page for your BSgenome.Fcatus.UCSC.felCat9 package, you'll see that the genome$chrA1 # same as genome[["chrA1"]]
line was injected as-is.
This task depends on this issue being completed first (i.e. PR accepted and merged, and issue closed). Although it's not a requirement that the 2 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.
BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Hsapiens.UCSC.hg19 is a BSgenome data package that contains the genomic sequences of the
hg19
genome from UCSC. Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.This task's goal is to make a new BSgenome data package for UCSC genome
felCat9
. The process of making such package is documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.Other useful links:
BSgenome issues on GitHub where many Bioconductor users who went thru the process of forging a BSgenome data package have already asked questions about this process. Note that most issues where those questions have been asked (and answered) are now closed so do not exclude closed issues from your research.
Some users have also asked questions about this process on the Bioconductor support site. See questions tagged with "BSgenome" there: https://support.bioconductor.org/tag/bsgenome/
The BSgenome.Hsapiens.UCSC.hg19 landing page: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
List of BSgenome data packages available in Bioconductor: https://bioconductor.org/packages/release/BiocViews.html#___BSgenome
https://github.com/Bioconductor/BSgenomeForge for more information about BSgenome data packages and additional links.
IMPORTANT NOTES TO OUTREACHY APPLICANTS:
R CMD build
andR CMD check
on the package. Note thatR CMD check
should always be run on the source tarball produced byR CMD build
.R CMD check
might produce some NOTEs and even some WARNINGs. Let me know if that's the case and we'll discuss them.inst/extdata/Outreachy/
folder of the BSgenome software package. You'll need to fork the BSgenome repository for that, then add the seed file, commit, push, and submit a PR.