Proposed task for Outreachy applicants: Forge BSgenome data package for UCSC genome felCat9

Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs

https://bioconductor.org/packages/BSgenome

9 stars 8 forks source link

Proposed task for Outreachy applicants: Forge BSgenome data package for UCSC genome felCat9 #42

Closed hpages closed 1 year ago

hpages commented 1 year ago

This task depends on this issue being completed first (i.e. PR accepted and merged, and issue closed). Although it's not a requirement that the 2 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Hsapiens.UCSC.hg19 is a BSgenome data package that contains the genomic sequences of the hg19 genome from UCSC. Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.

This task's goal is to make a new BSgenome data package for UCSC genome felCat9. The process of making such package is documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.

Other useful links:

BSgenome issues on GitHub where many Bioconductor users who went thru the process of forging a BSgenome data package have already asked questions about this process. Note that most issues where those questions have been asked (and answered) are now closed so do not exclude closed issues from your research.
Some users have also asked questions about this process on the Bioconductor support site. See questions tagged with "BSgenome" there: https://support.bioconductor.org/tag/bsgenome/
The BSgenome.Hsapiens.UCSC.hg19 landing page: https://bioconductor.org/packages/BSgenome.Hsapiens.UCSC.hg19
List of BSgenome data packages available in Bioconductor: https://bioconductor.org/packages/release/BiocViews.html#___BSgenome
https://github.com/Bioconductor/BSgenomeForge for more information about BSgenome data packages and additional links.

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, there's no need to fork anything. The goal of the task is to produce a new package (a BSgenome data package), but we're not going to add this package to git or GitHub.
To test the new package: install it, start R, load the package, and play around with it. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. Let me know if that's the case and we'll discuss them.
Once you are confident that your new BSgenome data package works as expected, please add your seed file to the inst/extdata/Outreachy/ folder of the BSgenome software package. You'll need to fork the BSgenome repository for that, then add the seed file, commit, push, and submit a PR.
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

kakopo commented 1 year ago

Hi @hpages I'd like to work on this issue please

hpages commented 1 year ago

All right, I just assigned you.

kakopo commented 1 year ago

Hi @hpages. I have been able to run R CMD build and R CMD check on the package with no errors (& 3 NOTES). The resources you shared within the issue were enough to get me through this task. I also followed your conversation with Prisca in pull #46 and took a TONNE of pointers, including adding PkgExamples: genome$chrA1 # same as genome[["chrA1"]] to my seed file (the edit was done for the sake of the differing chromosomes, but as she mentioned, it is a bit difficult to understand, in spite of the shared literature. I'd also love to know what it does). I retained provider_version over genome, because of your mention of backwards compatibility, although the system returned a warning in this regard. On my end, running

identical(seqinfo(Fcatus), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE))

returned FALSE, as getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE) and seqinfo(Fcatus) produced different results. If this isn't an error on my part, I'm definitely curious as to why this is the case. I am currently following the rest of the conversation regarding the forgeBSgenomeDataPkg() package (which was a pretty great question Prisca asked, and the follow up response just as useful). The intro to OO programming was definitely new, I didn't even know R had support for OOP. I'll need to refresh my knowledge! As always, I am incredibly grateful for the information and advice you share with us. I have submitted a PR for my seed file, which I hope is satisfactory.

hpages commented 1 year ago

Hi @kakopo

As promised to Presca here, I'll provide some clarifications about the PkgExamples field soon.

I retained _providerversion over genome, because of your mention of backwards compatibility, although the system returned a warning in this regard.

You got a warning because provider_version is deprecated in favor of genome. Yes provider_version is still supported for backward compatibility but all newly created BSgenome data packages should stop using it and use genome instead.

On my end, running

identical(seqinfo(Fcatus), getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE))

returned FALSE

Interesting! We need to understand why. Can you please share the output produced by your call to forgeBSgenomeDataPkg()? Just copy-paste it here, and put it between two triple-backtick lines so it displays nicely. This means you'll need to forge BSgenome.Fcatus.UCSC.felCat9 again but that should be a relatively fast operation.

Thanks!

kakopo commented 1 year ago

Hi @hpages. I'm actually quite embarrassed, it was a mistake on my part. I forged the package once more and realized the first time I ran R CMD INSTALL with the package name rather than the tarball. All runs fine now and the output is identical. Still, here is the output of my call to forgeBSgenomeDataPkg(). I've replaced provider_version with genome as well.

Creating package in ./BSgenome.Fcatus.UCSC.felCat9 
Copying '/home/dell/seqdatafile/felCat9.2bit' to './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE
Getting chrom info from UCSC with 'getChromInfoFromUCSC("felCat9")' ... DONE
Loading './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE
Sorting sequences as in 'getChromInfoFromUCSC("felCat9")' ... DONE
Checking the sequence lengths ... OK
Writing sequences to './BSgenome.Fcatus.UCSC.felCat9/inst/extdata/single_sequences.2bit' ... DONE

kakopo commented 1 year ago

I was trying to do the same task, this time using the FASTA file instead of a 2bit file. I came across a tool called exonerate that splits the FASTA file using the command fastaexplode felCat9.fasta. Is this an oft used tool for the role?

hpages commented 1 year ago

Hi @kakopo ,

Thanks for replacing provider_version with genome in your seed file.

The output of your forgeBSgenomeDataPkg() call looks as expected. Glad to hear that seqinfo(Fcatus) and getChromInfoFromUCSC("felCat9", as.Seqinfo=TRUE)) are now identical, as they should.

Please read thru this long comment I posted here, where I explain what the use cases are for switching between UCSC and NCBI sequence names. As you will see, it's all about uniformizing the sequence names between two datasets. This is a very common operation. The code I show in the comment is for Dog10K_Boxer_Tasha/canFam6, but it should be easy to adapt for Felis_catus_9.0/felCat9 if you want to give it a try. Just to be clear, I'm talking about the code that starts with:

library(GenomicFeatures)

and that ends with (in your case it will be Fcatus instead of Cfamiliaris):

seqlevelsStyle(genes_ranges) <- seqlevelsStyle(Cfamiliaris)  # uniformize the sequence names

Let me know if you have questions or if you run into trouble when trying to run this code.

Other than that, your seed file looks good. I still need to explain the PkgExamples thing though, which I will do in a separate comment.

hpages commented 1 year ago

I was trying to do the same task, this time using the FASTA file instead of a 2bit file. I came across a tool called exonerate that splits the FASTA file using the command fastaexplode felCat9.fasta. Is this an oft used tool for the role?

Never heard of this tool @kakopo. It might do the job. It's just that I find the split to be easy enough to do in R so I've always used that.

kakopo commented 1 year ago

Hi @hpages. I was able to run the code in the example you shared. It was a useful lesson in the why exactly one would switch between UCSC and NCBI names. The only issue I ran into was running getSeq() gave the following error

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘getSeq’ for signature ‘"missing"’

which was not what I expected, following the code. Still,

> seqlevelsStyle(genes_ranges) <- seqlevelsStyle(Fcatus)  # uniformize the sequence names

returned the correct output. The explanation for the need to switch the sequence names was very helpful, and helped me understand a lot. I'm curious as to why running >genome(txdb) <- "Fcatus" and > seqlevelsStyle(genes_ranges) <- seqlevelsStyle("Felis_catus_9.0") returns the following error

Error in seqlevelsStyle("Felis_catus_9.0") : 
  The style does not have a compatible entry for the species supported by Seqname. Please
  see genomeStyles() for supported species/style

I've been wondering why in spite of the fact that cats have both X and Y molecules in reality, a lot of the work done so far has excluded the Y chromosome. Some of the supported species in genomeStyles() have next to Ensembl in their tables. Why would this be the case? Learning more about txdb objects has been really great too- in the earlier tasks I came across a package TxDb.Mmusculus.UCSC.mm10.knownGene which gave me hell and refused to install (it finally did when I installed the Latex dependencies). I didn't realize till now that its different from the packages we're making now. At the beginning I thought a lot of these tasks were automatic, but with the information shared, I'm really learning a lot, and I am thankful. Also, as Prisca mentioned, using 2bit files is much, much faster than parsing through the fasta files with the forgeBSgenomeDataPkg. I spent quite some time, as the felCat9 file has 4508 sequences. Using 2bit is definitely more effective, but are there any scenarios in which one would use the fasta files over the 2bit ones, save for preference? Thank you!

hpages commented 1 year ago

Hi @kakopo

The only issue I ran into was running getSeq() gave the following error

Did you call getSeq() with no arguments? You need to pass arguments to it. In our case, a BSgenome object and a GRanges object. Look how I called getSeq() on Cfamiliaris and genes_ranges in the code I showed in my long comment to Presca here. Should be easy to adapt to make it work on Fcatus.

I'm curious as to why running >genome(txdb) <- "Fcatus" and > seqlevelsStyle(genes_ranges) <- seqlevelsStyle("Felis_catus_9.0") returns the following error

What GFF file did you use to make your TxDb object? I used a GFF file located in the "FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha in Presca's example. Oh! but unfortunately there's no link to the "FTP directory for RefSeq assembly" on the NCBI page for the Felis_catus_9.0 assembly. Ha! That's a problem :thinking:

In any case, doing genome(txdb) <- "Fcatus" is not correct because you need to use the exact name of the NCBI assembly here, and Fcatus is not.

Also seqlevelsStyle() must be called on an object that contains a Seqinfo object i.e. that supports the seqinfo() getter, like a TxDb, BSgenome, or GRanges object. Calling it on an assembly name like this:

seqlevelsStyle("Felis_catus_9.0")

is not supported.

Anyways, maybe this discussion about harmonizing sequence names is becoming too much of a distraction at this point. We'll come back to it latter.

I've been wondering why in spite of the fact that cats have both X and Y molecules in reality, a lot of the work done so far has excluded the Y chromosome.

Well, only cat males have the Y chromosome. As you can see on the NCBI page for the Felis_catus_9.0 assembly, this assembly is for the genome of a cat female.

in the earlier tasks I came across a package TxDb.Mmusculus.UCSC.mm10.knownGene which gave me hell and refused to install (it finally did when I installed the Latex dependencies)

That's weird. I've no idea why you had such a hard time installing those packages. You should have asked. AFAIK TxDb packages have nothing special and require nothing special. They certainly don't require any LaTeX dependencies to install. I'm not aware of any CRAN or Bioconductor package that depends on LaTeX to install.

are there any scenarios in which one would use the fasta files over the 2bit ones?

Yes, in some very rare situations we might want to choose FASTA over 2bit. That's because of some restrictions of the 2bit format. For example IIRC the cumulative length of all the sequences in a 2bit file cannot be more than a certain limit (2^32 I think). So if the total length of an assembly is more than that limit, then the 2bit format cannot be used. A few months ago someone was trying to forge a BSgenome data package for a Mexican axolotl assembly, and they ran into this problem. See issue #28 for the details.

Note that Mexican axolotl was the largest known animal genome until very recently. Now it seems that it's the Australian lungfish genome (14x larger than the Human genome!). See https://www.uni-wuerzburg.de/en/news-and-events/news/detail/news/the-worlds-largest-animal-genome

kakopo commented 1 year ago

Oh, yes. I actually ran getSeq() twice, once without arguments and once without, just to check what the response would be. The second time I ran getSeq(Fcatus, genes_ranges) I got the expected output! I've seen now that its just because of the missing arguments. I try to run the code with different scenarios to see what happens. Sometimes I find missing packages or in this case, the genomeStyles() function!

In any case, doing genome(txdb) <- "Fcatus" is not correct because you need to use the exact name of the NCBI assembly here, and Fcatus is not.

Also seqlevelsStyle() must be called on an object that contains a Seqinfo object i.e. that supports the seqinfo() getter, like a TxDb, BSgenome, or GRanges object.

I think I get it now! All along I assumed they could be interchangeable, but this is not the case. This has been quite the learning experience.

What GFF file did you use to make your TxDb object? I used a GFF file located in the "FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha in Presca's example. Oh! but unfortunately there's no link to the "FTP directory for RefSeq assembly" on the NCBI page for the Felis_catus_9.0 assembly. Ha! That's a problem

With the GFF file, I simply edited changed the link information for GFF file located in the"FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha to match Felis_catus_9.0 so I typed this instead which seemed to do the job https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/181/335/GCF_000181335.3_Felis_catus_9.0/GCF_000181335.3_Felis_catus_9.0_genomic.gff.gz.

That's weird. I've no idea why you had such a hard time installing those packages. You should have asked

I often think that problems such as packages are too simple to bother you with! I'll definitely bring up such issues from now on. Actually, I've been trying to install the RMariaDB package for a few days now, but I keep running into this error:

Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/RMariaDB_1.2.2.tar.gz'
Content type 'application/x-gzip' length 905003 bytes (883 KB)
==================================================
downloaded 883 KB

* installing *source* package ‘RMariaDB’ ...
** package ‘RMariaDB’ successfully unpacked and MD5 sums checked
** using staged installation
Using PKG_CFLAGS=
Using PKG_LIBS=-lmysqlclient
Using PKG_PLOGR=
-----------------------------[ ANTICONF ]-----------------------------
Configure could not find suitable mysql/mariadb client library. Try installing:
 * deb: libmariadb-dev (Debian, Ubuntu)
 * rpm: mariadb-connector-c-devel | mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
 * csw: mysql56_dev (Solaris)
 * brew: mariadb-connector-c (OSX)
If you already have a mysql client library installed, verify that either
mariadb_config or mysql_config is on your PATH. If these are unavailable
you can also set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------[ ERROR MESSAGE ]----------------------------
<stdin>:1:10: fatal error: mysql.h: No such file or directory
compilation terminated.
-----------------------------------------------------------------------
ERROR: configuration failed for package ‘RMariaDB’
* removing ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2/RMariaDB’
Warning in install.packages :
  installation of package ‘RMariaDB’ had non-zero exit status

and trying to install libmariadb-dev gave:

Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘libmariadb-dev’ is not available for this version of R

Maybe I had tried something else, but the moment I installed all the missing LaTeX packages in issue #49, TxDb.Mmusculus.UCSC.mm10.knownGene installed afterwards.

Otherwise, I can definitely see the rationale to use FASTA files while forging BSgenome packages. Reading of a 9GB tarball file is wild! And I thought the tarball I generated, that clocked to almost 600 mbs was heavy! That 2bit indexes must be 4GB and less is definitely noted. This article was as enlightening as it was interesting. The bit comparing lungfish genetic elements to a computer virus "They multiply on their own but don't have a function. As a scientist you wonder why the 'genetic hard drive' of the lungfish has not crashed long ago given the high number of transposable elements" was amusing, and also made me realize that I had not considered the fact that all (or most) of these chromosomes had specific importance. I'm reading a bit more on this. Thank you for sharing the article, and for answering my questions!

hpages commented 1 year ago

and trying to install libmariadb-dev gave:
Installing package into ‘/home/dell/R/x86_64-pc-linux-gnu-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
  package ‘libmariadb-dev’ is not available for this version of R
libmariadb-dev is not an R package. It's a system library available as a Debian package! This error message:
Configure could not find suitable mysql/mariadb client library. Try installing:
* deb: libmariadb-dev (Debian, Ubuntu)
* rpm: mariadb-connector-c-devel | mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
* csw: mysql56_dev (Solaris)
* brew: mariadb-connector-c (OSX)
shows the name of the system component that needs to be installed for various operating systems. For yours (Ubuntu), it says * deb: libmariadb-dev (Debian, Ubuntu). deb stands for Debian package. Like all Debian packages, you need to install it at the Unix command line with sudo apt-get install <package-name>.

It's great that you were able to find a GFF file for Felis_catus_9.0 on the NCBI FTP site!

Since I thought it would be too complicated to find this file, I made up the little story below. Now that I'm finished writing it, I still share it with you, even if you don't really need this anymore (because you found the GFF file). Feel free to ignore.

Let's say that you work for a team of scientists that study the Cat genome. One day, one coworker announces that she's identified 6 new genes on this genome. She sends you an email with the following information:

Dear colleague,

Last week I identified 6 new genes on the Cat genome. I'm very excited! I used the Felis_catus_9.0 assembly
(GCF_000181335.3) for my work. Here are the genes I identified, with their positions w.r.t. the
Felis_catus_9.0 assembly:

    gene name  sequence                      start position   end position
    ---------  --------                      --------------   ------------
    gene1      chrC1                                 300501         360880
    gene2      chrC2                                1420991        1485760
    gene3      chrC2                                1985444        2075620
    gene4      chrC1_NW_019365375v1_random              935          15777
    gene5      chrC2_NW_019365398v1_random             4466          22888
    gene6      chrC2_NW_019365399v1_random             2801           9469

I would need the DNA sequences for those genes, which I think you should be able to obtain using
Bioconductor. Unfortunately I don't know how to use Bioconductor. Do you think you could help me
with that?
Kind regards

To help her with that, we're going to put the gene positions in a GRanges object:

library(GenomicRanges)

## We construct the GRanges manually. Note that our coworker would normally send the
## gene positions in a text file, typically a tabulated file, that we would import in R with
## read.table() and then convert to a GRanges object.
genes_ranges <- c(gene1="chrC2:300501-360880",
                  gene2="chrC2:1420991-1485760",
                  gene3="chrC2:1985444-2075620",
                  gene4="chrC1_NW_019365375v1_random:935-15777",
                  gene5="chrC2_NW_019365398v1_random:4466-22888",
                  gene6="chrC2_NW_019365399v1_random:2801-9469")
genes_ranges <- GRanges(genes_ranges)

genes_ranges
# GRanges object with 6 ranges and 0 metadata columns:
#                       seqnames          ranges strand
#                          <Rle>       <IRanges>  <Rle>
#   gene1                  chrC2   300501-360880      *
#   gene2                  chrC2 1420991-1485760      *
#   gene3                  chrC2 1985444-2075620      *
#   gene4 chrC1_NW_019365375v1..       935-15777      *
#   gene5 chrC2_NW_019365398v1..      4466-22888      *
#   gene6 chrC2_NW_019365399v1..       2801-9469      *
#   -------
#   seqinfo: 4 sequences from an unspecified genome; no seqlengths

seqinfo(genes_ranges)
# Seqinfo object with 4 sequences from an unspecified genome; no seqlengths:
#   seqnames                    seqlengths isCircular genome
#   chrC2                               NA         NA   <NA>
#   chrC1_NW_019365375v1_random         NA         NA   <NA>
#   chrC2_NW_019365398v1_random         NA         NA   <NA>
#   chrC2_NW_019365399v1_random         NA         NA   <NA>

We know that the gene positions are w.r.t. the Felis_catus_9.0 so let's fix the genome field above:

## Same as doing genome(seqinfo(genes_ranges)) <- "Felis_catus_9.0"
genome(genes_ranges) <- "Felis_catus_9.0"

seqinfo(genes_ranges)
# Seqinfo object with 4 sequences from Felis_catus_9.0 genome; no seqlengths:
#   seqnames                    seqlengths isCircular          genome
#   chrC2                               NA         NA Felis_catus_9.0
#   chrC1_NW_019365375v1_random         NA         NA Felis_catus_9.0
#   chrC2_NW_019365398v1_random         NA         NA Felis_catus_9.0
#   chrC2_NW_019365399v1_random         NA         NA Felis_catus_9.0

Then load BSgenome.Fcatus.UCSC.felCat9 and use getSeq() to try to extract the gene sequences. See how I call getSeq() in Presca's example and adapt to do the same here. Assuming you're calling getSeq() correctly, you'll still get an error like I got the first time I called getSeq() in Presca's example. Something about a sequence not found. That's because the sequence names in your genes_ranges are not found in Fcatus. You'll need to fix this by uniformizing the sequence names.

hpages commented 1 year ago

See https://github.com/Bioconductor/BSgenome/pull/46#issuecomment-1291424086 for the long due explanation about PkgExamples. Let me know if you have any questions.

hpages commented 1 year ago

Hi @kakopo ,

I just merged PR #47.

Don't miss my long due explanation about PkgExamples: https://github.com/Bioconductor/BSgenome/pull/46#issuecomment-1291424086 And do not hesitate to ask if you have any questions.

Next task in your group is #43. It's still about Cat! :cat2: Whenever you are ready, go there and ask to be assigned.

Don't forget to record your contributions on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

kakopo commented 1 year ago

Thank you. I was able to successfully install RMariaDB! The story was much appreciated too- now I know that the GRanges objects are done manually. I have also gone through the explanation about PkgExamples in #46 (comment) which was very insightful and helpful. I take it that this means that simply adding explanatory comments into the PkgExamples field would still show up as valid examples in the package.Rd file.

hpages commented 1 year ago

I take it that this means that simply adding explanatory comments into the PkgExamples field would still show up as valid examples in the package.Rd file.

Yes. If you check the man page for your BSgenome.Fcatus.UCSC.felCat9 package, you'll see that the genome$chrA1 # same as genome[["chrA1"]] line was injected as-is.