Forge BSgenome data package for UCSC genome canFam6

Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs

https://bioconductor.org/packages/BSgenome

7 stars 9 forks source link

Forge BSgenome data package for UCSC genome canFam6 #46

Closed Priceless-P closed 1 year ago

Priceless-P commented 1 year ago

I added the canFam6 seed file.

hpages commented 1 year ago

Thanks @Priceless-P

Can I ask you to rename the canFam6-seed file -> BSgenome.Cfamiliaris.UCSC.canFam6-seed? I forgot to mention this earlier, sorry for the inconvenience. Fixed now: https://github.com/Bioconductor/BSgenome/tree/master/inst/extdata/Outreachy

Also there's no need to put GenomicFeatures in Suggests. Some seed files in inst/extdata/GentlemanLab/ do this (e.g. BSgenome.Drerio.UCSC.danRer11-seed), but that's only because the examples in the PkgExamples field use GenomicFeatures (note that those examples will end up in the BSgenome.Drerio.UCSC.danRer11 package when calling forgeBSgenomeDataPkg() on that seed file).

Does the BSgenome.Cfamiliaris.UCSC.canFam6 package that you forged contain examples? Start R, load the package, and open the only man page in the package with ?BSgenome.Cfamiliaris.UCSC.canFam6 (?Cfamiliaris should work too). I think that by default it has some, but they are very generic. Maybe add a PkgExamples field in your seed file. It could be just a one-liner e.g. PkgExamples: genome$chr1 # same as genome[["chr1"]], like in BSgenome.Cjacchus.UCSC.calJac4-seed.

Note that in more recent versions of the seed files, we've renamed the provider_version field -> genome. See for example BSgenome.Cjacchus.UCSC.calJac4-seed. I believe that provider_version still works for backward compatibility, but it should be deprecated.

Here is an interesting experiment. Do seqinfo(Cfamiliaris). This should return exactly the same thing as getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE). You can check this with:

identical(seqinfo(Cfamiliaris), getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE))

This is no coincidence. When you forged the package with forgeBSgenomeDataPkg(), then getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) got called behind the scene to obtain the sequence names and circularity flags for canFam6, and to store them in the Cfamiliaris object. Because the registration file for canFam6 was carefully crafted to specify the circular sequences and to put the sequences in "canonical" order, the sequences in the Cfamiliaris object are also in canonical order, and the circularity flags are the expected ones.

By the way, specifying circ_seqs: "chrM" again in the seed file is probably not needed, because that knowledge is already available via getChromInfoFromUCSC(). Can you remove this and try to forge the package again? Does it work?

Another even more interesting experiment is the following. Right now the sequences in the Cfamiliaris object (a BSgenome object) are named according to the UCSC naming convention. We can see this with:

seqlevelsStyle(Cfamiliaris)  # UCSC

However, the sequences can be renamed to use the NCBI names. Try:

seqlevelsStyle(Cfamiliaris) <- "NCBI"
seqinfo(Cfamiliaris)

This works because of all the work you've done with registering canFam6 and Dog10K_Boxer_Tasha, and to link them together.

You should be able to switch back to the UCSC names with:

seqlevelsStyle(Cfamiliaris)  # NCBI
seqlevelsStyle(Cfamiliaris) <- "UCSC"
seqinfo(Cfamiliaris)

The ability to switch easily from one naming convention to the other is a very popular feature in Bioconductor. There has always been a strong demand for it. getChromInfoFromUCSC(.., map.NCBI=TRUE) is the workhorse behind it.

Finally note that all the above is not restricted to BSgenome objects. It also applies to other Bioconductor objects, like TxDb objects:

library(GenomicFeatures)
txdb <- makeTxDbFromUCSC("canFam6", tablename="ncbiRefSeq")
txdb
# TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: canFam6
# Organism: Canis familiaris
# Taxonomy ID: 9615
# UCSC Table: ncbiRefSeq
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 93774
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-10-20 15:39:46 -0700 (Thu, 20 Oct 2022)
# GenomicFeatures version at creation time: 1.49.7
# RSQLite version at creation time: 2.2.18
# DBSCHEMAVERSION: 1.2

seqinfo(txdb)
# Seqinfo object with 147 sequences (1 circular) from canFam6 genome:
#   seqnames             seqlengths isCircular  genome
#   chr1                  122014068      FALSE canFam6
#   chr2                   82037489      FALSE canFam6
#   chr3                   94329250      FALSE canFam6
#   chr4                   87912527      FALSE canFam6
#   chr5                   88913986      FALSE canFam6
#   ...                         ...        ...     ...
#   chrUn_NW_023329769v1      15679      FALSE canFam6
#   chrUn_NW_023329770v1       9700      FALSE canFam6
#   chrUn_NW_023329771v1        570      FALSE canFam6
#   chrUn_NW_023329772v1       2751      FALSE canFam6
#   chrUn_NW_023329773v1      95878      FALSE canFam6

identical(seqinfo(txdb), getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE))
# [1] TRUE

seqlevelsStyle(txdb)
# [1] "UCSC"

seqlevelsStyle(txdb) <- "NCBI"

seqinfo(txdb)
# Seqinfo object with 147 sequences (1 circular) from Dog10K_Boxer_Tasha genome:
#   seqnames seqlengths isCircular             genome
#   1         122014068      FALSE Dog10K_Boxer_Tasha
#   2          82037489      FALSE Dog10K_Boxer_Tasha
#   3          94329250      FALSE Dog10K_Boxer_Tasha
#   4          87912527      FALSE Dog10K_Boxer_Tasha
#   5          88913986      FALSE Dog10K_Boxer_Tasha
#   ...             ...        ...                ...
#   chrUn95       15679      FALSE Dog10K_Boxer_Tasha
#   chrUn96        9700      FALSE Dog10K_Boxer_Tasha
#   chrUn97         570      FALSE Dog10K_Boxer_Tasha
#   chrUn98        2751      FALSE Dog10K_Boxer_Tasha
#   chrUn99       95878      FALSE Dog10K_Boxer_Tasha

hpages commented 1 year ago

Oh looks like you renamed canFam6-seed -> BSgenome.Cfamiliaris.UCSC.canFam6-seed 2 seconds before I posted my previous comment. Maybe you were reading my mind? :wink:

Priceless-P commented 1 year ago

Oh looks like you renamed canFam6-seed -> BSgenome.Cfamiliaris.UCSC.canFam6-seed 2 seconds before I posted my previous comment. Maybe you were reading my mind? 😉

Lol. I wish I can. Actually, I just happened to be on the repo when I saw your update to the text file in the extdata/Outreachy folder telling us to use the package name with -seed attached to it as the name of the seed file, so I did. Didn't even know you were writing a comment for me.

This is no coincidence. When you forged the package with forgeBSgenomeDataPkg(), then getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) got called behind the scene to obtain the sequence names and circularity flags for canFam6, and to store them in the Cfamiliaris object.

It's beautiful how these functions are linked behind the scene and how you configured them to work together. When I took on this issue, I thought it would take so much to complete. But I was marveled to discover that just one function forgeBSgenomeDataPkg() actually does about 95% of the job! I wonder how it would have been without it. Please where can i find codebase of the forgeBSgenomeDataPkg() function? I would love to see how you made it work, If that is allowed of course.

Maybe add a PkgExamples field in your seed file. I could be just a one-liner e.g. PkgExamples: genome$chr1 # same as genome[["chr1"]], like in BSgenome.Cjacchus.UCSC.calJac4-seed.

I didn't realize PkgExamples had such an effect on the result of the man page. Actually, I intentionally omitted it initially because I didn't really know what genome$chr1 meant. Please do you mind explaining that to me? Meanwhile, I have added it to the seed file. It does make a lot of difference. I also removed cir-seq and changed provider_version to genome and everything works the same.

You should be able to switch back to the UCSC names with:
seqlevelsStyle(Cfamiliaris)  # NCBI
seqlevelsStyle(Cfamiliaris) <- "UCSC"
seqinfo(Cfamiliaris)
What might the use cases be? Why prefer one to another when the information provided by both UCSC and NCBI are publicly available?

Finally note that all the above is not restricted to BSgenome objects. It also applies to other Bioconductor objects, like TxDb objects:

I came across txdb a couple of times in the last few days and I didn't know what it meant. I'm glad you brought it up. So they are BSgenome objects. I wonder what the other BSgenome objects are. I know it might not really be convenient to answer all of my questions right here in the comment. Please can you recommend something like a video or article that covers everything you think I should know on a deeper level?

I learn so much every single time you drop a comment for me. It's a huge privilege to have this much access to you. I will make the most out of it. Thank you for all you do @hpages

hpages commented 1 year ago

Hi @Priceless-P

But I was marveled to discover that just one function forgeBSgenomeDataPkg() actually does about 95% of the job!

It's great that you felt that way :blush: However, note that this is an ideal situation, because the genome is already registered in GenomeInfoDb. Also the fact that we're dealing with an UCSC genome helps (those genomes are the easiest to deal with). Forging a BSgenome data package can get a a little bit more complicated when the genome is not registered, or if we're dealing with a non-UCSC genome (e.g. an NCBI assembly).

Please where can i find codebase of the forgeBSgenomeDataPkg() function? I would love to see how you made it work, If that is allowed of course.

Of course it's allowed. It's even encouraged! We have nothing to hide :wink: This is open source. That's why open source is so great!

You'll find the codebase for forgeBSgenomeDataPkg() in the BSgenome package, in the R/BSgenomeForge.R file. Note that if you type the name of a function at the R prompt, followed by <Enter>, you'll see the body of the function. For example:

getChromInfoFromUCSC
# function (genome, assembled.molecules.only = FALSE, map.NCBI = FALSE, 
#     add.ensembl.col = FALSE, goldenPath.url = getOption("UCSC.goldenPath.url"), 
#     recache = FALSE, as.Seqinfo = FALSE) 
# {
#     if (!isSingleString(genome)) 
#         stop(wmsg("'genome' must be a single string"))
#     if (!isTRUEorFALSE(assembled.molecules.only)) 
#         stop(wmsg("'assembled.molecules.only' must be TRUE or FALSE"))
#     ...
#     ...
#     if (!as.Seqinfo) 
#         return(ans)
#     Seqinfo(seqnames = ans[, "chrom"], seqlengths = ans[, "size"], 
#         isCircular = ans[, "circular"], genome = genome)
# }
# <bytecode: 0x562226084dc8>
# <environment: namespace:GenomeInfoDb>

The last line also shows in what package the function is defined.

For forgeBSgenomeDataPkg we see something quite different though:

forgeBSgenomeDataPkg
# standardGeneric for "forgeBSgenomeDataPkg" defined from package "BSgenome"
# 
# function (x, seqs_srcdir = ".", destdir = ".", verbose = TRUE) 
# standardGeneric("forgeBSgenomeDataPkg")
# <bytecode: 0x562216ceac90>
# <environment: 0x562216d0f860>
# Methods may be defined for arguments: x
# Use  showMethods(forgeBSgenomeDataPkg)  for currently available ones.

This tells us that forgeBSgenomeDataPkg is an S4 generic function defined in the BSgenome package.

S4 is a class system in R that allows OO programming (Object Oriented programming). In this class system, everything is about S4 classes, S4 generic functions, and S4 methods. We use those things a lot in Bioconductor. FWIW the S4Vectors package contains some slides that give a quick overview of the S4 class system. A link to those slides is available on the landing page of the S4Vectors package here.

So when we display the body of forgeBSgenomeDataPkg, we don't see much. That's because the code that does the work is in the methods associated with this generic function, and not in the generic function itself. The generic function only takes care of finding the appropriate method and calling it. This mechanism is called "method dispatch". In the case of forgeBSgenomeDataPkg, method dispatch will look at the class of the value supplied to the x argument (the first argument), and will select the method to call based on that class.

We can display the list of methods defined for that generic function with:

showMethods(forgeBSgenomeDataPkg)
# Function: forgeBSgenomeDataPkg (package BSgenome)
# x="BSgenomeDataPkgSeed"
# x="character"
# x="list"

Only 3 methods. One defined for BSgenomeDataPkgSeed objects, one for character objects (a.k.a. character vectors), and one for list objects.

To see a particular method, we can use selectMethod(). For example, to see the method defined for character vectors:

selectMethod(forgeBSgenomeDataPkg, signature="character")
# Method Definition:
# 
# function (x, seqs_srcdir = ".", destdir = ".", verbose = TRUE) 
# {
#     y <- .readSeedFile(x, verbose = verbose)
#     y <- as.list(y)
#     if (missing(seqs_srcdir)) {
#         seqs_srcdir <- y[["seqs_srcdir"]]
#         if (is.null(seqs_srcdir)) 
#             stop("'seqs_srcdir' argument is missing, and the ", 
#                 "'seqs_srcdir' field is missing in seed file")
#     }
#     y <- y[!(names(y) %in% "seqs_srcdir")]
#     forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir, 
#         verbose = verbose)
# }
# <bytecode: 0x562216cc1818>
# <environment: namespace:BSgenome>
# 
# Signatures:
#         x          
# target  "character"
# defined "character"

This shows us the body of the method, and in what package the method is defined.

There's a lot going on here. S4 is a rich and powerful class system. But it has some important differences with more "traditional" OO programming in other languages like Python, Java, or C++, that can make it a little bit harder to learn. It also has some unique (and very useful) features like S4 coercion methods for turning objects of a given class into objects of another class. There's a lot to discuss about it, but I will stop here for now.

This comment is starting to become quite long so I'll take a break and post it. But I'm not done commenting and answering your questions. I'll do this in the next comment.

Priceless-P commented 1 year ago

Okay. Thanks for the clarification. I definitely have a lot of studying to do.

How about my PR? Did I do something wrong?

hpages commented 1 year ago

Sorry for the wait @Priceless-P, you did nothing wrong, I just got busy with other duties.

Have you tried to remove the circ_seqs: "chrM" line from the seed file? Were you able to forge the package again without it? If that worked, then reinstall the package, start R, load the package, and check the circularity flags of the Cfamiliaris object with isCircular(Cfamiliaris) (this is a shortcut for isCircular(seqinfo(Cfamiliaris))). Do they look as expected? Note that the vector returned by isCircular(Cfamiliaris) is a named logical vector i.e. a logical vector with names on it. You can call which() on that vector to extract the positions of the TRUE values. See ?which in the base package for more information.

If you can confirm that having circ_seqs: "chrM" in the seed file is not needed, then please remove it.

About this field:

SrcDataFiles: canFam6.fa.gz from http://hgdownload.cse.ucsc.edu/goldenPath/canFam6/bigZips/

I wonder if this is really the file that you downloaded.

The extension in canFam6.fa.gz indicates that this is a compressed FASTA file. There's a long preamble displayed at https://hgdownload.cse.ucsc.edu/goldenPath/canFam6/bigZips/ above the directory listing itself. Here is what this long preamble says about canFam6.fa.gz:

canFam6.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

They don't say this is a FASTA file, but we know this because of the fa in fa.gz. If you've never heard of the FASTA format before, here's a link with some details.

If we look at the directory listing:

...
canFam6.chromAlias.txt  2022-09-08 22:19  7.5K  
canFam6.fa.gz           2021-05-24 10:53  725M  
canFam6.fa.masked.gz    2021-05-24 10:59  439M  
...

we see that canFam6.fa.gz is 725M. This is a big file! You probably noticed that when you downloaded it. Depending on the quality of your internet connection, it can take a long time to download.

Let's take a closer look at the file. Note that we can uncompress it with gunzip at the Unix command line:

gunzip canFam6.fa.gz

The file is big so the command will take some time to uncompress it. When it's done, canFam6.fa.gz will be replaced with canFam6.fa. Let's take a look at the size of the uncompressed file:

hpages@spectre:~$ ls -lh canFam6.fa
-rw-rw-r-- 1 hpages hpages 2.2G Oct 22 13:25 canFam6.fa

Not surprisingly, the file is even bigger now. About 3x bigger!

We didn't really need to uncompress the file. The only reason we did it is that now we can take a look at what's inside the file:

hpages@spectre:~$ head canFam6.fa
>chr1
ttttttttttttttttttttctcgtttttttttttttttttctttttttc
tttcttttttttttCCGTTCtgtgagccactgggcggacctcccgtgtat
ctgagtgataccgcgatttccaacgctttcatcaatcagagtagcccgtt
cctacccatgtccgaatgggtagcagacatttgcctcccatgcatacact
tctgagagttgagcttctggcctgtacctcccatcccgccgctggtaccc
ttggcttctcaaaggcctaggctcgctgtttgacggaagtgcttggagag
agaattggaatcaccggcacacaagactagtgcctccaagctcagtccag
cgatttcctagtaattccttggtagactggtgctacatactaagttccat
acgtgagagggtagctgaacgctctgtccaaagcatcttacttctgagag

This is a typical FASTA file. It's organized in records. Each record starts with a header line that contains the name of a sequence, immediately followed by the sequence itself. The first sequence in our file is the DNA sequence of the first Dog chromosome, named chr1 by the UCSC people. Note that FASTA header lines always start with > to distinguish them from the other lines in a record.

Let's extract all the header lines from the file:

hpages@spectre:~$ grep '>' canFam6.fa
>chr1
>chr10
>chr11
>chr12
>chr13
>chr14
...
>chrUn_NW_023329769v1
>chrUn_NW_023329770v1
>chrUn_NW_023329771v1
>chrUn_NW_023329772v1
>chrUn_NW_023329773v1
>chrX

We should see one record per chromosome in the canFam6 genome.

Let's count them:

hpages@spectre:~$ grep '>' canFam6.fa | wc --lines
147

That's exactly what we expected.

So all seems to be good with this file: it contains all the sequences of the canFam6 genome. In theory, that's all what we should need to forge BSgenome.Cfamiliaris.UCSC.canFam6. But the problem is that the file cannot be used as-is. According to the "How to forge a BSgenome data package" vignette:

The sequence data must be in a single twoBit file (e.g. musFur1.2bit) or in a collection of FASTA files (possibly gzip-compressed). If the latter, then you need 1 FASTA file per sequence that you want to put in the target package. In that case the name of each FASTA file must be of the form where is the name of the sequence in it and and are a prefix and a suffix (possibly empty) that are the same for all the FASTA files.

In other words, the forging process only supports the two following options for the sequence data: a single twoBit file (like canFam6.2bit located at https://hgdownload.cse.ucsc.edu/goldenPath/canFam6/bigZips/), or a collection of FASTA files with one file per sequence.

I wonder how you handled this @Priceless-P. Did you manage to split the big canFam6.fa,gz file into one file per sequence?

Priceless-P commented 1 year ago

Sorry for the wait @Priceless-P, you did nothing wrong, I just got busy with other duties.

Okay. I understand.

Yes, I removed circ_seqs: "chrM" and it worked the same. isCircular(Cfamiliaris) also works as excepted

I wonder if this is really the file that you downloaded.

Yes, that is the file I downloaded. I unzipped it the same way you did above and it. I spilt and added them to a folder with this command I found on the Bioconductor support page

> z <- readDNAStringSet("canFam6.fa")
> dir.create("canFam6")
> for(i in 1:38) writeXStringSet(z[i,], paste0("canFam6/", gsub("\\s+", "", names(z)[i], perl = TRUE), ".fa"))
> dir("canFam6/")

It was used for another genome but I did the necessary substitution. I also studied the code and I now understand how it works.

hpages commented 1 year ago

Awesome! It's amazing that you figured how to do this.

Here are some notes about this code:

This form of subsetting, z[i, ], is usually used to extract rows from a 2D object like a matrix or data.frame object. On a 2D object, dim() will return an integer vector of length 2. On a 1D object like an atomic vector, a factor, or a list, it will usually return NULL. Maybe the appearance of DNAStringSet object z is misleading, but DNAStringSet objects are actually considered 1D objects (you can check that dim(z) is NULL). More precisely, these objects are considered vector-like objects, like atomic vectors, factors, and list objects. The standard way to subset a vector-like object is to use z[i]. We call this form of subsetting the 1D form or the linear form.

It's not clear to me why you need this:

gsub("\\s+", "", names(z)[i], perl = TRUE)

It's possible that the code you found on the Bioconductor support site introduced this unnecessary complication, I don't know. But as you can see below, the names on z are "clean" and ready to be used as-is to form the names of the FASTA files that your for loop will write to disk:

head(names(z))
# [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14"

tail(names(z))
# [1] "chrUn_NW_023329769v1" "chrUn_NW_023329770v1" "chrUn_NW_023329771v1"
# [4] "chrUn_NW_023329772v1" "chrUn_NW_023329773v1" "chrX"

Also note that passing those names thru gsub("\\s+", "", ..., perl=TRUE) doesn't do anything to them:

gsub("\\s+", "", head(names(z)), perl=TRUE)
# [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14"

gsub("\\s+", "", tail(names(z)), perl=TRUE)
# [1] "chrUn_NW_023329769v1" "chrUn_NW_023329770v1" "chrUn_NW_023329771v1"
# [4] "chrUn_NW_023329772v1" "chrUn_NW_023329773v1" "chrX"

identical(gsub("\\s+", "", names(z), perl=TRUE), names(z))
# [1] TRUE

In other words, gsub("\\s+", "", ..., perl=TRUE) is a no-op on the sequence names that are in UCSC genome canFam6.

The full canFam6 genome must be included in BSgenome.Cfamiliaris.UCSC.canFam6. That is, all the sequences in canFam6 must be included in the package. This genome contains 147 sequences (as shown by getChromInfoFromUCSC(canFam6), and confirmed by length(z)). So your for loop should be:
```
for (i in seq_along(z)) ...
```
There's is no reason to pick up only the first 38 sequences in z.
File paths should always be constructed with the file.path() function, instead of using paste0() or paste(). Using file.path() is faster and more portable, and it also improves readability. See ?file.path for more information. Note that you can still use paste0() to make the file name, but then file.path() should be used the construct the file path. So your code will look like this: file.path("canFam6", paste0(names(z)[i], ".fa"))

So here's what I'll ask you to try:

Make the suggested changes to the code you used to split the big FASTA file.
Run the code again and check that you end up with 147 FASTA files in your canFam6/ folder.
Forge BSgenome.Cfamiliaris.UCSC.canFam6 again.
Install BSgenome.Cfamiliaris.UCSC.canFam6.
Start R, load BSgenome.Cfamiliaris.UCSC.canFam6, and inspect the Cfamiliaris object. Does it look like expected? In particular, do you confirm that seqinfo(Cfamiliaris) and getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) are identical? I don't think they were before because you apparently had only 38 sequences in Cfamiliaris.

As you can see, I'm using this task as an opportunity to introduce some key concepts and discuss a few other important things with you. There will be more. Hope you don't mind. (And I will answer your question about PkgExamples, I promess :pray:)

Priceless-P commented 1 year ago

Okay, @hpages . I will redo it like you said.

The full canFam6 genome must be included in BSgenome.Cfamiliaris.UCSC.canFam6. That is, all the sequences in canFam6 must be included in the package. This genome contains 147 sequences (as shown by getChromInfoFromUCSC(canFam6), and confirmed by length(z)).

Surprisingly for(i in 1:38) writeXStringSet(z[i,], paste0("canFam6/", gsub("\\s+", "", names(z)[i], perl = TRUE), ".fa")) actually included all the 147 chromosomes! I was expecting some errors myself when i used it but it worked. I just confirmed again

prisca@Priscas-MacBook-Air canFam6 % ls | wc -l
     147

Also, seqinfo(Cfamiliaris) returned this

> seqinfo(Cfamiliaris)
Seqinfo object with 147 sequences (1 circular) from canFam6 genome:
  seqnames             seqlengths isCircular  genome
  chr1                  122014068      FALSE canFam6
  chr2                   82037489      FALSE canFam6
  chr3                   94329250      FALSE canFam6
  chr4                   87912527      FALSE canFam6
  chr5                   88913986      FALSE canFam6
  ...                         ...        ...     ...
  chrUn_NW_023329769v1      15679      FALSE canFam6
  chrUn_NW_023329770v1       9700      FALSE canFam6
  chrUn_NW_023329771v1        570      FALSE canFam6
  chrUn_NW_023329772v1       2751      FALSE canFam6
  chrUn_NW_023329773v1      95878      FALSE canFam6
>

getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) also returns

getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE)
Seqinfo object with 147 sequences (1 circular) from canFam6 genome:
  seqnames             seqlengths isCircular  genome
  chr1                  122014068      FALSE canFam6
  chr2                   82037489      FALSE canFam6
  chr3                   94329250      FALSE canFam6
  chr4                   87912527      FALSE canFam6
  chr5                   88913986      FALSE canFam6
  ...                         ...        ...     ...
  chrUn_NW_023329769v1      15679      FALSE canFam6
  chrUn_NW_023329770v1       9700      FALSE canFam6
  chrUn_NW_023329771v1        570      FALSE canFam6
  chrUn_NW_023329772v1       2751      FALSE canFam6
  chrUn_NW_023329773v1      95878      FALSE canFam6
>

I think they are indeed identical.

Priceless-P commented 1 year ago

So I should refactor the code to

z <- readDNAStringSet("canFam6.fa")
for (i in seq_along(z))file.path("canFam6", paste0(names(z)[i], ".fa"))

I think I missed something. Because I just ran the command and the individual files weren't pasted.

Screen Shot 2022-10-24 at 7 58 53 AM

hpages commented 1 year ago

Surprisingly for(i in 1:38) writeXStringSet(z[i,], paste0("canFam6/", gsub("\\s+", "", names(z)[i], perl = TRUE), ".fa")) actually included all the 147 chromosomes! I was expecting some errors myself when i used it but it worked. I just confirmed again
prisca@Priscas-MacBook-Air canFam6 % ls | wc -l
     147
This code cannot produce 147 files. Maybe you've tried several other things while you were trying to figure out how to split the big FASTA file, and you probably managed to produce 147 files at some point. But that for(i in 1:38) loop cannot have done that. You can check this by deleting the content of your canFam6/ folder and try to run the code again.

When I run this code on my laptop, I end up with the following files in canFam6/:

hpages@spectre:~/canFam6$ ls
chr10.fa  chr15.fa  chr1.fa   chr24.fa  chr29.fa  chr33.fa  chr38.fa  chr7.fa
chr11.fa  chr16.fa  chr20.fa  chr25.fa  chr2.fa   chr34.fa  chr3.fa   chr8.fa
chr12.fa  chr17.fa  chr21.fa  chr26.fa  chr30.fa  chr35.fa  chr4.fa   chr9.fa
chr13.fa  chr18.fa  chr22.fa  chr27.fa  chr31.fa  chr36.fa  chr5.fa
chr14.fa  chr19.fa  chr23.fa  chr28.fa  chr32.fa  chr37.fa  chr6.fa

Then when I try to run forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed"), I get the following error:

> forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed")
Creating package in ./BSgenome.Cfamiliaris.UCSC.canFam6 
Loading 'chr1' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr1.fa' ... DONE
Loading 'chr2' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr2.fa' ... DONE
Loading 'chr3' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr3.fa' ... DONE
Loading 'chr4' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr4.fa' ... DONE
Loading 'chr5' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr5.fa' ... DONE
Loading 'chr6' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr6.fa' ... DONE
Loading 'chr7' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr7.fa' ... DONE
Loading 'chr8' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr8.fa' ... DONE
Loading 'chr9' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr9.fa' ... DONE
Loading 'chr10' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr10.fa' ... DONE
Loading 'chr11' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr11.fa' ... DONE
Loading 'chr12' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr12.fa' ... DONE
Loading 'chr13' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr13.fa' ... DONE
Loading 'chr14' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr14.fa' ... DONE
Loading 'chr15' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr15.fa' ... DONE
Loading 'chr16' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr16.fa' ... DONE
Loading 'chr17' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr17.fa' ... DONE
Loading 'chr18' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr18.fa' ... DONE
Loading 'chr19' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr19.fa' ... DONE
Loading 'chr20' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr20.fa' ... DONE
Loading 'chr21' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr21.fa' ... DONE
Loading 'chr22' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr22.fa' ... DONE
Loading 'chr23' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr23.fa' ... DONE
Loading 'chr24' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr24.fa' ... DONE
Loading 'chr25' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr25.fa' ... DONE
Loading 'chr26' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr26.fa' ... DONE
Loading 'chr27' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr27.fa' ... DONE
Loading 'chr28' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr28.fa' ... DONE
Loading 'chr29' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr29.fa' ... DONE
Loading 'chr30' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr30.fa' ... DONE
Loading 'chr31' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr31.fa' ... DONE
Loading 'chr32' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr32.fa' ... DONE
Loading 'chr33' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr33.fa' ... DONE
Loading 'chr34' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr34.fa' ... DONE
Loading 'chr35' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr35.fa' ... DONE
Loading 'chr36' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr36.fa' ... DONE
Loading 'chr37' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr37.fa' ... DONE
Loading 'chr38' sequence from FASTA file '/home/hpages/github/Priceless-P/canFam6/chr38.fa' ... DONE
Error in getSeqSrcpaths(seqname, prefix = prefix, suffix = suffix, seqs_srcdir = seqs_srcdir) : 
  file(s) not found: /home/hpages/github/Priceless-P/canFam6/chrX.fa

I think I missed something. Because I just ran the command and the individual files weren't pasted.

file.path("canFam6", paste0(names(z)[i], ".fa")) only constructs the file paths of the destination files. You still need to write the files with writeXStringSet().

Priceless-P commented 1 year ago

This code cannot produce 147 files. Maybe you've tried several other things while you were trying to figure out how to split the big FASTA file, and you probably managed to produce 147 files at some point. But that for(i in 1:38) loop cannot have done that. You can check this by deleting the content of your canFam6/ folder and try to run the code again.

You are absolutely right. Indeed I tried several other things while trying to figure it out, I'm sorry for the mix-up.

So here's what I'll ask you to try:

Make the suggested changes to the code you used to split the big FASTA file.

Run the code again and check that you end up with 147 FASTA files in your canFam6/ folder.

Forge BSgenome.Cfamiliaris.UCSC.canFam6 again.

Install BSgenome.Cfamiliaris.UCSC.canFam6.

Start R, load BSgenome.Cfamiliaris.UCSC.canFam6, and inspect the Cfamiliaris object. Does it look like expected? In particular, do you confirm that seqinfo(Cfamiliaris) and getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) are identical?

All checked. And yes, seqinfo(Cfamiliaris) and getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) produces the same results.

I'm sorry for the stress🥺

hpages commented 1 year ago

No stress :relaxed: Those glitches are totally expected and are part of the learning experience. Furthermore, they're never a waste of time because we always learn a lot from them :nerd_face:

All checked. And yes, seqinfo(Cfamiliaris) and getChromInfoFromUCSC("canFam6", as.Seqinfo=TRUE) produces the same results.

Excellent! Make sure to also try all the other things discussed earlier like isCircular(Cfamiliaris), which(isCircular(Cfamiliaris)), switch the sequences names in Cfamiliaris back and forth between UCSC and NCBI with the seqlevelsStyle() setter, do it again on the TxDb object obtained with makeTxDbFromUCSC("canFam6", tablename="ncbiRefSeq"), etc...

Earlier you asked what might the use cases be for switching between UCSC and NCBI names. This is a great question, thanks for asking! The typical use case is to harmonize the sequence names between two datasets. There's often the need to do this when a bioinformatician is in possession of two datasets that are associated with the same genome assembly, but one dataset uses the UCSC names and the other one the NCBI names.

For example, suppose that a bioinformatician has forged and installed BSgenome.Cfamiliaris.UCSC.canFam6, but a coworker sent them a GFF file that uses the NCBI sequence names. GFF files are files that contain annotations for a given genome assembly, like gene positions and other genomic features. See here and here to learn more about these files. More precisely, let's say that a coworker decided that they want to work with the GCF_000002285.5_Dog10K_Boxer_Tasha_genomic.gff.gz file located in the "FTP directory for RefSeq assembly" for Dog10K_Boxer_Tasha (you'll find that link on the right side of the NCBI page of the Dog10K_Boxer_Tasha assembly, below the link to the "Full sequence report"). Let's import this GFF file in R:

library(GenomicFeatures)
gff_file <- "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.5_Dog10K_Boxer_Tasha/GCF_000002285.5_Dog10K_Boxer_Tasha_genomic.gff.gz"
txdb <- makeTxDbFromGFF(gff_file)

The resulting object is a TxDb object:

> class(txdb)
[1] "TxDb"
attr(,"package")
[1] "GenomicFeatures"

Let's take a closer look at this object:

txdb
# TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.5_Dog10K_Boxer_Tasha/GCF_000002285.5_Dog10K_Boxer_Tasha_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 99233
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-10-24 13:18:40 -0700 (Mon, 24 Oct 2022)
# GenomicFeatures version at creation time: 1.49.7
# RSQLite version at creation time: 2.2.18
# DBSCHEMAVERSION: 1.2

seqinfo(txdb)
# Seqinfo object with 43 sequences from an unspecified genome; no seqlengths:
#   seqnames       seqlengths isCircular genome
#   NC_002008.4          <NA>       <NA>   <NA>
#   NC_006583.4          <NA>       <NA>   <NA>
#   NC_006584.4          <NA>       <NA>   <NA>
#   NC_006585.4          <NA>       <NA>   <NA>
#   NC_006586.4          <NA>       <NA>   <NA>
#   ...                   ...        ...    ...
#   NC_006620.4          <NA>       <NA>   <NA>
#   NC_006621.4          <NA>       <NA>   <NA>
#   NW_023329670.1       <NA>       <NA>   <NA>
#   NW_023329727.1       <NA>       <NA>   <NA>
#   NW_023329745.1       <NA>       <NA>   <NA>

We know that the GFF file is associated with the Dog10K_Boxer_Tasha assembly so let's fix the genome field above:

genome(txdb) <- "Dog10K_Boxer_Tasha"  # same as doing genome(seqinfo(txdb)) <- "Dog10K_Boxer_Tasha"
seqinfo(txdb)
# Seqinfo object with 43 sequences from Dog10K_Boxer_Tasha genome; no seqlengths:
#   seqnames       seqlengths isCircular             genome
#   NC_002008.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NC_006583.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NC_006584.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NC_006585.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NC_006586.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   ...                   ...        ...                ...
#   NC_006620.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NC_006621.4          <NA>       <NA> Dog10K_Boxer_Tasha
#   NW_023329670.1       <NA>       <NA> Dog10K_Boxer_Tasha
#   NW_023329727.1       <NA>       <NA> Dog10K_Boxer_Tasha
#   NW_023329745.1       <NA>       <NA> Dog10K_Boxer_Tasha

The names in the seqnames field are not even the official NCBI sequence names of the Dog10K_Boxer_Tasha assembly. They are something else:

seqlevelsStyle(txdb)
# [1] "RefSeq"

But it doesn't matter because we can switch them to the official NCBI or UCSC sequence names with either seqlevelsStyle(txdb) <- "NCBI" or seqlevelsStyle(txdb) <- "UCSC". Let's not do it now. Let's continue with the RefSeq names and see why they're going to be a problem.

Now, as part of their analysis, our bioinformatician needs to obtain the gene sequences for this assembly. The standard way to do this is to first obtain the gene positions (sometimes called "gene ranges"):

gene_ranges <- genes(txdb)
gene_ranges
# GRanges object with 39993 ranges and 1 metadata column:
#              seqnames             ranges strand |     gene_id
#                 <Rle>          <IRanges>  <Rle> | <character>
#      A1BG NC_006583.4 99988667-100002163      + |        A1BG
#      A1CF NC_006608.4  36912827-36995367      + |        A1CF
#     A2ML1 NC_006609.4    9742223-9772407      + |       A2ML1
#   A3GALT2 NC_006584.4  64695829-64716770      + |     A3GALT2
#    A4GALT NC_006592.4  22688321-22712503      + |      A4GALT
#       ...         ...                ...    ... .         ...
#    ZYG11A NC_006587.4  56047405-56132932      - |      ZYG11A
#    ZYG11B NC_006597.4    8748604-8835164      - |      ZYG11B
#       ZYX NC_006598.4    7292681-7301847      - |         ZYX
#     ZZEF1 NC_006587.4  30330515-30431700      - |       ZZEF1
#      ZZZ3 NC_006588.4  71838616-71941048      + |        ZZZ3
#   -------
#   seqinfo: 43 sequences from Dog10K_Boxer_Tasha genome; no seqlengths

Then, load the BSgenome data package that contains the DNA sequences of this assembly and use getSeq() to extract the gene sequences:

library(BSgenome.Cfamiliaris.UCSC.canFam6)

getSeq(Cfamiliaris, gene_ranges)
# Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
#   sequence NC_006583.4 not found
# In addition: Warning message:
# In .merge_two_Seqinfo_objects(x, y) :
#   The 2 combined objects have no sequence levels in common. (Use
#   suppressWarnings() to suppress this warning.)

The reason this doesn't work is because the sequence names in gene_ranges are not found in Cfamiliaris. We fix this by harmonizing the sequence names:

seqlevelsStyle(gene_ranges) <- seqlevelsStyle(Cfamiliaris)  # harmonize the sequence names

gene_ranges
# GRanges object with 39993 ranges and 1 metadata column:
#           seqnames             ranges strand |     gene_id
#              <Rle>          <IRanges>  <Rle> | <character>
#      A1BG     chr1 99988667-100002163      + |        A1BG
#      A1CF    chr26  36912827-36995367      + |        A1CF
#     A2ML1    chr27    9742223-9772407      + |       A2ML1
#   A3GALT2     chr2  64695829-64716770      + |     A3GALT2
#    A4GALT    chr10  22688321-22712503      + |      A4GALT
#       ...      ...                ...    ... .         ...
#    ZYG11A     chr5  56047405-56132932      - |      ZYG11A
#    ZYG11B    chr15    8748604-8835164      - |      ZYG11B
#       ZYX    chr16    7292681-7301847      - |         ZYX
#     ZZEF1     chr5  30330515-30431700      - |       ZZEF1
#      ZZZ3     chr6  71838616-71941048      + |        ZZZ3
#   -------
#   seqinfo: 43 sequences from canFam6 genome; no seqlengths

getSeq(Cfamiliaris, gene_ranges)
# DNAStringSet object of length 39993:
#           width seq                                         names               
#     [1]   13497 CCACCCAGGGATCCTGAATT...TGTGAATAGAATCTTTAAAA A1BG
#     [2]   82541 ATTGTAAACCAAACCTAAAG...ATAAAGTTTTAAAAGTACAA A1CF
#     [3]   30185 TTAGAATTCTATCACAGACT...TTTTGACCCAGGAATGCCTG A2ML1
#     [4]   20942 AAAAAAAAAAAGCTTCCCAG...ATAAAACTATTCCAAAAGTG A3GALT2
#     [5]   24183 CACTGCCTCCCGCCGCCGCC...AAGTGATTCCCTGACACTGA A4GALT
#     ...     ... ...
# [39989]   85528 GGGAACCCGATGTGGGACTC...GAATCGAGTCCTGCATCAGG ZYG11A
# [39990]   86561 CGTAGGTAACGAGGGCCGGG...TCTCGAAGGTGCAATGGAAA ZYG11B
# [39991]    9167 AGGCTGCTCGCGGCCCGGAC...AAAAGACCTGTTTTCCAAAA ZYX
# [39992]  101186 ATGGGGAACGCTCCGAGTCA...TTCTATATATAAAAATACAA ZZEF1
# [39993]  102433 GCAGGTCGAGGCGGCGGCGT...AATAAATAGAAAAAGAGTTA ZZZ3

Note that this harmonization can also be done the other way around, that is, by doing seqlevelsStyle(Cfamiliaris) <- seqlevelsStyle(gene_ranges) instead of seqlevelsStyle(gene_ranges) <- seqlevelsStyle(Cfamiliaris).

Bottom line: The need to harmonize the sequence names between two datasets is a very common situation. And registering and linking a UCSC/NCBI pair like you did for canFam6/Dog10K_Boxer_Tasha will make this work seamlessly for any datasets that are associated with this genome assembly.

I came across txdb a couple of times in the last few days and I didn't know what it meant. I'm glad you brought it up. So they are BSgenome objects.

The txdb variable that I showed in the above examples is a TxDb object. You can see this with class(txdb). TxDb objects are very different from BSgenome objects. The former contain the positions of various genomic features of interest w.r.t. to a given genome assembly (e.g. gene positions, transcript positions, exon positions), and the latter contain the DNA sequences of a given genome assembly. But both support seqinfo() and seqlevelsStyle().

Back to the process of forging BSgenome.Cfamiliaris.UCSC.canFam6: Have you considered using canFam6.2bit intead of canFam6.fa.gz to forge the package? Here's what the long preamble displayed at https://hgdownload.cse.ucsc.edu/goldenPath/canFam6/bigZips/ says about this file:

canFam6.2bit - contains the complete dog/canFam6 genome sequence
    in the 2bit file format.  Repeats from RepeatMasker and Tandem Repeats
    Finder (with period of 12 or less) are shown in lower case; non-repeating
    sequence is shown in upper case.  The utility program, twoBitToFa (available
    from the kent src tree), can be used to extract .fa file(s) from
    this file.  A pre-compiled version of the command line tool can be
    found at:
        http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
    http://genome.ucsc.edu/admin/jk-install.html

The 2bit file format is a modern alternative to the FASTA format for storing DNA sequences. 2bit files tend to be smaller than their FASTA counterpart, and they allow much faster random access. As mentioned in the "How to forge a BSgenome data package" vignette, the 2bit file provided by UCSC can be used to forge a BSgenome data package. It actually makes the process easier because, unlike with the big FASTA file, the 2bit file doesn't need to be split into a collection of smaller files (one small file per sequence). Using a 2bit file instead of a FASTA file, whenever possible, is actually the preferred way to go.

Priceless-P commented 1 year ago

Excellent! Make sure to also try all the other things discussed earlier like isCircular(Cfamiliaris), which(isCircular(Cfamiliaris)), switch the sequences names in Cfamiliaris back and forth between UCSC and NCBI with the seqlevelsStyle() setter, do it again on the TxDb object obtained with makeTxDbFromUCSC("canFam6", tablename="ncbiRefSeq"), etc...

Yes, I tried them. They all worked as expected. I used makeTxDbFromUCSC("canFam6", tablename="ncbiRefSeq") for the first time. I realized that makeTxDbFromUCSC()is under RMariaDB package, so I had to install it.

Bottom line: The need to harmonize the sequence names between two datasets is a very common situation. And registering and linking a UCSC/NCBI pair like you did for canFam6/Dog10K_Boxer_Tasha will make this works seamlessly for any datasets that are associated with this genome assembly.

This makes a lot of sense. Thank you for explaining it to me.

Back to the process of forging BSgenome.Cfamiliaris.UCSC.canFam6: Have you considered using canFam6.2bit instead of canFam6.fa.gz to forge the package? Here's what the long preamble displayed at https://hgdownload.cse.ucsc.edu/goldenPath/canFam6/bigZips/ says about this file:

You mentioned it in this vignette, and actually it was the first file I downloaded. But somehow i was still expecting to make a file per chromosome from the 2bit. Apparently i didn't really understand how it worked. I also got seqs_srcdir wrong (this is what i had: seqs_srcdir: /Users/prisca/Documents/canFam6.2bit instead of just seqs_srcdir: /Users/prisca/Documents) and didn't include seqfile_name: canFam6.2bit in the seed file. Of course it didn't work. So I decided to go for the fa.gz because i found more support online on how to go about splitting the files. But with what I have learned so far, I tried it again (using canFam6.2bit) and it worked. I noticed that BSgenome.Cfamiliaris.UCSC.canFam6/inst/exdata that was produced contains only one file single_sequences.2bit and that this file is enough as it somehow contains all the 147 chromosomes! The process is roughly about 3 times faster than using fa.gz

Thank you for opening my eyes to this.

hpages commented 1 year ago

I tried it again (using canFam6.2bit) and it worked.

:rocket:

I noticed that BSgenome.Cfamiliaris.UCSC.canFam6/inst/exdata that was produced contains only one file single_sequences.2bit and that this file is enough as it somehow contains all the 147 chromosomes!

Yes, what gets included in the BSgenome data package is a 2bit file, independently of whether the input sequence data was in a collection of FASTA files or in a 2bit file. This file is produced by forgeBSgenomeDataPkg(), by processing the input sequence data, and is always named single_sequences.2bit (not the best name ever, but there's a long history behind that). It contains the DNA sequences for all the chromosomes and scaffolds in the genome assembly.

Note that when the input sequence data is in a 2bit file, the single_sequences.2bit file produced by forgeBSgenomeDataPkg() is very similar to the input file. The content of the two files is actually the same, just in a different order (forgeBSgenomeDataPkg() reorders the sequence if needed).

Would you mind updating your seed file to use canFam6.2bit instead of canFam6.fa.gz? There's nothing wrong with using the latter, and you learned a lot by trying to do so. The only reason I ask you to make this change is that it will make the seed file easier to use if someone wants to forge BSgenome.Cfamiliaris.UCSC.canFam6 in the future. They'll just need to download the 2bit file and point seqs_srcdir to the folder where the file was downloaded. This extreme simplicity is why using the 2bit file provided by UCSC is preferred.

Thanks!

Priceless-P commented 1 year ago

Would you mind updating your seed file to use canFam6.2bit instead of canFam6.fa.gz?

Yes, gladly!

hpages commented 1 year ago

Thank you.

About the PkgExamples field in the seed file

The PkgExamples field can be used to add examples to the man page of the package that we want to forge. The field must contain valid R code. But before I continue on this, let's start with some background about how forgeBSgenomeDataPkg() works.

Some background about how the forgeBSgenomeDataPkg() function works

The forgeBSgenomeDataPkg() function uses a package template to create the new package. The package template is in the inst/pkgtemplates/ folder of the BSgenome package. This folder contains 2 subfolders: BSgenome_datapkg/ and MaskedBSgenome_datapkg/. One is the package template for BSgenome data packages, and the other one is the package template for masked BSgenome data packages. forgeBSgenomeDataPkg() uses the first one (BSgenome_datapkg/).

Click on the BSgenome_datapkg/ folder. What we see inside BSgenome_datapkg/ looks a lot like what we find in a usual R package. We see familiar things like the DESCRIPTION and NAMESPACE files, and the R/ and man/ folders.

Click on the DESCRIPTION file. The DESCRIPTION of an R package follows the DCF format (Debian Control File) i.e it's a collection of key/value pairs. See the "1.1.1 The DESCRIPTION file" section in the "Writing R Extensions" manual here for more information about what a DESCRIPTION file should look like. Note that this manual is the primary documentation for developing an R package. It can be a little bit intimidating and it's not meant to be read in a linear way. There's a lot of literature around on this topic that is more accessible and tries to make the reading a little more engaging. However, the "Writing R Extensions" manual remains the ultimate reference on that topic.

As you can see, the DESCRIPTION file in the BSgenome_datapkg/ template contains expected fields like Package, Title, Description, Version, Depends, etc.. But the values for those fields don't seem valid:

Package: @PKGNAME@
Title: @PKGTITLE@
Description: @PKGDESCRIPTION@
Version: @PKGVERSION@
...

In fact, those values (e.g. @PKGNAME@) are placeholders for the real values. More precisely, when we call forgeBSgenomeDataPkg() on a seed file, the function makes a copy of the BSgenome_datapkg/ template, renames it, and injects the real values in the DESCRIPTION file. The real values are taken from the seed file.

Let's go one level up and click on the man/ folder. There's only one .Rd file in this folder. .Rd files are R documentation files. Every R package has a man/ folder with .Rd files in it, one for each man page in the package. There's an entire section in the "Writing R Extensions" manual dedicated to .Rd files and what they should contain. See here.

Click on the package.Rd file. We see a lot of placeholders again in this file. Some of them are the same that we saw in the DESCRIPTION file e.g. @PKGNAME@, @PKGTITLE@, @PKGDESCRIPTION@, etc... Any file in the BSgenome_datapkg/ template can contain placeholders. All of them will be replaced by a real value when we call forgeBSgenomeDataPkg() on a seed file.

Every man page should have an \examples section with valid R code in it. This section is usually the last section in the man page. Find the \examples section is this man page. It starts with the following lines:

@PKGNAME@
genome <- @PKGNAME@
head(seqlengths(genome))
@PKGEXAMPLES@
...

So when we call forgeBSgenomeDataPkg() on your BSgenome.Cfamiliaris.UCSC.canFam6-seed file, this will become:

BSgenome.Cfamiliaris.UCSC.canFam6
genome <- BSgenome.Cfamiliaris.UCSC.canFam6
head(seqlengths(genome))
genome$chr1
...

Note that the genome$chr1 line comes from your seed file. If you look at the package.Rd file in the BSgenome.Cfamiliaris.UCSC.canFam6 package that you forged, that's what you're going to see.

Back to the PkgExamples field

So that is what the PkgExamples field is for. We use it to inject some additional lines of code in the examples section of the man page. Note that the fact that we can use genome to refer to the BSgenome object contained in the package is because of this line placed above the @PKGEXAMPLES@ line:

genome <- @PKGNAME@

Which becomes this line after injection:

genome <- BSgenome.Cfamiliaris.UCSC.canFam6

Finally note that:

If we don't define the PkgExamples field in the seed file, then nothing gets injected. We still end up with a valid \examples section in the BSgenome data package, but it has one less line. It's also very generic. The benefit of adding something like genome$chr1 in the man page is that it will show the user how to load a chromosome sequence in memory (the chromosome sequences are stored on disk and loaded on demand). Also R CMD check runs all the examples in a package so it will test that genome$chr1 actually works and fail if it doesn't, which is a good thing.
We can inject several lines in the \examples section via PkgExamples. See for example this seed file. Note that this seed file is old and we no longer put that kind of long examples in our more recent seed files.

Priceless-P commented 1 year ago

Thank you very much for the detailed explanation. It's very helpful.