Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

Proposed task for Outreachy applicants: Forge BSgenome data package for NCBI assembly Dog10K_Boxer_Tasha #39

Closed hpages closed 1 year ago

hpages commented 1 year ago

This task depends on this issue and issue #38 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.

BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Amellifera.NCBI.AmelHAv3.1 is a BSgenome data package that contains the genomic sequences of NCBI assembly Amel_HAv3.1 (RefSeq assembly accession: GCF_003254395.2). Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.

This task's goal is to make a new BSgenome data package for NCBI assembly Dog10K_Boxer_Tasha. The process of making such package is partially documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.

Here's some additional information

For an NCBI assembly, the sequence data can be found in the form of compressed FASTA files located in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly associated with the assembly. You'll find the links to those FTP directories on the right side of the NCBI page of the Dog10K_Boxer_Tasha assembly, below the link to the Full sequence report.

Note that NCBI uses the .fna suffix for uncompressed FASTA files (UCSC uses the .fa suffix). For compressed FASTA files, NCBI and UCSC both add the .gz suffix to the .fna or .fa suffix. As you will see, NCBI can provide several .fna.gz files in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly.

The file that contains all the DNA sequences for the assembly is expected to have a name that looks like this: <RefSeq assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for RefSeq assembly, and <GenBank assembly accession>_<assembly name>_genomic.fna.gz in the FTP directory for GenBank assembly.

How you will proceed

The first thing you should do is download the appropriate FASTA file. Choose the RefSeq one if Dog10K_Boxer_Tasha was registered with its RefSeq assembly accession, or the GenBank one if it was registered with its GenBank assembly accession.

Then load it into R with the readDNAStringSet() function from the Biostrings package. The object returned by readDNAStringSet() is a DNAStringSet object. Take a look at the names on this object (call names() on the object to extract them, or call head(names()) to look at the 6 first names only). As you can see, these names are long and a little bit cryptic. They don't exactly look like the sequence names returned by getChromInfoFromNCBI("Dog10K_Boxer_Tasha") :cry: We'll take care of this below.

The fasta_to_sorted_2bit.R script

As we know, the forging process only supports the two following options for the input sequence data: a single 2bit file, or a collection of FASTA files with one file per sequence. This means that we won't be able to use the <assembly accession>_<assembly name>_genomic.fna.gz file as-is. We will first need to preprocess the file as follow: (1) either convert the file to the 2bit format, or (2) split the file into a collection of smaller FASTA files where each file contains only one sequence. It is recommended to do (1).

Here are some examples of R scripts that have been used to perform this conversion for other NCBI assemblies:

Note that these scripts also take care of two important things:

  1. They rename the sequences, that is, they replace the long and cryptic names on the DNAStringSet object returned by readDNAStringSet() with the sequence names returned by getChromInfoFromNCBI().
  2. They put the sequences in the same order as in the result returned by getChromInfoFromNCBI().

Choose one script, download it or copy it to your computer, rename it fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R, and adapt it to make it work for your <assembly accession>_<assembly name>_genomic.fna.gz file. When you work on this, it will help you if you try to execute the script line by line, and take the time to study what each line is doing. Don't hesitate to inspect each intermediate result. Name the output file produced by your script Dog10K_Boxer_Tasha.sorted.2bit.

Once the script is ready, please add it to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

The seed file

The seed file that you're going to prepare for the forging process should point to the 2bit file that you produced with your fasta_to_sorted_2bit.R script above.

Prepare the file and use it to forge the BSgenome data package.

Test the new package like you did with BSgenome.Cfamiliaris.UCSC.canFam6. That is: install it, start R, load the package, and play around with it. Proceed with some "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.

Once you are confident that your new BSgenome data package works as expected, please add your seed file to your fork of the BSgenome package, in the inst/extdata/Outreachy/ folder.

Finally commit, push, and submit a PR.

Priceless-P commented 1 year ago

Hi @hpages, please can I work on this issue?

hpages commented 1 year ago

Absolutely @Priceless-P !

hpages commented 1 year ago

Also don't forget to record https://github.com/Bioconductor/BSgenome/pull/46 on Outreachy.

Priceless-P commented 1 year ago

Also don't forget to record https://github.com/Bioconductor/BSgenome/pull/46 on Outreachy.

Okay thank you @hpages

I have forged BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha successfully. But while testing I discovered that

identical(seqinfo(Cfamiliaris), getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE))

returns FALSE. I have tried to find out what I did wrong but I haven't 😥.

seqinfo(Cfamiliaris) does return NCBI.

This is the result of forgeBSgenomeDataPkg()

> forgeBSgenomeDataPkg("Dog10K_Boxer_Tasha-seed")
Creating package in ./BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha 
Copying '/Users/prisca/Desktop/new/bsgenome/Dog10K_Boxer_Tasha.sorted.2bit' to './BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/inst/extdata/single_sequences.2bit' ... DONE

This is the result of R CMD build

Priscas-MacBook-Air:bsgenome prisca$ R CMD build BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha* checking for file ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/DESCRIPTION’ ... OK
* preparing ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1.tar.gz’

This is the result of R CMD check

Priscas-MacBook-Air:bsgenome prisca$ R CMD check BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1.tar.gz
* using log directory ‘/Users/prisca/Desktop/new/bsgenome/BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha.Rcheck’
* using R version 4.2.1 (2022-06-23)
* using platform: x86_64-apple-darwin17.0 (64-bit)
* using session charset: UTF-8
* checking for file ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/DESCRIPTION’ ... OK
* this is package ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’ version ‘0.0.1’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’ can be installed ... OK
* checking installed package size ... NOTE
  installed size is 556.2Mb
  sub-directories of 1Mb or more:
    extdata  556.1Mb
* checking package directory ... OK
* checking DESCRIPTION meta-information ... NOTE
Package listed in more than one of Depends, Imports, Suggests, Enhances:
  ‘BSgenome’
A package should be listed in only one of these fields.
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... NOTE
prepare_Rd: package.Rd:14-16: Dropping empty section \details
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking PDF version of manual ... OK
* DONE

Status: 3 NOTEs
See
  ‘/Users/prisca/Desktop/new/bsgenome/BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha.Rcheck/00check.log’

And finally sessionInfo()

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1
 [2] BSgenome_1.65.2                                 
 [3] rtracklayer_1.57.0                              
 [4] Biostrings_2.65.6                               
 [5] XVector_0.37.1                                  
 [6] GenomicRanges_1.49.1                            
 [7] GenomeInfoDb_1.33.12                            
 [8] IRanges_2.31.2                                  
 [9] S4Vectors_0.35.4                                
[10] BiocGenerics_0.43.4                             

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9                  lattice_0.20-45             prettyunits_1.1.1          
 [4] Rsamtools_2.13.4            ps_1.7.1                    utf8_1.2.2                 
 [7] digest_0.6.29               mime_0.12                   R6_2.5.1                   
[10] evaluate_0.17               highr_0.9                   pillar_1.8.1               
[13] zlibbioc_1.43.0             rlang_1.0.6                 rstudioapi_0.14            
[16] miniUI_0.1.1.1              callr_3.7.2                 urlchecker_1.0.1           
[19] Matrix_1.5-1                devtools_2.4.5              BiocParallel_1.31.13       
[22] stringr_1.4.1               htmlwidgets_1.5.4           RCurl_1.98-1.9             
[25] shiny_1.7.2                 DelayedArray_0.23.2         xfun_0.33                  
[28] compiler_4.2.1              httpuv_1.6.6                pkgconfig_2.0.3            
[31] pkgbuild_1.3.1              htmltools_0.5.3             SummarizedExperiment_1.27.3
[34] tibble_3.1.8                GenomeInfoDbData_1.2.9      codetools_0.2-18           
[37] matrixStats_0.62.0          XML_3.99-0.11               fansi_1.0.3                
[40] crayon_1.5.2                later_1.3.0                 GenomicAlignments_1.33.1   
[43] bitops_1.0-7                grid_4.2.1                  xtable_1.8-4               
[46] lifecycle_1.0.3             magrittr_2.0.3              cli_3.4.1                  
[49] stringi_1.7.8               cachem_1.0.6                fs_1.5.2                   
[52] promises_1.2.0.1            remotes_2.4.2               vctrs_0.4.2                
[55] ellipsis_0.3.2              rjson_0.2.21                restfulr_0.0.15            
[58] tools_4.2.1                 Biobase_2.57.1              glue_1.6.2                 
[61] purrr_0.3.5                 MatrixGenerics_1.9.1        processx_3.7.0             
[64] pkgload_1.3.0               parallel_4.2.1              fastmap_1.1.0              
[67] yaml_2.3.5                  sessioninfo_1.2.2           memoise_2.0.1              
[70] knitr_1.40                  profvis_0.3.7               BiocIO_1.7.1               
[73] usethis_2.1.6  
hpages commented 1 year ago

Can you show here the output you get when executing the following code?

library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo(Cfamiliaris)

Thanks

Priceless-P commented 1 year ago

Yes. This is it

> seqinfo(Cfamiliaris)
Seqinfo object with 147 sequences (1 circular) from Dog10K_Boxer_Tasha genome:
  seqnames seqlengths isCircular             genome
  1         122014068      FALSE Dog10K_Boxer_Tasha
  2          82037489      FALSE Dog10K_Boxer_Tasha
  3          94329250      FALSE Dog10K_Boxer_Tasha
  4          87912527      FALSE Dog10K_Boxer_Tasha
  5          88913986      FALSE Dog10K_Boxer_Tasha
  ...             ...        ...                ...
  chrUn206      10187      FALSE Dog10K_Boxer_Tasha
  chrUn207        729      FALSE Dog10K_Boxer_Tasha
  chrUn209       6860      FALSE Dog10K_Boxer_Tasha
  chrUn210       7637      FALSE Dog10K_Boxer_Tasha
  chrUn211       1317      FALSE Dog10K_Boxer_Tasha
hpages commented 1 year ago

Thanks. So we need to understand why identical(seqinfo(Cfamiliaris), getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE)) returns FALSE. Let's try to find the differences between the 2 objects. str() is the tool of choice to quickly inspect what's inside an object:

seqinfo1 <- seqinfo(Cfamiliaris)
str(seqinfo1)

seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE))
str(seqinfo2)

Do you see any difference between the outputs of str(seqinfo1) and str(seqinfo2)?

all.equal() can also be used to find the differences between 2 objects. It will summarize them but in a way that is not always easy to interpret. It doesn't hurt to try it though:

all.equal(seqinfo1, seqinfo2)
Priceless-P commented 1 year ago

This is the result I got

> seqinfo1 <- seqinfo(Cfamiliaris)
> str(seqinfo1)
Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  ..@ seqnames   : chr [1:147] "1" "2" "3" "4" ...
  ..@ seqlengths : int [1:147] 122014068 82037489 94329250 87912527 88913986 80213190 80419774 73585679 60315500 69219345 ...
  ..@ is_circular: logi [1:147] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..@ genome     : chr [1:147] "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" ...

str(seqinfo2) returns

> seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE)
> str(seqinfo2)
Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
  ..@ seqnames   : chr [1:147] "1" "2" "3" "4" ...
  ..@ seqlengths : int [1:147] 122014068 82037489 94329250 87912527 88913986 80213190 80419774 73585679 60315500 69219345 ...
  ..@ is_circular: logi [1:147] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..@ genome     : chr [1:147] "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" ...

I can't seem to find any difference between the two from the results above. all.equal(seqinfo1, seqinfo2) however picked something.

[1] "Attributes: < Component “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target

My guess is that the problem might be a difference in seqlengths. But like you said all.equal() is a bit hard to interpret so I don't know what “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target means.

hpages commented 1 year ago

I think that all.equal() is trying to tell us that the seqlengths slot of Seqinfo object seqinfo2 contains an NA. Seqinfo objects are S4 objects, and S4 objects have slots. Even though we can extract the value stored in a slot with the @ operator (e.g. with seqinfo2@seqlengths), the end user is not supposed to do that. The end user should always use the corresponding getter function for that. So in this case they should do:

seqlengths(seqinfo2)

Note that in this case, the getter function returns a named vector whereas doing seqinfo2@seqlengths returns an unnamed vector. So another benefit of using the getter function is that it tends to be a little bit more user-friendly than using @.

So yes, there seems to be an NA in seqlengths(seqinfo2):

table(is.na(seqlengths(seqinfo2)))
# FALSE  TRUE 
#   146     1 

which(is.na(seqlengths(seqinfo2)))
# MT 
# 40 

And it looks that this NA is for the sequence length of the MT sequence.

Now we need to assess whether that NA is expected or whether it's the result of a bug in the software. What do you think? (I'm asking to the Dog10K_Boxer_Tasha specialist here :wink:)

Priceless-P commented 1 year ago

(I'm asking to the Dog10K_Boxer_Tasha specialist here 😉)

Specialist? 😂

Now we need to assess whether that NA is expected or whether it's the result of a bug in the software. What do you think?

Well, I don't think it should be NA. But see what is included in the full report

Sequence-Name   Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
...
MT        assembled-molecule      MT                       Mitochondrion     CM023446.1      <>      NC_002008.4      non-nuclear      na            na
hpages commented 1 year ago

Exactly. That NA is actually coming from the "Full sequence report" for Dog10K_Boxer_Tasha. Glad that you spotted it there!

Let's remember that NCBI is just a huge centralized repository where researchers can submit (i.e. upload) assemblies for the organisms that they are studying. For example, in the case of Dog10K_Boxer_Tasha, the Submitter field on the NCBI page for this assembly says that it was submitted by the Dog Genome Sequencing Consortium.

It's hard to understand why the length of the MT sequence was set to na in the "Full sequence report" that they uploaded. I've already seen this before for a couple of other NCBI assemblies. It's a very rare thing though. No idea why they would do this :confused:

But let's try to dig a little bit more into this. The RefSeq-Accn field for this sequence is set to NC_002008.4 in the "Full sequence report". This is a RefSeq accession number. Let's try to use it to find more information about this sequence. We're going to proceed like we did here when we wanted to find more information about the MT sequence for Felis_catus_9.0.

So it's not that the length of the NC_002008.4 sequence is a mystery! :smile:

Now that you have forged and installed BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha, you should be able to confirm that the length of the MT sequence is as expected, by doing something like this:

library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo1 <- seqinfo(Cfamiliaris)
seqlengths(seqinfo1)["MT"]

Another way to do this is to load the DNA sequence for MT:

Cfamiliaris$MT

length(Cfamiliaris$MT)

So things look good as far as seqinfo1 is concerned. It's not identical to seqinfo2 but we've analyzed the difference and we know why they are different. We know where the problem is coming from, and it's not something that we can "fix". If we ignore the issue with MT's length, the 2 objects are actually the same. You can confirm this by setting the length of MT in seqinfo2 with:

seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]

and by comparing the 2 objects again:

identical(seqinfo1, seqinfo2)  # should be TRUE now!

Let me know if you have further questions about this.

A few last tests I'd like you to do:

Let me know how that goes.

Please add your fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R script and your seed file to your fork of the BSgenome package. Then commit, push, and summit a PR. Thanks!

Priceless-P commented 1 year ago

@hpages, I think my problem is bigger than just "MT" sequence length. seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"] does fix the first issue, however identical(seqinfo1, seqinfo2) still returns false 😥. I found out that seqlength of all the unplaced-scaffold returned from seqinfo1 are very different from those of seqinfo2. I could do seqlengths(seqinfo2) <- seqlengths(seqinfo1) but that doesn't solve the underlying problem.

Here's what I found interesting

seqlevelsStyle(Cfamiliaris) <- "UCSC"
seqlevelsStyle(Cfamiliaris) <- "RefSeq"
seqlevelsStyle(Cfamiliaris) <- "NCBI"

all work on Cfamiliaris (returns the excepted results) even though the seqlength is different I guess it's because the Sequence-Name is the same. I will go ahead and push the fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R and the seed file now. Maybe I'm doing something wrong somewhere.

hpages commented 1 year ago

seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"] does fix the first issue, however identical(seqinfo1, seqinfo2) still returns false 😥. I found out that seqlength of all the unplaced-scaffold returned from seqinfo1 are very different from those of seqinfo2.

Hmm.. interesting. We need to understand what's going on. Remember that all.equal(seqinfo1, seqinfo2) was reporting only one little difference between the two objects: an NA in seqinfo2@seqlengths not present in seqinfo1@seqlengths.

I just used your fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R script and BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha-seed seed file to forge BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha. Everything worked like a charm!

After installing the package, I did the following:

library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo1 <- seqinfo(Cfamiliaris)
seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE)

all.equal(seqinfo1, seqinfo2)
# [1] "Attributes: < Component “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target >"

seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]

all.equal(seqinfo1, seqinfo2)
# [1] TRUE

identical(seqinfo1, seqinfo2)
# [1] TRUE

All looks good!

I'm not sure what you did. Maybe you called getChromInfoFromNCBI() on the wrong assembly? Start again in a fresh R session and let me know how it goes.

Priceless-P commented 1 year ago

It works now.

Start again in a fresh R session

This was what I needed to do. Thanks!

hpages commented 1 year ago

Glad you were able to sort this out.

PR #49 merged. Thanks for your work!

Now that you've completed all the tasks in your group, feel free to choose your next task among the "Additional tasks" listed at the bottom of https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period I'll try to add a few more tasks there as time allows.