Closed hpages closed 2 years ago
Hi @hpages, please can I work on this issue?
Absolutely @Priceless-P !
Also don't forget to record https://github.com/Bioconductor/BSgenome/pull/46 on Outreachy.
Also don't forget to record https://github.com/Bioconductor/BSgenome/pull/46 on Outreachy.
Okay thank you @hpages
I have forged BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha
successfully. But while testing I discovered that
identical(seqinfo(Cfamiliaris), getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE))
returns FALSE
. I have tried to find out what I did wrong but I haven't 😥.
seqinfo(Cfamiliaris)
does return NCBI
.
This is the result of forgeBSgenomeDataPkg()
> forgeBSgenomeDataPkg("Dog10K_Boxer_Tasha-seed")
Creating package in ./BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha
Copying '/Users/prisca/Desktop/new/bsgenome/Dog10K_Boxer_Tasha.sorted.2bit' to './BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/inst/extdata/single_sequences.2bit' ... DONE
This is the result of R CMD build
Priscas-MacBook-Air:bsgenome prisca$ R CMD build BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha* checking for file ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/DESCRIPTION’ ... OK
* preparing ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1.tar.gz’
This is the result of R CMD check
Priscas-MacBook-Air:bsgenome prisca$ R CMD check BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1.tar.gz
* using log directory ‘/Users/prisca/Desktop/new/bsgenome/BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha.Rcheck’
* using R version 4.2.1 (2022-06-23)
* using platform: x86_64-apple-darwin17.0 (64-bit)
* using session charset: UTF-8
* checking for file ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha/DESCRIPTION’ ... OK
* this is package ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’ version ‘0.0.1’
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking for sufficient/correct file permissions ... OK
* checking whether package ‘BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha’ can be installed ... OK
* checking installed package size ... NOTE
installed size is 556.2Mb
sub-directories of 1Mb or more:
extdata 556.1Mb
* checking package directory ... OK
* checking DESCRIPTION meta-information ... NOTE
Package listed in more than one of Depends, Imports, Suggests, Enhances:
‘BSgenome’
A package should be listed in only one of these fields.
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... NOTE
prepare_Rd: package.Rd:14-16: Dropping empty section \details
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking examples ... OK
* checking PDF version of manual ... OK
* DONE
Status: 3 NOTEs
See
‘/Users/prisca/Desktop/new/bsgenome/BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha.Rcheck/00check.log’
And finally sessionInfo()
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.5
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha_0.0.1
[2] BSgenome_1.65.2
[3] rtracklayer_1.57.0
[4] Biostrings_2.65.6
[5] XVector_0.37.1
[6] GenomicRanges_1.49.1
[7] GenomeInfoDb_1.33.12
[8] IRanges_2.31.2
[9] S4Vectors_0.35.4
[10] BiocGenerics_0.43.4
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45 prettyunits_1.1.1
[4] Rsamtools_2.13.4 ps_1.7.1 utf8_1.2.2
[7] digest_0.6.29 mime_0.12 R6_2.5.1
[10] evaluate_0.17 highr_0.9 pillar_1.8.1
[13] zlibbioc_1.43.0 rlang_1.0.6 rstudioapi_0.14
[16] miniUI_0.1.1.1 callr_3.7.2 urlchecker_1.0.1
[19] Matrix_1.5-1 devtools_2.4.5 BiocParallel_1.31.13
[22] stringr_1.4.1 htmlwidgets_1.5.4 RCurl_1.98-1.9
[25] shiny_1.7.2 DelayedArray_0.23.2 xfun_0.33
[28] compiler_4.2.1 httpuv_1.6.6 pkgconfig_2.0.3
[31] pkgbuild_1.3.1 htmltools_0.5.3 SummarizedExperiment_1.27.3
[34] tibble_3.1.8 GenomeInfoDbData_1.2.9 codetools_0.2-18
[37] matrixStats_0.62.0 XML_3.99-0.11 fansi_1.0.3
[40] crayon_1.5.2 later_1.3.0 GenomicAlignments_1.33.1
[43] bitops_1.0-7 grid_4.2.1 xtable_1.8-4
[46] lifecycle_1.0.3 magrittr_2.0.3 cli_3.4.1
[49] stringi_1.7.8 cachem_1.0.6 fs_1.5.2
[52] promises_1.2.0.1 remotes_2.4.2 vctrs_0.4.2
[55] ellipsis_0.3.2 rjson_0.2.21 restfulr_0.0.15
[58] tools_4.2.1 Biobase_2.57.1 glue_1.6.2
[61] purrr_0.3.5 MatrixGenerics_1.9.1 processx_3.7.0
[64] pkgload_1.3.0 parallel_4.2.1 fastmap_1.1.0
[67] yaml_2.3.5 sessioninfo_1.2.2 memoise_2.0.1
[70] knitr_1.40 profvis_0.3.7 BiocIO_1.7.1
[73] usethis_2.1.6
Can you show here the output you get when executing the following code?
library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo(Cfamiliaris)
Thanks
Yes. This is it
> seqinfo(Cfamiliaris)
Seqinfo object with 147 sequences (1 circular) from Dog10K_Boxer_Tasha genome:
seqnames seqlengths isCircular genome
1 122014068 FALSE Dog10K_Boxer_Tasha
2 82037489 FALSE Dog10K_Boxer_Tasha
3 94329250 FALSE Dog10K_Boxer_Tasha
4 87912527 FALSE Dog10K_Boxer_Tasha
5 88913986 FALSE Dog10K_Boxer_Tasha
... ... ... ...
chrUn206 10187 FALSE Dog10K_Boxer_Tasha
chrUn207 729 FALSE Dog10K_Boxer_Tasha
chrUn209 6860 FALSE Dog10K_Boxer_Tasha
chrUn210 7637 FALSE Dog10K_Boxer_Tasha
chrUn211 1317 FALSE Dog10K_Boxer_Tasha
Thanks. So we need to understand why identical(seqinfo(Cfamiliaris), getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE))
returns FALSE
. Let's try to find the differences between the 2 objects. str()
is the tool of choice to quickly inspect what's inside an object:
seqinfo1 <- seqinfo(Cfamiliaris)
str(seqinfo1)
seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE))
str(seqinfo2)
Do you see any difference between the outputs of str(seqinfo1)
and str(seqinfo2)
?
all.equal()
can also be used to find the differences between 2 objects. It will summarize them but in a way that is not always easy to interpret. It doesn't hurt to try it though:
all.equal(seqinfo1, seqinfo2)
This is the result I got
> seqinfo1 <- seqinfo(Cfamiliaris)
> str(seqinfo1)
Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
..@ seqnames : chr [1:147] "1" "2" "3" "4" ...
..@ seqlengths : int [1:147] 122014068 82037489 94329250 87912527 88913986 80213190 80419774 73585679 60315500 69219345 ...
..@ is_circular: logi [1:147] FALSE FALSE FALSE FALSE FALSE FALSE ...
..@ genome : chr [1:147] "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" ...
str(seqinfo2)
returns
> seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE)
> str(seqinfo2)
Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
..@ seqnames : chr [1:147] "1" "2" "3" "4" ...
..@ seqlengths : int [1:147] 122014068 82037489 94329250 87912527 88913986 80213190 80419774 73585679 60315500 69219345 ...
..@ is_circular: logi [1:147] FALSE FALSE FALSE FALSE FALSE FALSE ...
..@ genome : chr [1:147] "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" "Dog10K_Boxer_Tasha" ...
I can't seem to find any difference between the two from the results above. all.equal(seqinfo1, seqinfo2)
however picked something.
[1] "Attributes: < Component “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target
My guess is that the problem might be a difference in seqlengths
. But like you said all.equal()
is a bit hard to interpret so I don't know what “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target
means.
I think that all.equal()
is trying to tell us that the seqlengths
slot of Seqinfo object seqinfo2
contains an NA. Seqinfo objects are S4 objects, and S4 objects have slots. Even though we can extract the value stored in a slot with the @
operator (e.g. with seqinfo2@seqlengths
), the end user is not supposed to do that. The end user should always use the corresponding getter function for that. So in this case they should do:
seqlengths(seqinfo2)
Note that in this case, the getter function returns a named vector whereas doing seqinfo2@seqlengths
returns an unnamed vector. So another benefit of using the getter function is that it tends to be a little bit more user-friendly than using @
.
So yes, there seems to be an NA
in seqlengths(seqinfo2)
:
table(is.na(seqlengths(seqinfo2)))
# FALSE TRUE
# 146 1
which(is.na(seqlengths(seqinfo2)))
# MT
# 40
And it looks that this NA
is for the sequence length of the MT sequence.
Now we need to assess whether that NA
is expected or whether it's the result of a bug in the software. What do you think? (I'm asking to the Dog10K_Boxer_Tasha
specialist here :wink:)
(I'm asking to the Dog10K_Boxer_Tasha specialist here 😉)
Specialist? 😂
Now we need to assess whether that NA is expected or whether it's the result of a bug in the software. What do you think?
Well, I don't think it should be NA. But see what is included in the full report
Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
...
MT assembled-molecule MT Mitochondrion CM023446.1 <> NC_002008.4 non-nuclear na na
Exactly. That NA
is actually coming from the "Full sequence report" for Dog10K_Boxer_Tasha
. Glad that you spotted it there!
Let's remember that NCBI is just a huge centralized repository where researchers can submit (i.e. upload) assemblies for the organisms that they are studying. For example, in the case of Dog10K_Boxer_Tasha
, the Submitter field on the NCBI page for this assembly says that it was submitted by the Dog Genome Sequencing Consortium.
It's hard to understand why the length of the MT
sequence was set to na
in the "Full sequence report" that they uploaded. I've already seen this before for a couple of other NCBI assemblies. It's a very rare thing though. No idea why they would do this :confused:
But let's try to dig a little bit more into this. The RefSeq-Accn
field for this sequence is set to NC_002008.4
in the "Full sequence report". This is a RefSeq accession number. Let's try to use it to find more information about this sequence. We're going to proceed like we did here when we wanted to find more information about the MT
sequence for Felis_catus_9.0
.
LOCUS
line. It says the following:
LOCUS NC_002008 16727 bp DNA circular MAM 07-JUN-2021
This tells you that this sequence is a DNA sequence of length 16727 bp ("bp" stands for base pairs), and that it's a circular sequence.
ORIGIN
section shows the sequence itself:
ORIGIN
1 gttaatgtag cttaattaat aaagcaaggc actgaaaatg ccaagatgag tcgcacgact
61 ccataaacat aaaggtttgg tcctagcctt cctattagtt tttagtagac ttacacatgc
121 aagcctccac gccccagtga gaatgccctt aaaatcacca gtgatctaaa ggagcaggta
...
...
...
16561 agctttattt aaattatata caaatgtatt gctactctag ttaacttaac acaacagtct
16621 tacacgcatt tggtctcgta gtctatctat agatagcatt cccttttttt tccctctcat
16681 atttactatg tattttattt attacgcaca ctacaatttc agtataa
So it's not that the length of the NC_002008.4
sequence is a mystery! :smile:
Now that you have forged and installed BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha, you should be able to confirm that the length of the MT
sequence is as expected, by doing something like this:
library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo1 <- seqinfo(Cfamiliaris)
seqlengths(seqinfo1)["MT"]
Another way to do this is to load the DNA sequence for MT
:
Cfamiliaris$MT
length(Cfamiliaris$MT)
So things look good as far as seqinfo1
is concerned. It's not identical to seqinfo2
but we've analyzed the difference and we know why they are different. We know where the problem is coming from, and it's not something that we can "fix". If we ignore the issue with MT
's length, the 2 objects are actually the same. You can confirm this by setting the length of MT
in seqinfo2
with:
seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]
and by comparing the 2 objects again:
identical(seqinfo1, seqinfo2) # should be TRUE now!
Let me know if you have further questions about this.
A few last tests I'd like you to do:
Cfamiliaris
to their UCSC names.Cfamiliaris
to their RefSeq accession numbers.Cfamiliaris
back to their original names (NCBI).Let me know how that goes.
Please add your fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R
script and your seed file to your fork of the BSgenome package. Then commit, push, and summit a PR. Thanks!
@hpages, I think my problem is bigger than just "MT" sequence length.
seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]
does fix the first issue, however identical(seqinfo1, seqinfo2)
still returns false 😥. I found out that seqlength of all the unplaced-scaffold returned from seqinfo1
are very different from those of seqinfo2
. I could do seqlengths(seqinfo2) <- seqlengths(seqinfo1)
but that doesn't solve the underlying problem.
Here's what I found interesting
seqlevelsStyle(Cfamiliaris) <- "UCSC"
seqlevelsStyle(Cfamiliaris) <- "RefSeq"
seqlevelsStyle(Cfamiliaris) <- "NCBI"
all work on Cfamiliaris
(returns the excepted results) even though the seqlength is different I guess it's because the Sequence-Name is the same. I will go ahead and push the fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R
and the seed file now. Maybe I'm doing something wrong somewhere.
seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]
does fix the first issue, howeveridentical(seqinfo1, seqinfo2)
still returns false 😥. I found out that seqlength of all the unplaced-scaffold returned fromseqinfo1
are very different from those ofseqinfo2
.
Hmm.. interesting. We need to understand what's going on. Remember that all.equal(seqinfo1, seqinfo2)
was reporting only one little difference between the two objects: an NA
in seqinfo2@seqlengths
not present in seqinfo1@seqlengths
.
I just used your fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R
script and BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha-seed
seed file to forge BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha. Everything worked like a charm!
After installing the package, I did the following:
library(BSgenome.Cfamiliaris.NCBI.Dog10KBoxerTasha)
seqinfo1 <- seqinfo(Cfamiliaris)
seqinfo2 <- getChromInfoFromNCBI("Dog10K_Boxer_Tasha", as.Seqinfo=TRUE)
all.equal(seqinfo1, seqinfo2)
# [1] "Attributes: < Component “seqlengths”: 'is.NA' value mismatch: 1 in current 0 in target >"
seqlengths(seqinfo2)["MT"] <- seqlengths(seqinfo1)["MT"]
all.equal(seqinfo1, seqinfo2)
# [1] TRUE
identical(seqinfo1, seqinfo2)
# [1] TRUE
All looks good!
I'm not sure what you did. Maybe you called getChromInfoFromNCBI()
on the wrong assembly? Start again in a fresh R session and let me know how it goes.
It works now.
Start again in a fresh R session
This was what I needed to do. Thanks!
Glad you were able to sort this out.
PR #49 merged. Thanks for your work!
Now that you've completed all the tasks in your group, feel free to choose your next task among the "Additional tasks" listed at the bottom of https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period I'll try to add a few more tasks there as time allows.
This task depends on this issue and issue #38 being completed first (i.e. PRs accepted and merged, and issues closed). Although it's not a requirement that the 3 tasks be completed by the same applicant, it will be a more interesting learning experience if they are.
BSgenome data packages are one of the many types of annotation packages available in Bioconductor. They contain the genomic sequences, which comprise chromosome sequences and other DNA sequences, of a particular genome assembly for a given organism. For example BSgenome.Amellifera.NCBI.AmelHAv3.1 is a BSgenome data package that contains the genomic sequences of NCBI assembly
Amel_HAv3.1
(RefSeq assembly accession:GCF_003254395.2
). Users can easily and efficiently access the sequences, or portions of the sequences, stored in these packages, via a common API implemented in the BSgenome software package.This task's goal is to make a new BSgenome data package for NCBI assembly
Dog10K_Boxer_Tasha
. The process of making such package is partially documented in the "How to forge a BSgenome data package" vignette from the BSgenome software package. The landing page for the BSgenome package contains a link to this vignette.Here's some additional information
For an NCBI assembly, the sequence data can be found in the form of compressed FASTA files located in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly associated with the assembly. You'll find the links to those FTP directories on the right side of the NCBI page of the
Dog10K_Boxer_Tasha
assembly, below the link to the Full sequence report.Note that NCBI uses the
.fna
suffix for uncompressed FASTA files (UCSC uses the.fa
suffix). For compressed FASTA files, NCBI and UCSC both add the.gz
suffix to the.fna
or.fa
suffix. As you will see, NCBI can provide several.fna.gz
files in the FTP directory for RefSeq assembly or FTP directory for GenBank assembly.The file that contains all the DNA sequences for the assembly is expected to have a name that looks like this:
<RefSeq assembly accession>_<assembly name>_genomic.fna.gz
in the FTP directory for RefSeq assembly, and<GenBank assembly accession>_<assembly name>_genomic.fna.gz
in the FTP directory for GenBank assembly.How you will proceed
The first thing you should do is download the appropriate FASTA file. Choose the RefSeq one if
Dog10K_Boxer_Tasha
was registered with its RefSeq assembly accession, or the GenBank one if it was registered with its GenBank assembly accession.Then load it into R with the
readDNAStringSet()
function from the Biostrings package. The object returned byreadDNAStringSet()
is a DNAStringSet object. Take a look at the names on this object (callnames()
on the object to extract them, or callhead(names())
to look at the 6 first names only). As you can see, these names are long and a little bit cryptic. They don't exactly look like the sequence names returned bygetChromInfoFromNCBI("Dog10K_Boxer_Tasha")
:cry: We'll take care of this below.The
fasta_to_sorted_2bit.R
scriptAs we know, the forging process only supports the two following options for the input sequence data: a single 2bit file, or a collection of FASTA files with one file per sequence. This means that we won't be able to use the
<assembly accession>_<assembly name>_genomic.fna.gz
file as-is. We will first need to preprocess the file as follow: (1) either convert the file to the 2bit format, or (2) split the file into a collection of smaller FASTA files where each file contains only one sequence. It is recommended to do (1).Here are some examples of R scripts that have been used to perform this conversion for other NCBI assemblies:
Amel_HAv3.1
:inst/extdata/GentlemanLab/BSgenome.Amellifera.NCBI.AmelHAv3.1-tools/fasta_to_sorted_2bit.R
Macaca_fascicularis_6.0
:inst/extdata/GentlemanLab/BSgenome.Mfascicularis.NCBI.6.0-tools/fasta_to_sorted_2bit.R
Note that these scripts also take care of two important things:
readDNAStringSet()
with the sequence names returned bygetChromInfoFromNCBI()
.getChromInfoFromNCBI()
.Choose one script, download it or copy it to your computer, rename it
fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R
, and adapt it to make it work for your<assembly accession>_<assembly name>_genomic.fna.gz
file. When you work on this, it will help you if you try to execute the script line by line, and take the time to study what each line is doing. Don't hesitate to inspect each intermediate result. Name the output file produced by your scriptDog10K_Boxer_Tasha.sorted.2bit
.Once the script is ready, please add it to your fork of the BSgenome package, in the
inst/extdata/Outreachy/
folder.The seed file
The seed file that you're going to prepare for the forging process should point to the 2bit file that you produced with your
fasta_to_sorted_2bit.R
script above.Prepare the file and use it to forge the BSgenome data package.
Test the new package like you did with BSgenome.Cfamiliaris.UCSC.canFam6. That is: install it, start R, load the package, and play around with it. Proceed with some "ad hoc manual testing". Once everything behaves and looks as expected, run
R CMD build
andR CMD check
on the package. Note thatR CMD check
should always be run on the source tarball produced byR CMD build
.Once you are confident that your new BSgenome data package works as expected, please add your seed file to your fork of the BSgenome package, in the
inst/extdata/Outreachy/
folder.Finally commit, push, and submit a PR.