Closed hpages closed 2 years ago
Hi, @hpages please can you assign me to this task?
Done. Don't hesitate to ask if you have questions.
Thank you sir, I will start working on it right away
Good day sir please, I don't understand what I am to do exactly. I am a bit confused sir
Hi @Simplecodez ,
Can you try to formulate a more precise question? I'd be happy to provide as much clarification as needed but it'll be easier for me if I know a little bit more about what is not clear in the task description above, and whether you have tried things or not already.
Or perhaps you want to try to start with task #46 instead? This is the 1st task in your group of tasks (Frog). See list of tasks here. The current issue is the 2nd task in the group. Note that the other two applicants have chosen to start with the 1st task in their respective group (Dog and Cat), which is to register a UCSC genome in the GenomeInfoDb package. The 2nd task in each group is to register an NCBI assembly in the GenomeInfoDb package. So by choosing the 1st task in the Frog group, you'll be working on a task that is very similar to tasks #43 (Dog) and #49 (Cat). Maybe the discussions there will help you get started with issue #46?
Let me know if you want to switch.
Thank you sir, I have just forked and cloned the repo locally
Good day sir. I would really love to contribute to this project but I don't really know what to do. I have just forked and cloned the repo but don't know the files I am to edit or change sir.
What about my suggestion to switch to #46? I think it's going to be easier for a first task.
Good day, I really appreciate your patience with me and your suggestion, but I have really done a lot of research on this one to back out now. I have created the Xenopus_tropicalis.R file and registered the organism, but when I run R CMD check Xenopus_tropicalis.R, I get this error: Error in getOctD(x, offset, len) : invalid octal digit. I don't know why sir.
... but I have really done a lot of research on this one to back out now.
Hmm.. but you understand that all the research and work you've done so far won't be in vain because you'll resume your work on this issue after you're done with issue #46 right? Anyways, it's up to you.
... but when I run R CMD check Xenopus_tropicalis.R
This is not how we use R CMD check. Please read carefully my IMPORTANT NOTES TO OUTREACHY APPLICANTS at the top of this issue.
Also it's too early to try to run R CMD check. The R CMD build and R CMD check steps are usually the very last steps before a commit. Before we run them, we need to validate our changes via some "ad hoc manual testing" (as explained in my IMPORTANT NOTES TO OUTREACHY APPLICANTS above).
In this case, the ad hoc manual testing would consist in installing the modified GenomeInfoDb package, starting a fresh R session, loading GenomeInfoDb (with library(GenomeInfoDb)), and checking the following:
registered_NCBI_assemblies("Xenopus tropicalis") works and returns the correct data.
getChromInfoFromNCBI("UCB_Xtro_10.0") works and returns the correct data.
Good day sir, I have successfully installed the edited GenomeInfoDb locally and tested the registered_NCBI_assemblies("Xenopus tropicalis") functionality, which returns the correct data. But when I try the function getChromInfoFromNCBI("UCB_Xtro_10.0"), I get an error saying: Error in function (type, msg, asError = True) : could not retrieve from host: ftp.ncbi.nlm.nih.gov
Hi @Simplecodez ,
But when I try this function getChromInfoFromNCBI("UCB_Xtro_10.0"), an error saying: Error in function (type, msg, asError = True) : could not retrieve from host: ftp.ncbi.nlm.nih.gov
It seems that getChromInfoFromNCBI() was not able to access the NCBI FTP site to download the "Full sequence report" for UCB_Xtro_10.0. (See here for some explanation I provided in another issue about the "Full sequence report".)
This error could happen because the site was temporarily down or because your internet connection was temporarily down. Can you check your internet connection and try again? Also please provide the output of your sessionInfo().
Thanks, H.
Okay, sir. I will try again later. Thank you.
Thank you sir, I just ran getChromInfoFromNCBI("UCB_Xtro_10.0") and it outputs the correct data.
This is the output of sessionInfo():
function (package = NULL)
{
    z <- list()
    z$R.version <- R.Version()
    z$platform <- z$R.version$platform
    if (nzchar(.Platform$r_arch))
        z$platform <- paste(z$platform, .Platform$r_arch, sep = "/")
    z$platform <- paste0(z$platform, " (", 8 * .Machine$sizeof.pointer,
        "-bit)")
    z$locale <- Sys.getlocale()
    z$running <- osVersion
    z$RNGkind <- RNGkind()
    if (is.null(package)) {
        package <- grep("^package:", search(), value = TRUE)
        keep <- vapply(package, function(x) x == "package:base" ||
            !is.null(attr(as.environment(x), "path")), NA)
        package <- .rmpkg(package[keep])
    }
    pkgDesc <- lapply(package, packageDescription, encoding = NA)
    if (length(package) == 0)
        stop("no valid packages were specified")
    basePkgs <- sapply(pkgDesc, function(x) !is.null(x$Priority) &&
        x$Priority == "base")
    z$basePkgs <- package[basePkgs]
    if (any(!basePkgs)) {
        z$otherPkgs <- pkgDesc[!basePkgs]
        names(z$otherPkgs) <- package[!basePkgs]
    }
    loadedOnly <- loadedNamespaces()
    loadedOnly <- loadedOnly[!(loadedOnly %in% package)]
    if (length(loadedOnly)) {
        names(loadedOnly) <- loadedOnly
        pkgDesc <- c(pkgDesc, lapply(loadedOnly, packageDescription))
        z$loadedOnly <- pkgDesc[loadedOnly]
    }
    z$matprod <- as.character(options("matprod"))
    es <- extSoftVersion()
    z$BLAS <- as.character(es["BLAS"])
    z$LAPACK <- La_library()
    l10n <- l10n_info()
    if (!is.null(l10n["system.codepage"]))
        z$system.codepage <- as.character(l10n["system.codepage"])
    if (!is.null(l10n["codepage"]))
        z$codepage <- as.character(l10n["codepage"])
    class(z) <- "sessionInfo"
    z
}
<bytecode: 0x000002188d649bf0>
<environment: namespace:utils>
So sir, can I run R CMD build and R CMD check now?
Thank you sir, i just ran getChromInfoFromNCBI("UCB_Xtro_10.0") and it outputs the correct data.
Great. If you're confident that everything looks good, then please proceed with the R CMD build and R CMD check steps.
If that goes as expected, then commit your work and submit a PR. Don't forget to add the Xenopus_tropicalis.R file (with git add Xenopus_tropicalis.R) before you commit.
This is the output of sessionInfo()
You're showing the body of the function, not the output produced by calling the function. I need the latter. Thanks!
Good day sir. This is the output of my sessionInfo():
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Ubuntu 20.04 x64
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GenomeInfoDb_1.33.11 IRanges_2.31.2 S4Vectors_0.35.4
[4] BiocGenerics_0.43.4
loaded via a namespace (and not attached):
[1] compiler_4.2.1 GenomeInfoDbData_1.2.9 RCurl_1.98-1.9
[4] bitops_1.0-7
Hi @Simplecodez ,
Thanks for providing your sessionInfo(). I see that you've installed R for Windows but that it's "running under Ubuntu 20.04 x64". This is a very unconventional setup. I didn't even know it was possible! Did you install an Ubuntu terminal environment on the Windows Subsystem for Linux (WSL), as documented here? I have no experience with the WSL so I hope that your setup will not be problematic.
For what it's worth, sessionInfo() usually reports something like this on an Ubuntu system:
> sessionInfo()
R version 4.2.0 Patched (2022-05-04 r82318)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS
This is what I get on my machine.
I wish you had a more conventional Linux setup. As previously discussed with you in the #outreachy channel on the community-slack (on Oct 10), this is easy to achieve by installing Ubuntu alongside Windows.
Anyways, were you able to run the R CMD build and R CMD check steps successfully?
Thanks
I got this error when I ran R CMD build
Error in loadvignetteBuilder(pkgdir, True) : vignette builder 'knitr' not found
Please how do I fix this
@Simplecodez Did you see my answer on the community-bioc Slack?
Yes sir, I did. I am installing the knitr package now. Thank you
Good day sir. This is the result of R CMD check GenomeInfoDb_1.33.11.tar.gz
* using log directory 'C:/Users/emma/Desktop/GenomeInfoDb.Rcheck'
* using R version 4.2.1 (2022-06-23 ucrt)
* using platform: x86_64-w64-mingw32 (64-bit)
* using session charset: ISO8859-1
* checking for file 'GenomeInfoDb/DESCRIPTION' ... OK
* this is package 'GenomeInfoDb' version '1.33.11'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... OK
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking whether package 'GenomeInfoDb' can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking dependencies in R code ... NOTE
Unexported object imported by a ':::' call: 'utils:::.roman2numeric'
See the note in ?`:::` about the use of this operator.
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking files in 'vignettes' ... WARNING
Files in the 'vignettes' directory but no files in 'inst/doc':
'Accept-organism-for-GenomeInfoDb.Rnw', 'GenomeInfoDb.Rnw'
* checking examples ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ... ERROR
Running 'run_unitTests.R'
Running the tests in 'tests/run_unitTests.R' failed.
Last 13 lines of output:
1 Test Suite :
GenomeInfoDb RUnit Tests - 21 test functions, 1 error, 0 failures
ERROR in test_seqlevelsStyle_Seqinfo: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 1500 did not have 10 elements
Test files with failing tests
test_seqlevelsStyle.R
test_seqlevelsStyle_Seqinfo
Error in BiocGenerics:::testPackage("GenomeInfoDb") :
unit tests failed for package GenomeInfoDb
Calls: <Anonymous> -> <Anonymous>
Execution halted
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... WARNING
Directory 'inst/doc' does not exist.
Package vignettes without corresponding single PDF/HTML:
'Accept-organism-for-GenomeInfoDb.Rnw'
'GenomeInfoDb.Rnw'
* checking running R code from vignettes ... NONE
'Accept-organism-for-GenomeInfoDb.Rnw' using 'UTF-8'... OK
'GenomeInfoDb.Rnw' using 'UTF-8'... OK
* checking re-building of vignette outputs ... ERROR
Error(s) in re-building vignettes:
--- re-building 'Accept-organism-for-GenomeInfoDb.Rnw' using knitr
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, aperm,
append, as.data.frame, basename, cbind, colnames, dirname,
do.call, duplicated, eval, evalq, get, grep, grepl, intersect,
is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
setdiff, sort, table, tapply, union, unique, unsplit, which.max,
which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
I, expand.grid, unname
Loading required package: IRanges
Attaching package: 'IRanges'
The following object is masked from 'package:grDevices':
windows
Error: processing vignette 'Accept-organism-for-GenomeInfoDb.Rnw' failed with diagnostics:
pdflatex is not available
--- failed re-building 'Accept-organism-for-GenomeInfoDb.Rnw'
--- re-building 'GenomeInfoDb.Rnw' using knitr
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Error: processing vignette 'GenomeInfoDb.Rnw' failed with diagnostics:
pdflatex is not available
--- failed re-building 'GenomeInfoDb.Rnw'
SUMMARY: processing the following files failed:
'Accept-organism-for-GenomeInfoDb.Rnw' 'GenomeInfoDb.Rnw'
Error: Vignette re-building failed.
Execution halted
* checking PDF version of manual ... WARNING
LaTeX errors when creating PDF version.
This typically indicates Rd problems.
* checking PDF version of manual without index ... ERROR
Re-running with no redirection of stdout/stderr.
* DONE
Status: 3 ERRORs, 3 WARNINGs, 1 NOTE
This is the result of R CMD build GenomeInfoDb
* checking for file 'GenomeInfoDb/DESCRIPTION' ... OK
* preparing 'GenomeInfoDb':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building 'Accept-organism-for-GenomeInfoDb.Rnw' using knitr
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, aperm,
append, as.data.frame, basename, cbind, colnames, dirname,
do.call, duplicated, eval, evalq, get, grep, grepl, intersect,
is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
setdiff, sort, table, tapply, union, unique, unsplit, which.max,
which.min
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
I, expand.grid, unname
Loading required package: IRanges
Attaching package: 'IRanges'
The following object is masked from 'package:grDevices':
windows
Error: processing vignette 'Accept-organism-for-GenomeInfoDb.Rnw' failed with diagnostics:
pdflatex is not available
--- failed re-building 'Accept-organism-for-GenomeInfoDb.Rnw'
--- re-building 'GenomeInfoDb.Rnw' using knitr
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Error: processing vignette 'GenomeInfoDb.Rnw' failed with diagnostics:
pdflatex is not available
--- failed re-building 'GenomeInfoDb.Rnw'
SUMMARY: processing the following files failed:
'Accept-organism-for-GenomeInfoDb.Rnw' 'GenomeInfoDb.Rnw'
Error: Vignette re-building failed.
Execution halted
Hi @Simplecodez ,
It looks like you don't have the pdflatex command on your system. This command is part of TeX/LaTeX.
TeX/LaTeX is required to build the Sweave vignettes contained in the package. Vignettes are documents located in the vignettes/ folder of an R package. There are 2 types of vignettes: Sweave vignettes (.Rnw extension) and R Markdown vignettes (.Rmd extension). Only Sweave vignettes require TeX/LaTeX.
On Ubuntu, and other Debian-like systems, you can install TeX/LaTeX with:
sudo apt-get install texlive
Make sure to also install all the following additional Debian packages:
texlive-font-utils
texlive-pstricks
texlive-latex-extra
texlive-fonts-extra
texlive-bibtex-extra
texlive-science
texlive-luatex
texlive-lang-european
texi2html
texinfo
pandoc
pandoc-citeproc
biber
All these Debian packages can be installed with sudo apt-get install <package>.
Then try R CMD build GenomeInfoDb again.
Let me know how that goes.
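Before re-running R CMD build, it can help to confirm that the tools are actually on the PATH. This is a minimal sketch, assuming a POSIX shell. The helper name missing_commands is made up for illustration, and pdflatex, pandoc and biber are the executables that the Debian packages listed above are expected to provide:

```shell
# Sketch: print the name of every given command that cannot be found on PATH.
# missing_commands is a hypothetical helper, not part of R or Bioconductor.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage: an empty result means every listed tool was found.
#   missing_commands pdflatex pandoc biber
```

If pdflatex shows up in the output, the vignette step of R CMD build can be expected to fail with the same "pdflatex is not available" error as above.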
Okay, sir. I will do that and get back to you. Thank you
Good day sir. I am sorry for not getting back to you sooner. I have installed texlive and the additional Debian packages. The result below is what I got after running R CMD build GenomeInfoDb.
I also noticed that GenomeInfoDb_1.33.11.tar.gz has been created in the folder housing GenomeInfoDb
* checking for file ‘GenomeInfoDb/DESCRIPTION’ ... OK
* preparing ‘GenomeInfoDb’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘GenomeInfoDb_1.33.11.tar.gz’
@Simplecodez It's great that you were able to install texlive plus all the additional packages! So now it seems that you can successfully R CMD build GenomeInfoDb. That's really good progress.
One thing I notice is that your fork is still at version 1.33.11. However the GenomeInfoDb repository that you forked from is now at version 1.33.15. Please sync your fork. That will bring it to version 1.33.15. Then run R CMD build GenomeInfoDb again. This time it should produce a source tarball with a name that reflects the latest version of the package (i.e. GenomeInfoDb_1.33.15.tar.gz).
Then you can delete the previous tarball and run R CMD check on the new one. If everything works fine, then commit, push, and create a PR (Pull Request). Don't forget to add Xenopus_tropicalis.R to git (with git add Xenopus_tropicalis.R) before you commit. Thanks!
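The tarball name can be derived from the DESCRIPTION file rather than typed by hand, which helps avoid checking a stale tarball after a version bump. This is a rough sketch, assuming a POSIX shell with awk; expected_tarball is a hypothetical helper, not part of R or GenomeInfoDb:

```shell
# Sketch: print the tarball name that R CMD build is expected to produce,
# i.e. <Package>_<Version>.tar.gz, read from the package's DESCRIPTION file.
# $1 is the path to the package source directory (e.g. GenomeInfoDb).
expected_tarball() {
    awk -F': *' '
        $1 == "Package" { pkg = $2 }
        $1 == "Version" { ver = $2 }
        END {
            if (pkg != "" && ver != "") print pkg "_" ver ".tar.gz"
            else exit 1
        }
    ' "$1/DESCRIPTION"
}

# Usage (assumes R is installed and the current directory contains the
# package source directory):
#   R CMD build GenomeInfoDb
#   R CMD check "$(expected_tarball GenomeInfoDb)"
```

Note that R CMD check is given the tarball, not the source directory or an individual .R file.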
I have synced my fork with the latest version and built and checked the files successfully. I have also created a pull request. Thank you sir
Can I start working on task #46?
I have synced my fork with the latest version and built and checked the files successfully I have also created a pull request. Thank you sir
I noticed I made some mistakes in my first commit so I closed the first pull request and opened another one. I hope there is no problem with that?
Can I start working on task #46?
Absolutely. I just assigned you to the task.
I noticed I made some mistakes in my first commit so I closed the first pull request and opened another one. I hope there is no problem with that?
No problem at all. Thanks for the PR. I'm going to take a look at it.
Okay, thank you
Hi, @hpages. I just made the corrections pointed out in the PR I submitted and created a new PR. I am anticipating your reply sir. Thank you
I also didn't understand what you meant by indentation with tab and spaces, so I just copied a previous registration and edited it accordingly. I hope everything works fine now. Thank you
Hi @Simplecodez , please do not create a new PR each time you make a correction to a PR. This is not necessary and it makes it difficult to follow. All you need to do is make the requested changes, commit them, and push them. The new commits will automatically be added to the current PR. The problem with closing and creating a new PR each time you make a change is that the new PR doesn't include the discussion that we started in the PR that you closed. We want the entire discussion about the PR to remain in one place.
I also didn't understand what you meant by indentation with tab and spaces
What I meant is that the file contains tabs. No other registration file contains tabs:
hpages@spectre:~/github/Simplecodez/GenomeInfoDb/inst/registered/NCBI_assemblies$ grep -P "\t" *.R
Xenopus_tropicalis.R: assembly_level="Chromosome",
Xenopus_tropicalis.R: assembly_level="Chromosome",
They need to be replaced with spaces.
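For anyone hitting the same problem, here is a minimal sketch of how the tabs could be located and replaced from a POSIX shell. The helper names list_tabs and untabify are made up for illustration, and the 4-column tab stop is an assumption: the right width depends on how the other registration files are indented.

```shell
# Sketch: list every line that contains a tab, as "file:line:content".
# /dev/null is passed as an extra file so grep always prints file names.
list_tabs() {
    grep -n "$(printf '\t')" "$@" /dev/null
}

# Sketch: expand tabs to spaces in place, using 4-column tab stops
# (assumed width). This affects every tab in the file, including any
# inside string literals.
untabify() {
    for f in "$@"; do
        tmp="$(mktemp)" || return 1
        expand -t 4 "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}

# Usage, from inst/registered/NCBI_assemblies/:
#   list_tabs *.R
#   untabify Xenopus_tropicalis.R
```

After untabify, list_tabs should print nothing for the file. It is worth reviewing the result with git diff before committing.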
Thanks!
I have made the corrections sir Thank you
Thanks for removing the tabs. More comments in PR #70. Please address so I can merge the PR and we can finally focus on task #46. Thanks again.
Hi @Simplecodez,
I just merged PR #70. :tada:
Congratulations on your first contribution to Bioconductor! Don't forget to record it on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
Let's focus on #46 now. I'll go there and try to answer your questions.
Thank you very much sir. I am really honoured to be a contributor. Thank you for your help and patience.
Hi @Simplecodez,
I just merged PR #70. tada
Congratulation on your first contribution to Bioconductor! Don't forget to record it on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.
Please sir, how do I get my contribution link? Is it this: https://github.com/Bioconductor/GenomeInfoDb/issues/47
Let's focus on #46 now. I'll go there and try to answer your questions.
Yes, I guess you are supposed to use the link to the GitHub issue for the task that you accomplished.
Okay, thank you.
UCB_Xtro_10.0 is a Western clawed frog (Xenopus tropicalis) assembly available at NCBI: https://www.ncbi.nlm.nih.gov/assembly/GCF_000004195.4/
Note that UCB_Xtro_10.0 is the assembly that xenTro10, the latest UCSC genome for the Western clawed frog, is based on. See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.
Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find the Western clawed frog in the UCSC species tree on the left, click on it, then make sure to select the latest X. tropicalis Assembly (xenTro10). This will display a bunch of additional information about the xenTro10 assembly. In particular, it will indicate what NCBI assembly this genome is based on. This information is the Accession ID field. This field is usually set to a GenBank (GCA_000*.*) or RefSeq (GCF_000*.*) accession number.
Note that many NCBI assemblies are already registered in the GenomeInfoDb package (223 as of October 2022!). The registered_NCBI_assemblies() function in GenomeInfoDb returns the list of all the NCBI assemblies that are currently registered in the package. An important thing to be aware of is that getChromInfoFromNCBI() still works on an unregistered assembly, but in "degraded" mode, that is:
NAs instead of FALSEs in the circular column of the returned data.frame.
Registering an assembly fixes that. In other words, once an NCBI assembly is registered in GenomeInfoDb, getChromInfoFromNCBI() will recognize its name and return accurate circularity flags.
See ?getChromInfoFromNCBI (after loading GenomeInfoDb) for more information.
Registering a new NCBI assembly for an organism that is already supported is only a matter of editing the corresponding file in GenomeInfoDb/inst/registered/NCBI_assemblies/. If this is a new organism, then we need to start a new file. See the other files for the naming scheme: the name of the file must be the full scientific name of the organism, with the underscore used as separator, and with the first letter capitalized. Extension must be .R.
IMPORTANT NOTES TO OUTREACHY APPLICANTS:
R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
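The file naming scheme described in the task description (full scientific name, underscores as separators, first letter capitalized, .R extension) can be sketched as follows. This is only an illustration in POSIX shell with awk; organism_to_filename is a hypothetical helper, not a GenomeInfoDb function:

```shell
# Sketch: turn an organism's scientific name into a registration file name.
# Runs of spaces or tabs become a single underscore, the first letter is
# upper-cased, and ".R" is appended. The rest of the name is left untouched.
organism_to_filename() {
    printf '%s\n' "$1" | awk '{
        name = $0
        gsub(/[ \t]+/, "_", name)
        print toupper(substr(name, 1, 1)) substr(name, 2) ".R"
    }'
}

# Usage:
#   organism_to_filename "Xenopus tropicalis"   # prints Xenopus_tropicalis.R
```

The resulting file would then be created under GenomeInfoDb/inst/registered/NCBI_assemblies/.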