Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
7 stars 9 forks source link

Add 'replace' argument to forgeBSgenomeDataPkg() #50

Closed hpages closed 1 year ago

hpages commented 1 year ago

When trying to forge a new BSgenome data package with forgeBSgenomeDataPkg(), the user might sometimes perform several attempts before a successful forgeBSgenomeDataPkg() run (typically with some adjustments to their seed file between each attempt). However, before trying forgeBSgenomeDataPkg() again, the user is expected to manually remove the package folder created by the previous attempt. Otherwise, they'll get an error like this:

> forgeBSgenomeDataPkg("BSgenome.Rnorvegicus.UCSC.rn7-seed")
Creating package in ./BSgenome.Rnorvegicus.UCSC.rn7
Error in Biobase::createPackage(x@Package, destdir, template_path, symvals) : 
  directory './BSgenome.Rnorvegicus.UCSC.rn7' exists; use unlink=TRUE to remove it, or choose another destination directory

Note that the error message is misleading and confusing: it suggests that the user can fix the problem by using unlink=TRUE but trying to do so does not work:

> forgeBSgenomeDataPkg("BSgenome.Rnorvegicus.UCSC.rn7-seed", unlink=TRUE)
Error in forgeBSgenomeDataPkg("BSgenome.Rnorvegicus.UCSC.rn7-seed",  : 
  unused argument (unlink = TRUE)

That's because the forgeBSgenomeDataPkg() does not have this argument:

> args(forgeBSgenomeDataPkg)
function (x, seqs_srcdir = ".", destdir = ".", verbose = TRUE) 
NULL

Proposal

We propose to add the replace argument to the forgeBSgenomeDataPkg() function. replace should be set to FALSE by default.

If the directory to create already exists, and replace is set to TRUE, the function should replace it, that is, remove it and create it again.

If the directory to create already exists, and replace is set to FALSE (the default), the function should fail with an informative error message. For example, the error message could say something like this:

directory './BSgenome.Rnorvegicus.UCSC.rn7' exists; use replace=TRUE to replace it

The new argument should be documented in BSgenome/man/BSgenomeForge.Rd.

Priceless-P commented 1 year ago

Hi @hpages, can I work on this issue?

Priceless-P commented 1 year ago

@hpages, please how do I check if replace is set to TRUE? I tried if (file.exists(pkgdir) && isTRUE(replace)) and also if (file.exists(pkgdir) && replace == TRUE) but none worked.

Also, I get this error when I do ?forgeBSgenomeDataPkg(), after I edited BSgenome/man/BSgenomeForge.Rd

Error Retrieving Help
Error in fetch(key) :
lazy-load database 'Library/Frameworks/Rsesouces/library/BSgenome/help/BSgenome.rdb is corrupt
hpages commented 1 year ago

Hi @Priceless-P ,

Note that forgeBSgenomeDataPkg() is an S4 generic function so we need to add the new replace argument to the generic function and all its methods (BSgenome/R/BSgenomeForge.R defines 3 forgeBSgenomeDataPkg() methods). I suggest that you add the new argument in 4th position i.e. between destdir and verbose. The argument's default value (FALSE) should be the same everywhere.

please how do I check if replace is set to TRUE?,

First make sure to add a check on the replace's value supplied by the user. Use isTRUEorFALSE() for the check. The best place to put this check is at the beginning of the forgeBSgenomeDataPkg() method for BSgenomeDataPkgSeed objects (the 1st method). That's because, the 2 other methods will ultimately call the 1st method. So place something like this at the beginning of the 1st method (for example, right below the first if statement):

        if (!isTRUEorFALSE(replace))
            stop("'replace' must be TRUE or FALSE")

Then, because we've already checked that replace is TRUE or FALSE, we can use it directly in an if statement e.g. we can do things like

if (replace) {
    ...
}

or

if (file.exists(pkgdir) && replace) {
    ...
}

We don't need to wrap it in a call to isTRUE() and we don't need to do replace == TRUE either. Note that we would use isTRUE(replace) if there was the possibility that replace is not TRUE or FALSE e.g. if there was the possibility that it's NA or a logical vector of length != 1 or any other object that would break if (replace) ....

I tried if (file.exists(pkgdir) && isTRUE(replace)) and also if (file.exists(pkgdir) && replace == TRUE) but none worked.

Please provide more details. What error do you get? Note that it's much easier to discuss things like that if you show the code that others can run to reproduce the error.

Also, I get this error when I do ?forgeBSgenomeDataPkg(), after I edited BSgenome/man/BSgenomeForge.Rd

Make sure that you reinstall the package and that you try to open the man page in a fresh R session. Also pay attention to the output of R CMD INSTALL when you reinstall the package. It might be able to detect what's wrong with the man page and emit a NOTE or WARNING about the problem.

Let me know how that goes.

Priceless-P commented 1 year ago

I posted a question 5mins ago then I deleted it. It took me going through my question to figure out what my mistake was 😂. Let me correct my errors and try again. Hopefully, it works this time.

And the ?forgeBSgenomeDataPkg() works fine now.

hpages commented 1 year ago

It took me going through my question to figure out what my mistake was.

This happens quite often. The simple fact of going thru the process of formulating a precise question forces us to re-examine what we did, and that often leads us to see things that we missed before. So yeah, it's not too surprising that you were able to figure out what your mistake was. :smiley:

2 notes about this:

        if (file.exists(pkgdir) && replace) {
          unlink(pkgdir, recursive=TRUE)
          dir.create(pkgdir)
        }
        if (dir.exists(pkgdir) && !replace) {

          stop("directory", pkgdir,
               " exists. Use replace=TRUE to replace it")
        }

There's no reason to use file.exists() in one place and dir.exists() in the other. This will actually cause problems if a regular file named pkgdir already exists and replace is FALSE, because then the if clauses in the 2 if statements will both evaluate to FALSE, and the call to Biobase::createPackage() will fail with the old uninformative error message. So you really want to use file.exists() in both cases.

Also I'd suggest that you reorganize the if statements like this:

        if (file.exists(pkgdir)) {
            if (replace) {
                unlink(pkgdir, recursive=TRUE)
                dir.create(pkgdir)
            } else {
                stop("directory ", pkgdir, " exists. ",
                     "Use replace=TRUE to replace it.")
            }
        }

This makes the code easier to understand. In particular, using nested if statements (an if else statement inside an if statement) makes it immediately obvious that, if pkgdir exists, the 2 things that can happen are exclusive. With the 2 non-nested if statements, this is not immediately clear: the if clauses are more complicated to parse with the eyes, and one needs to pay close attention to them to realize that they are exclusive.

Priceless-P commented 1 year ago

Also I'd suggest that you reorganize the if statements like this: if (file.exists(pkgdir)) { if (replace) { unlink(pkgdir, recursive=TRUE) dir.create(pkgdir) } else { stop("directory ", pkgdir, " exists. ", "Use replace=TRUE to replace it.") } }

@hpages. You are absolutely right. This is a better way to do it.

But I am still missing something. This is my code

setGeneric("forgeBSgenomeDataPkg", signature="x",
    function(x, seqs_srcdir=".", destdir=".", replace=FALSE, verbose=TRUE)
        standardGeneric("forgeBSgenomeDataPkg")
)

setMethod("forgeBSgenomeDataPkg", "BSgenomeDataPkgSeed",
    function(x, seqs_srcdir=".", destdir=".", replace=FALSE, verbose=TRUE)
    {
      if (!isTRUEorFALSE(replace))
        stop("'replace' must be TRUE or FALSE")

        ## The Biobase package is needed for createPackage().
        if (!requireNamespace("Biobase", quietly=TRUE))
            stop("Couldn't load the Biobase package. Please install ",
                 "the Biobase\n  package and try again.")
        template_path <- system.file("pkgtemplates", "BSgenome_datapkg",
                                     package="BSgenome")
        BSgenome_version <- installed.packages()['BSgenome','Version']
        if (is.na(x@genome)) {
            if (is.na(x@provider_version))
                stop("'genome' field is missing in seed file")
            warning("field 'provider_version' is deprecated ",
                    "in favor of 'genome'")
            x@genome <- x@provider_version
        }
        if (!is.na(x@release_name))
            warning("field 'release_name' is deprecated")
        seqnames <- x@seqnames
        if (!is.na(seqnames)) {
            .seqnames <- eval(parse(text=seqnames))
        } else {
            if (is.na(x@seqfile_name)) {
                .seqnames <- seqlevels(Seqinfo(genome=x@genome))
            } else {
                .seqnames <- NULL
            }
        }
        seqnames <- deparse1(.seqnames)
        circ_seqs <- x@circ_seqs
        if (!is.na(circ_seqs)) {
            .circ_seqs <- eval(parse(text=circ_seqs))
        } else {
            si <- Seqinfo(genome=x@genome)
            .circ_seqs <- seqlevels(si)[isCircular(si)]
        }
        circ_seqs <- deparse1(.circ_seqs)
        symvals <- list(
            PKGTITLE=x@Title,
            PKGDESCRIPTION=x@Description,
            PKGVERSION=x@Version,
            AUTHOR=x@Author,
            MAINTAINER=x@Maintainer,
            BSGENOMEVERSION=BSgenome_version,
            SUGGESTS=x@Suggests,
            LICENSE=x@License,
            ORGANISM=x@organism,
            COMMONNAME=x@common_name,
            GENOME=x@genome,
            PROVIDER=x@provider,
            RELEASEDATE=x@release_date,
            SOURCEURL=x@source_url,
            ORGANISMBIOCVIEW=x@organism_biocview,
            BSGENOMEOBJNAME=x@BSgenomeObjname,
            SEQNAMES=seqnames,
            CIRCSEQS=circ_seqs,
            MSEQNAMES=x@mseqnames,
            PKGDETAILS=x@PkgDetails,
            SRCDATAFILES=x@SrcDataFiles,
            PKGEXAMPLES=x@PkgExamples
        )
        ## Should never happen
        if (any(duplicated(names(symvals)))) {
            str(symvals)
            stop("'symvals' contains duplicated symbols")
        }
        ## All symvals should by single strings (non-NA)
        is_OK <- sapply(symvals, isSingleString)
        if (!all(is_OK)) {
            bad_syms <- paste(names(is_OK)[!is_OK], collapse=", ")
            stop("values for symbols ", bad_syms, " are not single strings")
        }

        pkgdir <- file.path(destdir, x@Package)
        if (file.exists(pkgdir)) {
          if (replace) {
            unlink(pkgdir, recursive=TRUE)
            dir.create(pkgdir)
          } else {
            stop("directory ", pkgdir, " exists. ",
                 "Use replace=TRUE to replace it.")
          }
        }

        Biobase::createPackage(x@Package, destdir, template_path, symvals)

I still get this error

>forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed")
Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  : 
  directory ./BSgenome.Cfamiliaris.UCSC.canFam6 exists. Use replace=TRUE to replace it.

> forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed",replace=TRUE)
Error in forgeBSgenomeDataPkg(y, seqs_srcdir = seqs_srcdir, destdir = destdir,  : 
  directory ./BSgenome.Cfamiliaris.UCSC.canFam6 exists. Use replace=TRUE to replace it.

So calling the function without replace works as expected but It doesn't give the expected result if the replace argument is added.

hpages commented 1 year ago

Why are you creating the pkgdir folder? The Biobase::createPackage() function takes care of creating that folder.

hpages commented 1 year ago

BTW I'm surprised that the second time you called forgeBSgenomeDataPkg() above (i.e. with replace=TRUE) you got that error. It seems that you should have gotten the old misleading error message instead i.e. the one coming from Biobase::createPackage(), which doesn't say exactly the same thing as what you're showing above.

Priceless-P commented 1 year ago

I removed dir.create(pkgdir) built, and installed it again and I still get the same error. Let me create a PR so you see the entire code.

BTW I'm surprised that the second time you called forgeBSgenomeDataPkg() above (i.e. with replace=TRUE) you got that error. It seems that you should have gotten the old misleading error message instead i.e. the one coming from Biobase::createPackage(), which doesn't say exactly the same thing as what you're showing above.

I know right? 🤷🏻‍♀️

hpages commented 1 year ago

When you call forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed") or forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed", replace=TRUE), you're calling the function on an object of type character, so the method for character objects will be used. Note that this method itself will call forgeBSgenomeDataPkg() again but this time on a list object (y). The problem is that this call doesn't pass down the replace argument, so the forgeBSgenomeDataPkg() method for list objects will be called with the default value for replace, which is FALSE.

Priceless-P commented 1 year ago

When you call forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed") or forgeBSgenomeDataPkg("BSgenome.Cfamiliaris.UCSC.canFam6-seed", replace=TRUE), you're calling the function on an object of type character, so the method for character objects will be used. Note that this method itself will call forgeBSgenomeDataPkg() again but this time on a list object (y). The problem is that this call doesn't pass down the replace argument, so the forgeBSgenomeDataPkg() method for list object will be called with the default value for replace, which is FALSE.

Ohhh. I think I just saw what you mean. Thank you for pointing that out. It's working fine now.

Also, is my documentation for the replace satisfactory? Do I need to provide a more detailed explanation about it?

hpages commented 1 year ago

Yes, it's satisfactory. What this argument does is very simple so explanation should be short and straight to the point. See #53 for more comments. Thanks!

Priceless-P commented 1 year ago

Okay. Thank you @hpages 😊

hpages commented 1 year ago

Should I take a second look at the PR?

In the meantime please take a look at my long comment here where I try to explain how the code in the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script works. The code in your fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R script works the same way so my explanations apply to your script too.

It's a long due explanation, and it comes a little bit late for you now that we've closed issue #39, but it's never too late to take a close look at how the code in that script works. It discusses the notion of parallel objects which is central to a lot of the R code that we write in Bioconductor.

Be sure to let me know if you have questions.

Priceless-P commented 1 year ago

It's a long due explanation, and it comes a little bit late for you now that we've closed issue https://github.com/Bioconductor/BSgenome/issues/39, but it's never too late to take a close look at how the code in that script works. It discusses the notion of parallel objects which is central to a lot of the R code that we write in Bioconductor.

I will certainly go through it and come back to you with questions if anything is unclear.

hpages commented 1 year ago

Ok so no question on how the code in fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R works? Let's close this then.

Don't forget to record your contribution on Outreachy, thanks!

Priceless-P commented 1 year ago

Ok so no question on how the code in fasta_to_sorted_2bit_for_Dog10K_Boxer_Tasha.R works? Let's close this then.

Not I don't. Your explanation was very clear.

Don't forget to record your contribution on Outreachy, thanks!

Already did. Thank you!