Add unit tests for BSgenomeForge:::.get_circ_seqs()

hpages commented 1 year ago

We're going to use the testthat framework to implement the unit tests for BSgenomeForge:::.get_circ_seqs().

testthat is a CRAN package that makes it easy to write and run unit tests. It's used by many Bioconductor packages. Note that some Bioconductor packages use RUnit for their unit tests, which is similar to testthat but has been around for much longer. Choosing between testthat, or RUnit, or any other testing framework, is the developer's choice.

Here are some examples of Bioconductor packages that use testthat: SingleCellExperiment, ensembldb, DESeq2, etc... All these packages contain a tests/testthat/ folder which is where their unit tests are implemented. Don't hesitate to take a look at what they've done.

The unit tests can be run interactively in various ways e.g. by starting R in the directory of the package that you want to test and calling devtools::test(), or with Cmd/Ctrl + Shift + T if you are in RStudio. See https://github.com/r-lib/testthat#readme Note that they will also be run by R CMD check.

BSgenomeForge is already set up to run its unit tests thru testthat, that is:

It already has testthat in its Suggests field.
It already has a tests/ folder with the following content: a testthat.R file and a testthat/ subfolder. The testthat.R file should always contain the same 3 lines:
```
library(testthat)
library(NameOfMyPackage)

test_check("NameOfMyPackage")
```
The testthat/ subfolder is where the tests are actually implemented.

A good way to organize the tests in tests/testthat/ is to create one test_*.R file per function to test. So for the get_circ_seqs() function, let's put all the tests in tests/testthat/test_get_circ_seqs.R.

Let me know, or ask on Slack (our internal Slack or the community-bioc Slack), if you have any questions or need help with this.

H.

kakopo commented 1 year ago

Hi Hervé, please check my latest pull request regarding this issue and let me know what you think. Thank you!

hpages commented 1 year ago

Nice job with your first unit tests!

A few comments:

I prefer that we use test- rather than test_ for the prefix of the test files. That's because what follows test- can actually contain underscores but no dashes, e.g. test-get_circ_seqs.R, so it makes it easier for the eyes to immediately recognize the name of the function being tested.
Note that if you create an alias for the BSgenomeForge:::.get_circ_seqs function with
```
get_circ_seqs <- BSgenomeForge:::.get_circ_seqs
```
then you can use get_circ_seqs() instead of BSgenomeForge:::.get_circ_seqs() everywhere in your unit tests, which is convenient and makes the code a little nicer on the eye. The best place to put the above line is at the very beginning of the test-get_circ_seqs.R file.
No need to use numbered variables test1, test2, test3, etc.. for the values returned by the various calls to get_circ_seqs(). This will add a burden to the long-term maintenance of the tests e.g. when tests are added and/or removed then someone will need to adjust all these numbers. The returned value can be stored in the same variable, that is absolutely not a problem. The name of this variable can be very generic like res (for result), or more specific e.g. circ_seqs here.
I like that you did put a line of comment at the begining of each test_that() block e.g.
```
## NCBI assembly Vibrio cholerae
```
I think it would be nice if the comment could also say what the real circular sequences are e.g.:
```
## NCBI assembly Vibrio cholerae (3 circular sequences: CP043554.1, CP043556.1, CP043555.1)
```
or
```
## NCBI assembly Acidianus infernus (no circular sequences)
```
This would facilitate the maintenance of the tests.
I don't think your tests are covering the following situation: unregistered assembly with assembled molecules, but no circular sequences.
Using a regular expression like "This assembly|circular sequences" in your expect_error() calls is not strict enough. The pipe character (|) means "or" so the test will succeed if the error message returned by get_circ_seqs() contains "This assembly" or "circular sequences". The problem is that get_circ_seqs() can return many different error messages and most of them contain "circular sequences". One of the goals that we are trying to achieve with expect_error() is to make sure that the call to get_circ_seqs() not only returns an error, but also that it returns the right error message. In other words you want to use a regular expression that will only match the specific error message that you expect. One way to achieve this is to use a regular expression that recognizes the most important words in the message. For example, for error message:
```
Error in .check_no_circ_seqs(circ_seqs) :
  This assembly does not contain assembled molecules so it cannot have
  circular sequences.
```
you could use:
```
regexp <- "cannot.+have.+circular.+sequences"
```
By separating the important words with .+ we allow some characters to be between the words (in this case they are separated with whitespaces). If we really want to make sure that the words are separated with whitespaces, we could use:
```
regexp <- "cannot[\\s]+have[\\s]+circular[\\s]+sequences"
```
However since \\s is specific to Perl-style regular expressions, then we also must specify perl=TRUE in the call to expect_error().
I also think that it's good practice to use ignore.case=TRUE in expect_error() because we don't really care about the case used in the error messages. More precisely we don't want the tests to start failing in the future just because someone modified the case of a couple of words in an error message.

So, to summarize, here is what an expect_error() test would look like:

regexp <- "cannot[\\s]+have[\\s]+circular[\\s]+sequences"
expect_error(get_circ_seqs("GCA_009729545.1", chrominfo, c("MT")), regexp, ignore.case=TRUE, perl=TRUE)

Should be "for a registered assembly" instead of "for an registered assembly".

Thanks!

kakopo commented 1 year ago

Okay. Thank you for the feedback. Please have a look at the latest commit!

hpages commented 1 year ago

I merged PR #26 and made some minor edits to tests/testthat/test-get_circ_seqs.R. See commit f5af9971c36c2de98b793dba8e080a93e26ee25f. Please take a look and let me know if you have questions. Also don't forget to resync your fork.

Note that the reason I replaced regexp "registered[\\s]+in[\\s]+the[\\s]+GenomeInfoDb[\\s]+package" with "not[\\s]+registered" is that the former would also accept one of the error messages that start with "This assembly is registered in the GenomeInfoDb package ", and we don't want that (i.e. the test should fail if that's the case). The test should only succeed if the error message says that the assembly is NOT registered. Hope that makes sense.

Also it seems to me that the tests don't cover the case where the user supplies circular sequences that are valid sequence names but are not assembled molecules. This corresponds to this code (at the bottom of get_circ_seqs()):

    if (!all(circ_seqs %in% assembled_molecules))
        stop(wmsg("all the sequence names in 'circ_seqs' must be ",
                  "names of assembled molecules"))

Can you double check that? This would need to be covered too.

Please make the correction in a new PR if you need to add this test.

Let me know if you have questions.

Thanks!

kakopo commented 1 year ago

Note that the reason I replaced regexp "registered[\s]+in[\s]+the[\s]+GenomeInfoDb[\s]+package" with "not[\s]+registered" is that the former would also accept one of the error messages that start with "This assembly is registered in the GenomeInfoDb package ", and we don't want that (i.e. the test should fail if that's the case). The test should only succeed if the error message says that the assembly is NOT registered. Hope that makes sense.

This makes sense. Thank you. I've added the correction and created a new PR, please check it out, thanks.

hpages commented 1 year ago

Looks good. Please make sure that your code is properly indented. Thanks!

kakopo commented 1 year ago

Sorry about that! It should be properly indented now.

hpages commented 1 year ago

Thanks. I just merged PR #27 (please resync your fork).

Whenever you are ready, take a look at issue #25. As always, don't hesitate to ask if you need help.

Bioconductor / BSgenomeForge

Add unit tests for BSgenomeForge:::.get_circ_seqs() #17