Bioconductor / BiocCheck

http://bioconductor.org/packages/BiocCheck
8 stars 26 forks source link

Fix ORCID ID check #105

Closed FreekManders closed 4 years ago

FreekManders commented 4 years ago

The ORCID ID check didn't accept valid IDs ending with an X. This is now fixed. IDs that are too long are now also rejected. I updated the NEWS file as requested and bumped the version number of the DESCRIPTION. I didn't update the vignette, because I didn't add any new features.

name: Fix ORCID ID check about: Fix ORCID ID check title: "[PR] Fix ORCID ID check" labels: '' assignees: ''

If you are submitting a pull request to BiocCheck please follow the instructions outlined in this presentation. This presentation includes steps for forking, creating a branch for you to work on, and useful related information.

Prior to sending the pull request, please verify that R CMD build and R CMD check run without warnings or errors on the latest Bioconductor-devel (currently in May 2020 that would be Bioconductor 3.12) and complete the following steps:

The easiest way to obtain a working environment with Bioconductor-devel in your computer is to use the Bioconductor devel docker image as described in detail in the Bioconductor website.

For more questions, please get in touch with the Bioconductor core team through the Bioconductor Slack workspace.

mtmorgan commented 4 years ago

Seems like there is a good opportunity for a formal unit test here.

Here are some test ids and a vector indicating whether they are valid or not

orcid <- c(
    "0000-0001-6197-3471",
    "0000-0001-6197-347X",
    "0000-0001-6197-34XX",
    "",
    NA_character_
)
valid <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

Here's the current regular expression and test, using the RUnit package (used by BiocCheck, in the inst/unitTests directory)

re <- "^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$"
checkIdentical(valid, grepl(re, orcid))

The test fails (good!) Here's the proposed patch

re- "^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9X]{4}$"
checkIdentical(valid, grepl(re, orcid))

but it also fails (because it allows the second example). So trying again

re <- "^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$"
checkIdentical(valid, grepl(re, orcid))

Success! (but maybe I haven't tried hard enough to break it?)

And then a more compact variant

re <- "^([0-9]{4}-){3}[0-9]{3}[0-9X]$"
checkIdentical(valid, grepl(re, orcid))

More success (but is it readable?)

Is there a more comprehensive suite of possible ORCID ids that trip up the current successes? Is there something more readable than a regular expression test?

The test should be incorporated into the inst/unitTests directory, but to do so I think it actually helps to define in the R/ directory a little helper function

.check_ORCID <- function(orcid) {
    re <- "^([0-9]{4}-){3}[0-9]{3}[0-9X]$"
    grepl(re, orcid)
}

and update the code in the main BiocCheck function to call that. Then the test is easy to implement without having to go through some of the contortions in other parts of the tests files, e.g., no need to create a 'mock' package....

FreekManders commented 4 years ago

Yes, that's better. I didn't think about cases where there are two X's. I do think the more compact version is less readable. The longer version better shows the structure of the ID. Shall I implement these changes or do you want to do this?

mtmorgan commented 4 years ago

Would be great if you wanted to implement the changes, @FreekManders

FreekManders commented 4 years ago

I added the changes. I had to clarify the namespace in the unit test. I guess that is because I'm testing a non-exported function, so it shouldn't be a problem.