Add 'organism' argument to registered_UCSC_genomes()

hpages commented 2 years ago

We propose to add the organism argument to the registered_UCSC_genomes() function. Like for the registered_NCBI_genomes() function, the user will be able to use this argument to restrict the results returned by the function to the specified organism.

More precisely:

The user will be able to pass a single string to organism.
The supplied string will be used to perform a search (with grep()) on the "organism" column of the data frame returned by registered_UCSC_assemblies(). The search should be case-insensitive.
Like for the registered_NCBI_genomes() function, the organism argument should be set to NA by default.

The new argument should be documented in GenomeInfoDb/man/getChromInfoFromUCSC.Rd.

It is encouraged to look at the code of the registered_NCBI_genomes() function where this feature is already implemented.

kakopo commented 2 years ago

Hello @hpages could I please work on this issue?

hpages commented 2 years ago

Sure. I'm sorry that I didn't answer your questions here yet. I got a little distracted by other commitments. I'll try to get back to this ASAP.

kakopo commented 2 years ago

Its no problem at all.

kakopo commented 2 years ago

Hi @hpages. I have been attempting to carry out this task while heavily taking reference from registered_NCBI_genomes() within the getChromInfoFromNCBI.R script. Carrying out brute force testing has not yielded any results. I have only been able to modify the function to add registered_UCSC_genomes <- function(organism=NA) which returns the same results as registered_UCSC_genomes <- function(). I would really appreciate an explanation of how registered_NCBI_genomes() works, such that it is able to return specific organisms. Thank you!

hpages commented 2 years ago

Hi @kakopo,

First you want to spend some time in the code of the registered_NCBI_assemblies() function, and try to understand what the code does with the organism argument. The function is implemented here.

To do this, we need to go thru the body of the function, line by line, and find the places where organism is used. We see that it's only used here (lines 163-164):

    if (!isSingleStringOrNA(organism))
        stop(wmsg("'organism' must be a single string or NA"))

and here (lines 192-195):

    if (!is.na(organism)) {
        keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
        DF <- DF[keep_idx, , drop=FALSE]
    }

That's all.

What are those lines doing?

The first block (lines 163-164) only checks that the value supplied by the user is a single string or NA (NA means no filtering). If it's not, then the function returns an error.

The second block (lines 192-195) is an if statement. The if clause in this statement (!is.na(organism)) means that the 2 lines inside the if block will be executed only if organism is not NA, that is, only if the user effectively passed a non-NA value to the organism argument (remember that the user does this with the intent to restrict the results returned by the function to the specified organism).

The 2 lines inside the if block are:

        keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
        DF <- DF[keep_idx, , drop=FALSE]

Given the purpose of the organism argument, we expect that these 2 lines are actually going to take care of restricting the results to the specified organism.

How do they do that? Well first of all, we see that they operate on a variable called DF. So we need to know a little bit more about this variable. We see that DF was created earlier in the function (at line 189) with:

    DF <- S4Vectors:::new_DataFrame(listData, nrows=length(assemblies))

Let's do some guess work here. Based on the name of the function that is called (new_DataFrame), it sounds like the function is probably returning an object of class DataFrame. We can do ?DataFrame to learn more about those objects. Do we need to know everything about those objects? Of course not. All we need to know is that they behave like ordinary data.frame objects in R. Also we note that the last line (line 196) in the body of the registered_NCBI_assemblies function is:

    as.data.frame(DF)

This tells us that a DataFrame object can simply be converted back to an ordinary data frame with as.data.frame(). Also, because this is the last line in the registered_NCBI_assemblies function, this is what the function will return. We've already used the function quite a bit and we already knew that it's supposed to return an ordinary data frame. But it's always satisfying to see how/where this happens in the code. Also it reassures us that we're not completely off with our interpretation of what the code is doing.

Ok so we know enough about DF to understand what the if block does on it:

First it calls grep() on organism and DF$organism. Note that the latter is the organism column of DF. So the former is a single string (the one supplied by the user), and the latter is a character vector (the organism column of the data frame to be returned to the user). The value returned by the call to grep() is stored in keep_idx. According to the man page for grep (see ?grep), the function will try to match its 1st argument (the pattern) against the elements of its 2nd argument (a character vector). And it will return an integer vector containing the indices of the elements in its 2nd argument where a match was found. This means that keep_idx will be an integer vector made of valid indices w.r.t. the DF$organism vector. And from this, we can infer that the indices in keep_idx are also valid row indices w.r.t. the DF data-frame-like object. In other words, we know that keep_idx can be used to subset DF by row.
Then it does DF <- DF[keep_idx, , drop=FALSE]. Do you recognize what this does? (Remember that for most operations, DataFrame objects behave like ordinary data.frame objects.)

Hopefully this will help you better understand how registered_NCBI_assemblies() handles the organism argument.

The registered_UCSC_genomes() function needs to do something very similar, that is:

Check that the organism value supplied by the user is a single string or NA, and return an error if it's not.
If organism is not NA, subset the data.frame that the function will return to the user. Like in registered_NCBI_assemblies(), this should happen towards the end, just before returning the data.frame to the user. Note that, for some reason, the DataFrame object created in the body of the registered_UCSC_genomes() function is named ans, which doesn't help. Feel free to rename it DF, like in the registered_NCBI_assemblies() function. This will help make the code a little bit clearer.

Let me know if you have further questions.

kakopo commented 2 years ago

Hi @hpages. Thank you for the explaining how registered_NCBI_assemblies() works, it was very helpful and much simpler than how I had previously approached it. I managed to edit the registered_UCSC_genomes(). The package builds and checks well, but the function itself returns this error:

> registered_UCSC_genomes(organism = "Felis catus")
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': subscript contains out-of-bounds indices

This is the current code for the second part of the function

    listData <- lapply(setNames(seq_along(colnames), colnames), make_col)
    DF <- S4Vectors:::new_DataFrame(listData, nrows=length(assemblies))
    genome_trailing_digits <- sub("(.*[^0-9])([0-9]*)$", "\\2", DF$genome)
    oo <- order(DF$organism, as.integer(genome_trailing_digits))
    if (!is.na(organism)) {
      keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
      DF <- DF[keep_idx, , drop=FALSE]
    }
    as.data.frame(DF)
}

I am unsure where exactly the code goes wrong, and the error itself is coming from.

hpages commented 2 years ago

Great, that's good progress!

The registered_UCSC_genomes() function is implemented here.

I can't see exactly what problem in your code could cause this error, I would need to investigate more. But I do see a problem: in your modified version of the function, oo is still computed but is no longer used. That can't be right! I don't know if that is what's causing the error you get though.

More on oo:

The last 2 lines in the body of the function (lines 620,621) are:

    oo <- order(ans$organism, as.integer(genome_trailing_digits))
    as.data.frame(ans[oo, , drop=FALSE])

The first line computes oo, whatever that is (we don't really need to know more), and the second line uses it. More precisely the second line performs 2 operations, in this order:

ans[oo, , drop=FALSE]
Call as.data.frame() on the result produced by 1.

For more clarity, the same code could have been written as follow (one operation per line), at the cost of making it a little bit longer. I'm also renaming ans -> DF, as you did, as this also helps comparing the code with getChromInfoFromNCBI's code:

    oo <- order(ans$organism, as.integer(genome_trailing_digits))
    DF <- DF[oo, , drop=FALSE]
    as.data.frame(DF)

I'm actually a big fan of the "one operation per line" approach.

Doesn't it start to look a lot more like what we see at the end of the getChromInfoFromNCBI() function? Of course it's not exactly the same, but the last line is exactly the same.

Make sure to preserve the DF[oo, , drop=FALSE] operation, and to add your if statement after that line. Remember that the filtering of the results based on the user-supplied organism should happen on the data-frame-like object that is about to be returned. In other words, your if statement should be the last transformation that is performed on DF, right before it goes thru as.data.frame().

Let me know how that goes.

kakopo commented 2 years ago

That awesome feeling when everything runs with no errors! Thank you! I did everything you recommended and it ran! I guess I just didn't understand well enough, and thought as.data.frame(ans[oo, , drop=FALSE]) was the same as as.data.frame(). I edited the code as well, to follow

For more clarity, the same code could have been written as follow (one operation per line), at the cost of making it a little bit longer. I'm also renaming ans -> DF, as you did, as this also helps comparing the code with getChromInfoFromNCBI's code:
oo <- order(ans$organism, as.integer(genome_trailing_digits))
DF <- DF[oo, , drop=FALSE]
as.data.frame(DF)
I'm actually a big fan of the "one operation per line" approach

And yes, I agree, the "one operation per line" approach is much better. Thank you once more! I am going to submit a PR. I hope all shall be satisfactory.

hpages commented 2 years ago

Perfect.

That awesome feeling when everything runs with no errors!

I know right? :smiley:

See #77 for some feedback on your PR. Thanks

hpages commented 2 years ago

Hi @kakopo, I just merged your PR (#77). We can close this. Good job!

Bioconductor / GenomeInfoDb

Add 'organism' argument to registered_UCSC_genomes() #72