Closed hpages closed 2 years ago
Hello @hpages could I please work on this issue?
Sure. I'm sorry that I didn't answer your questions here yet. I got a little distracted by other commitments. I'll try to get back to this ASAP.
Its no problem at all.
Hi @hpages. I have been attempting to carry out this task while heavily taking reference from registered_NCBI_genomes()
within the getChromInfoFromNCBI.R
script. Carrying out brute force testing has not yielded any results. I have only been able to modify the function to add registered_UCSC_genomes <- function(organism=NA)
which returns the same results as registered_UCSC_genomes <- function()
. I would really appreciate an explanation of how registered_NCBI_genomes()
works, such that it is able to return specific organisms. Thank you!
Hi @kakopo,
First you want to spend some time in the code of the registered_NCBI_assemblies()
function, and try to understand what the code does with the organism
argument. The function is implemented here.
To do this, we need to go thru the body of the function, line by line, and find the places where organism
is used. We see that it's only used here (lines 163-164):
if (!isSingleStringOrNA(organism))
stop(wmsg("'organism' must be a single string or NA"))
and here (lines 192-195):
if (!is.na(organism)) {
keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
DF <- DF[keep_idx, , drop=FALSE]
}
That's all.
What are those lines doing?
The first block (lines 163-164) only checks that the value supplied by the user is a single string or NA
(NA
means no filtering). If it's not, then the function returns an error.
The second block (lines 192-195) is an if
statement. The if
clause in this statement (!is.na(organism)
) means that the 2 lines inside the if
block will be executed only if organism
is not NA
, that is, only if the user effectively passed a non-NA
value to the organism
argument (remember that the user does this with the intent to restrict the results returned by the function to the specified organism).
The 2 lines inside the if
block are:
keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
DF <- DF[keep_idx, , drop=FALSE]
Given the purpose of the organism
argument, we expect that these 2 lines are actually going to take care of restricting the results to the specified organism.
How do they do that? Well first of all, we see that they operate on a variable called DF
. So we need to know a little bit more about this variable. We see that DF
was created earlier in the function (at line 189) with:
DF <- S4Vectors:::new_DataFrame(listData, nrows=length(assemblies))
Let's do some guess work here. Based on the name of the function that is called (new_DataFrame
), it sounds like the function is probably returning an object of class DataFrame. We can do ?DataFrame
to learn more about those objects. Do we need to know everything about those objects? Of course not. All we need to know is that they behave like ordinary data.frame objects in R. Also we note that the last line (line 196) in the body of the registered_NCBI_assemblies
function is:
as.data.frame(DF)
This tells us that a DataFrame object can simply be converted back to an ordinary data frame with as.data.frame()
. Also, because this is the last line in the registered_NCBI_assemblies
function, this is what the function will return. We've already used the function quite a bit and we already knew that it's supposed to return an ordinary data frame. But it's always satisfying to see how/where this happens in the code. Also it reassures us that we're not completely off with our interpretation of what the code is doing.
Ok so we know enough about DF
to understand what the if
block does on it:
grep()
on organism
and DF$organism
. Note that the latter is the organism
column of DF
. So the former is a single string (the one supplied by the user), and the latter is a character vector (the organism
column of the data frame to be returned to the user). The value returned by the call to grep()
is stored in keep_idx
. According to the man page for grep
(see ?grep
), the function will try to match its 1st argument (the pattern) against the elements of its 2nd argument (a character vector). And it will return an integer vector containing the indices of the elements in its 2nd argument where a match was found. This means that keep_idx
will be an integer vector made of valid indices w.r.t. the DF$organism
vector. And from this, we can infer that the indices in keep_idx
are also valid row indices w.r.t. the DF
data-frame-like object. In other words, we know that keep_idx
can be used to subset DF
by row. DF <- DF[keep_idx, , drop=FALSE]
. Do you recognize what this does? (Remember that for most operations, DataFrame objects behave like ordinary data.frame objects.)Hopefully this will help you better understand how registered_NCBI_assemblies()
handles the organism
argument.
The registered_UCSC_genomes()
function needs to do something very similar, that is:
organism
value supplied by the user is a single string or NA
, and return an error if it's not.organism
is not NA
, subset the data.frame that the function will return to the user. Like in registered_NCBI_assemblies()
, this should happen towards the end, just before returning the data.frame to the user. Note that, for some reason, the DataFrame object created in the body of the registered_UCSC_genomes()
function is named ans
, which doesn't help. Feel free to rename it DF
, like in the registered_NCBI_assemblies()
function. This will help make the code a little bit clearer.Let me know if you have further questions.
Hi @hpages. Thank you for the explaining how registered_NCBI_assemblies()
works, it was very helpful and much simpler than how I had previously approached it. I managed to edit the registered_UCSC_genomes()
. The package builds and checks well, but the function itself returns this error:
> registered_UCSC_genomes(organism = "Felis catus")
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': subscript contains out-of-bounds indices
This is the current code for the second part of the function
listData <- lapply(setNames(seq_along(colnames), colnames), make_col)
DF <- S4Vectors:::new_DataFrame(listData, nrows=length(assemblies))
genome_trailing_digits <- sub("(.*[^0-9])([0-9]*)$", "\\2", DF$genome)
oo <- order(DF$organism, as.integer(genome_trailing_digits))
if (!is.na(organism)) {
keep_idx <- grep(organism, DF$organism, ignore.case=TRUE)
DF <- DF[keep_idx, , drop=FALSE]
}
as.data.frame(DF)
}
I am unsure where exactly the code goes wrong, and the error itself is coming from.
Great, that's good progress!
The registered_UCSC_genomes()
function is implemented here.
I can't see exactly what problem in your code could cause this error, I would need to investigate more. But I do see a problem: in your modified version of the function, oo
is still computed but is no longer used. That can't be right! I don't know if that is what's causing the error you get though.
More on oo
:
The last 2 lines in the body of the function (lines 620,621) are:
oo <- order(ans$organism, as.integer(genome_trailing_digits))
as.data.frame(ans[oo, , drop=FALSE])
The first line computes oo
, whatever that is (we don't really need to know more), and the second line uses it. More precisely the second line performs 2 operations, in this order:
ans[oo, , drop=FALSE]
as.data.frame()
on the result produced by 1.For more clarity, the same code could have been written as follow (one operation per line), at the cost of making it a little bit longer. I'm also renaming ans
-> DF
, as you did, as this also helps comparing the code with getChromInfoFromNCBI
's code:
oo <- order(ans$organism, as.integer(genome_trailing_digits))
DF <- DF[oo, , drop=FALSE]
as.data.frame(DF)
I'm actually a big fan of the "one operation per line" approach.
Doesn't it start to look a lot more like what we see at the end of the getChromInfoFromNCBI()
function? Of course it's not exactly the same, but the last line is exactly the same.
Make sure to preserve the DF[oo, , drop=FALSE]
operation, and to add your if
statement after that line. Remember that the filtering of the results based on the user-supplied organism
should happen on the data-frame-like object that is about to be returned. In other words, your if
statement should be the last transformation that is performed on DF
, right before it goes thru as.data.frame()
.
Let me know how that goes.
That awesome feeling when everything runs with no errors! Thank you! I did everything you recommended and it ran! I guess I just didn't understand well enough, and thought as.data.frame(ans[oo, , drop=FALSE])
was the same as as.data.frame()
.
I edited the code as well, to follow
For more clarity, the same code could have been written as follow (one operation per line), at the cost of making it a little bit longer. I'm also renaming ans -> DF, as you did, as this also helps comparing the code with getChromInfoFromNCBI's code:
oo <- order(ans$organism, as.integer(genome_trailing_digits)) DF <- DF[oo, , drop=FALSE] as.data.frame(DF)
I'm actually a big fan of the "one operation per line" approach
And yes, I agree, the "one operation per line" approach is much better. Thank you once more! I am going to submit a PR. I hope all shall be satisfactory.
Perfect.
That awesome feeling when everything runs with no errors!
I know right? :smiley:
See #77 for some feedback on your PR. Thanks
Hi @kakopo, I just merged your PR (#77). We can close this. Good job!
We propose to add the
organism
argument to theregistered_UCSC_genomes()
function. Like for theregistered_NCBI_genomes()
function, the user will be able to use this argument to restrict the results returned by the function to the specified organism.More precisely:
organism
.grep()
) on the"organism"
column of the data frame returned byregistered_UCSC_assemblies()
. The search should be case-insensitive.registered_NCBI_genomes()
function, theorganism
argument should be set toNA
by default.The new argument should be documented in
GenomeInfoDb/man/getChromInfoFromUCSC.Rd
.It is encouraged to look at the code of the
registered_NCBI_genomes()
function where this feature is already implemented.