Closed LiNk-NY closed 2 years ago
Nice.
Please use parenthesis here for readability:
not_missing <- !is.na(ans_start) | is.na(ans_end)
I think that the check for NAs should be performed even when na.rm=FALSE
, and, if NAs are present, makeGRangesFromDataFrame()
should fail early with an error message that suggests the use of na.rm=TRUE
. This would produce a nicer and more useful error message than what we get at the moment (which comes from the IRanges()
constructor).
Thanks!
Hi Hervé, I've updated the code based on your feedback. Thanks again!
Thanks Marcel.
A few more things:
Please check the na.rm
argument with the standard:
if (!isTRUEorFALSE(na.rm))
stop(wmsg("'na.rm' must be TRUE or FALSE"))
There's a general effort in the S4Vectors stack to not let any argument loose.
About the error message: with your patch start.field
and end.field
now can contain NAs so maybe it's more accurate to say something like:
stop(wmsg("'start.field' and/or 'end.field' contain NAs. ",
"Use 'na.rm=TRUE' to ignore the rows ",
"that contain these NAs."))
Also I think it might be slightly more useful to report the actual names of the start.field
and end.field
columns in case people want to check their data.frame or input file. At least it's consistent with how internal helper .get_data_frame_col_as_numeric()
reports problems in these columns:
start_colname <- names(df)[[granges_cols[["start"]]]]
end_colname <- names(df)[[granges_cols[["end"]]]]
stop(wmsg("The \"", start_colname, "\" and/or \"", end_colname, "\" ",
"columns contain NAs. Use 'na.rm=TRUE' to ignore the rows ",
"that contain these NAs."))
2 performance-related things:
not_missing <- !(is.na(ans_start) | is.na(ans_end))
if (any(!not_missing)) {
if (!na.rm) {
ERROR
}
SUBSETTING
}
ans_start <- sample(c(1:50, NA), 2e7, replace=TRUE)
ans_end <- ans_start + 99L
df <- data.frame(start=ans_start, end=ans_end, score=runif(2e7))
not_missing <- !(is.na(ans_start) | is.na(ans_end))
system.time({ df1 <- df[not_missing, , drop=FALSE] ans_start1 <- ans_start[not_missing] ans_end1 <- ans_end[not_missing] })
system.time({ keep_idx <- which(not_missing) df2 <- df[keep_idx, , drop=FALSE] ans_start2 <- ans_start[keep_idx] ans_end2 <- ans_end[keep_idx] })
Note that converting the logical subscript into a vector of positive integers is only worth it if the subscript is going to be used more than once, which is the case here. The speedup is not that great in this particular case, but it becomes more interesting if we replace `df[keep_idx, , drop=FALSE]` with `S4Vectors:::extract_data_frame_rows(df, keep_idx)`:
system.time({ keep_idx <- which(not_missing) df3 <- S4Vectors:::extract_data_frame_rows(df, keep_idx) ans_start3 <- ans_start[keep_idx] ans_end3 <- ans_end[keep_idx] })
Sorry this took me so long. I hadn't noticed the reply. Thanks for reviewing! Best, Marcel
Thanks Marcel!
Hi Hervé,
I've added this option to
makeGRangesFromDataFrame
to make it a bit more convenient to use. It isFALSE
by default.I'd be interested in your feedback.
Best, Marcel