add na.rm argument to makeGRangesFromDataFrame

LiNk-NY commented 3 years ago

Hi Hervé,

I've added this option to makeGRangesFromDataFrame to make it a bit more convenient to use. It is FALSE by default.

I'd be interested in your feedback.

Best, Marcel

hpages commented 3 years ago

Nice.

Please use parenthesis here for readability:

not_missing <- !is.na(ans_start) | is.na(ans_end)

I think that the check for NAs should be performed even when na.rm=FALSE, and, if NAs are present, makeGRangesFromDataFrame() should fail early with an error message that suggests the use of na.rm=TRUE. This would produce a nicer and more useful error message than what we get at the moment (which comes from the IRanges() constructor).

Thanks!

LiNk-NY commented 3 years ago

Hi Hervé, I've updated the code based on your feedback. Thanks again!

hpages commented 3 years ago

Thanks Marcel.

A few more things:

Please check the na.rm argument with the standard:
```
if (!isTRUEorFALSE(na.rm))
    stop(wmsg("'na.rm' must be TRUE or FALSE"))
```
There's a general effort in the S4Vectors stack to not let any argument loose.

About the error message: with your patch start.field and end.field now can contain NAs so maybe it's more accurate to say something like:

stop(wmsg("'start.field' and/or 'end.field' contain NAs. ",
          "Use 'na.rm=TRUE' to ignore the rows ",
          "that contain these NAs."))

Also I think it might be slightly more useful to report the actual names of the start.field and end.field columns in case people want to check their data.frame or input file. At least it's consistent with how internal helper .get_data_frame_col_as_numeric() reports problems in these columns:

start_colname <- names(df)[[granges_cols[["start"]]]]
end_colname <- names(df)[[granges_cols[["end"]]]]
stop(wmsg("The \"", start_colname, "\" and/or \"", end_colname, "\" ",
          "columns contain NAs. Use 'na.rm=TRUE' to ignore the rows ",
          "that contain these NAs."))

2 performance-related things:
- (1) No need to subset anything if there are no NAs (even if this subsetting is a no-op, it can have a singificant cost), so:
```
not_missing <- !(is.na(ans_start) | is.na(ans_end))
if (any(!not_missing)) {
if (!na.rm) {
    ERROR
}
SUBSETTING
}
```
- (2) AFAIK subsetting is always more efficient when the subscript is a vector of positive integers:
```
ans_start <- sample(c(1:50, NA), 2e7, replace=TRUE)
ans_end   <- ans_start + 99L
df <- data.frame(start=ans_start, end=ans_end, score=runif(2e7))
not_missing <- !(is.na(ans_start) | is.na(ans_end))
```
system.time({ df1 <- df[not_missing, , drop=FALSE] ans_start1 <- ans_start[not_missing] ans_end1 <- ans_end[not_missing] })

user system elapsed

2.180 0.620 2.802

system.time({ keep_idx <- which(not_missing) df2 <- df[keep_idx, , drop=FALSE] ans_start2 <- ans_start[keep_idx] ans_end2 <- ans_end[keep_idx] })

user system elapsed

1.784 0.332 2.116
```
Note that converting the logical subscript into a vector of positive integers is only worth it if the subscript is going to be used more than once, which is the case here. The speedup is not that great in this particular case, but it becomes more interesting if we replace `df[keep_idx, , drop=FALSE]` with `S4Vectors:::extract_data_frame_rows(df, keep_idx)`:
```
system.time({ keep_idx <- which(not_missing) df3 <- S4Vectors:::extract_data_frame_rows(df, keep_idx) ans_start3 <- ans_start[keep_idx] ans_end3 <- ans_end[keep_idx] })

user system elapsed

0.889 0.345 1.233

LiNk-NY commented 2 years ago

Sorry this took me so long. I hadn't noticed the reply. Thanks for reviewing! Best, Marcel

hpages commented 2 years ago

Thanks Marcel!

Bioconductor / GenomicRanges

add na.rm argument to makeGRangesFromDataFrame #59

user system elapsed

2.180 0.620 2.802

user system elapsed

1.784 0.332 2.116

user system elapsed

0.889 0.345 1.233