eddelbuettel / rcppsimdjson

Rcpp Bindings for the 'simdjson' Header Library
115 stars 13 forks source link

sanitize path from ? upsetting file ops on windows (closes #66) #67

Closed eddelbuettel closed 3 years ago

eddelbuettel commented 3 years ago

We have already seen (in #66) that this is a minimally complete fix.

But while we're at it, should we substitute more? Can you think of another symbol or two?

codecov[bot] commented 3 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@c71e386). Click here to learn what that means. The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #67   +/-   ##
=========================================
  Coverage          ?   99.25%           
=========================================
  Files             ?       18           
  Lines             ?     1339           
  Branches          ?        0           
=========================================
  Hits              ?     1329           
  Misses            ?       10           
  Partials          ?        0           
Impacted Files Coverage Δ
R/validateJSON.R 100.00% <ø> (ø)
inst/include/RcppSimdJson/deserialize.hpp 99.00% <ø> (ø)
...nst/include/RcppSimdJson/deserialize/dataframe.hpp 100.00% <ø> (ø)
inst/include/RcppSimdJson/deserialize/vector.hpp 100.00% <ø> (ø)
inst/include/RcppSimdJson/utils.hpp 100.00% <ø> (ø)
R/utils.R 81.08% <75.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c71e386...09c2396. Read the comment docs.

knapply commented 3 years ago

This works, but there's a bit to unpack here.

It seems (some of) the options are:

.path_sanitize <- function(file_path, replacement = "") {
    stopifnot(is.character(file_path) && !anyNA(file_path))
    case_sensitivity <- c(
        "[/\\?<>\\:*|\":]" = FALSE,
        "[[:cntrl:]]" = FALSE,
        "^[.]+$" = FALSE,
        "^(con|prn|aux|nul|com[0-9]|lpt[0-9])([.].*)?$" = TRUE,
        "[. ]+$" = FALSE
    )
    target_patterns <- names(case_sensitivity)
    for (i in seq_along(target_patterns)) {
        file_path <- gsub(pattern = target_patterns[[i]], 
                          replacement = replacement, 
                          x = file_path, 
                          ignore.case = case_sensitivity[[i]])
    }
    file_path <- substr(file_path, 1L, 255L)
    if (replacement == "") {
        file_path
    } else {
        path_sanitize(file_path, "")
    }
}

Unfortunately, C++17's <filesystem> seems to require some linking hackery for GCC<8, so just relying on that probably isn't realistic.

But there's a disconnect between this use case and fload()'s original goals (which obviously drove its design). fload() being able to load files from a URL was really only meant to make downloading files convenient, but the real issue here is that this isn't downloading a file (there's no real file name). Obviously use patterns expanded since we started (I even did an API query in the examples), but I feel like we should avoid reinventing the wheel internally when we really just wanted a wicked fast JSON parser.

eddelbuettel commented 3 years ago

Yes/no/maybe. One could also argue that the real issue may be constructing a file name by trusting a string we're given, and we now found that in some cases this can in fact bite us. So would a fuller sanitization be to just call tempfile() and simply not feed it a pattern which may contain indestigable components?

knapply commented 3 years ago

That should work, the only issue I can think of is regarding this current behavior:

library(RcppSimdJson)

a_url <- "https://api.github.com/users/lemire"
multiple_urls <- c(
    a_url,
    "https://api.github.com/users/eddelbuettel",
    "https://api.github.com/users/knapply",
    "https://api.github.com/users/dcooley"
)
fload(multiple_urls, query = "/name")
#> $lemire
#> [1] "Daniel Lemire"
#> 
#> $eddelbuettel
#> [1] "Dirk Eddelbuettel"
#> 
#> $knapply
#> [1] "Brendan Knapp"
#> 
#> $dcooley
#> [1] "Dave"

Right now, the names() use the "file name". I suppose a way around that would be to check file names for illegal characters and just replace any strings that contain them with tempfile() (which is at least simpler to maintain internally). We could also just ditch this behavior.

eddelbuettel commented 3 years ago

Could we in that case not set names? (I understand that I am pushing your spec a little with my (ab)use from package td. I just checked and td is equally happy to use format=csv but as, always, "that's why we can't have nice things" this then prohibits batch use over multiple symbols. Gaa.)

The name settings also had a minor hoopla here in that we set the key directly in the URL as a 'transient' sacred payload and then we ... promptly get it printed back as the tempfile name. Whoops. It doesn't really spill but there are other cases where the 'URL defines filename' use has minor warts.

(And we may also decide that there is no clear winner either way and just keep the one-off filter I added. We can always expand as needed. Keeping it simple(r) ? )

eddelbuettel commented 3 years ago

Will merge this now and plan to ship to CRAN 'soon' (say, tomorrow?) to plug the hole we have on Windows. That is of moderate urgency, A better more thorough sanitizer is a good idea but we don't have to rush it.