consider making `guess_numerical_mark()` work with a vector rather than a file path

RMI-PACTA / pacta.portfolio.import

Functions that validate input “listed security” portfolios.

https://rmi-pacta.github.io/pacta.portfolio.import/

Other

2 stars 0 forks source link

consider making `guess_numerical_mark()` work with a vector rather than a file path #13

Open cjyetman opened 1 year ago

cjyetman commented 1 year ago

guess_numerical_mark() currently takes a filepath/s to a CSV and must read in the file and find the appropriate column/variable before achieving its primary function of guessing the numerical marks within a set of character strings that represent numbers. It could instead accept any character vector to process and leave the file read in and column selection to whatever calls it. The potential downside is that the calling function would then need to read in the data in a specific way (as a character vector with no manipulation) and find the column to be processed by guess_numerical_mark(), which may not be something that needs to happen otherwise.

Not 100% convinced this is the right thing to do.

related #7

jdhoffa commented 1 year ago

One option could be to refactor out a portion that works with a vector, and then wrap it in a function that also does the reading:

guess_numerical_mark()
read_and_guess_numberical_mark()

or something like that. Have your cake and eat it too.

cjyetman commented 8 months ago

Reviewing this, the existing paradigm in this repo is that the functions work on files. That makes me hesitant to change the functionality of guess_numerical_mark() to accept only a vector. That being said, conceptually this paradigm may not have been ideal, and I can see value in having very specific purpose utility functions that do very specific things on very specific inputs, possibly using a very specific naming scheme, and utilizing those utility functions in other higher level functions that combine functionality of the various utility functions.

For the time being, I will let this brew a while before deciding if this would be worth the effort.

edit: For clarification, I fully grasp that this could be achieved as @jdhoffa has suggested, I'm primarily concerned about the naming of the functions and what that would imply should be done to many of the other functions in this repo in order to retain consistency.

jdhoffa commented 8 months ago

For the record, I am absolutely in favor of careful consideration of the names and purposes of the functions!

The proposed solution was half an idea, but I think it's important to consider these things (especially names lol) as they are often trapdoor decisions that have long-term and potentially unexpected consequences (see the whole r2dii suite for example)

AlexAxthelm commented 8 months ago

I'm generally a fan of having a call stack along the lines of

foo <- function(filepaths) {
  # some checks on the filepaths
  vapply(
    x = filepaths,
    FUN = file_foo
  )
}

file_foo <- function(single_filepath) {
  stopifnot(length(single_filepath) == 1) # but with more informative message
  contents <- readLines(single_filepath, n = 1L)
  val <- string_foo(contents)
  return(val)
}

string_foo <- function(string) {
  grepl(pattern = "foo", x = string)
}

so in this case, the functions might be named guess_numerical_mark, file_guess_numerical_mark, and string_guess_numerical_mark (suffixing the input type would also work, i.e. guess_numerical_mark_string())

cjyetman commented 8 months ago

Yeah, totally agree (retrospectively). My only concern is how much effort it will be to apply that universally and consistently.

jdhoffa commented 8 months ago

Just some more thoughts on this:

I feel like whatever the outcome here, we should always lean towards fewer exported functions a package to limit maintenance burden
In that case, I would imagine that the "most useful" exported function would return a data-frame?
But internally, I think separating the function in this way would be clear, as you outline here, would make sense