Open maurolepore opened 3 years ago
Yeah, these need to get cleaned up eventually, but I'd hesitate believing that there's one-size-fits-all solution for every instance of this. Let's approach each variation specifically. Maybe we can identify them here as we find them?
Okay. I think whether or not we review all if()
statements or tackle them as we bump into them is a strategic decision about how we tackle technical debt and therefore a decision that @AlexAxthelm will likely prefer to make. I would then close this issue because it'll be sitting here for a while and we already know there is a lot of technical debt.
Before I close I paste a snippet of a global search for "if(" in .R files. Just here I see four instances of ==
inside an if()
statement. Compared to identical()
, using ==
inside an if()
statement is suspicious because it can return objects of length > 1. Therefore, all such if()
statements are good candidates for review.
Personally, I don't see an issue like this --with a reasonable goal, well thought options for how to fix them, and some preliminary investigation of where the fixes need to happen-- as tech debt, though I admittedly don't know nor really care so much what the precise definition of that is, so I could be wrong, but in any case, I would leave this issue open.
I'm also not opposed to fixing these all at once, only opposed to blindly applying the same "fix" to all of them... as you've already pointed out, in some cases all()
or any()
might be more appropriate than identical()
or vector[[1]] == "something"
... it might not be obvious which is most appropriate in each case and require deeper investigation.
1: I think this qualifies as "tech debt" in the sense that the system (mostly) works now, but this could cause us problems in the future.
2: I agree that resolving this issue is going to be handled on a case-by-case basis. Maybe we can make a coordinated effort to tackle this, but If they haven't been causing (many) serious issues yet, I think that addressing these whenever we're touching those files anyway seems like a reasonable approach.
@AlexAxthelm does that mean that this issue thread being open is tech debt, or that the problem that some if
statements are not well defined is tech debt?
I view the if
statements as tech debt. It opens up an avenue for errors, but it seems like #436 is the only one that has popped up as an actual problem so far.
I dug into this today a bit. Just want to note here that in some places, especially when there's a default value, I actually prefer the warning in the if
statement because it reveals something that should be fixed upstream, rather than obscuring it by circumventing the problem and passing on garbage. For instance, in the code below, financial_timestamp
conceptually should never be more than a single character element. If somehow we make a change that causes financial_timestamp
to have more than one element under specific conditions, I would rather the below conditionals use the first element and throw a warning than simply pass garbage back and not leave any message.
example...
test <- function(financial_timestamp) {
if (financial_timestamp == "2017Q4") {
"31.12.2017"
} else if (financial_timestamp == "2018Q4") {
"31.12.2018"
} else if (financial_timestamp == "2019Q4") {
"31.12.2019"
} else {
financial_timestamp
}
}
test2 <- function(financial_timestamp) {
if (identical(financial_timestamp, "2017Q4")) {
"31.12.2017"
} else if (identical(financial_timestamp, "2018Q4")) {
"31.12.2018"
} else if (identical(financial_timestamp, "2019Q4")) {
"31.12.2019"
} else {
financial_timestamp
}
}
test(c("2018Q4", "2019Q4"))
#> [1] "31.12.2018"
#> Warning messages:
#> 1: In if (financial_timestamp == "2017Q4") { :
#> the condition has length > 1 and only the first element will be used
#> 2: In if (financial_timestamp == "2018Q4") { :
#> the condition has length > 1 and only the first element will be used
test2(c("2018Q4", "2019Q4"))
#> [1] "2018Q4" "2019Q4"
I think this is a good example of why it's important to document assumptions, even implicitly.
for example, I would say that either of those functions is fine, with the following changes:
# this code not tested, for illustrative purpose only.
validate_financial_timestamp <- function(financial_timestamp) {
# financial timestamp should be a single point in time, in form of YYYYQN,
# where YYYY is a 4 digit year, and N is a number 1-4 (inclusive)
stopifnot(length(financial_timestamp) == 1L) # could also use assert, if available
stopifnot(grepl(x = financial_timestamp, pattern = "^[:digit]{4}Q[1-4]$"))
return(invisible(financial_timestamp))
}
test <- function(financial_timestamp) {
validate_financial_timestamp(financial_timestamp)
if (financial_timestamp == "2017Q4") {
"31.12.2017"
} else if (financial_timestamp == "2018Q4") {
"31.12.2018"
} else if (financial_timestamp == "2019Q4") {
"31.12.2019"
} else {
financial_timestamp
}
}
basically, validation functions should be quick to run, and therefore, we're willing to use them more frequently (basically every time we ingest an object through a public function). The other major alternative to this would be to take a more OOP based approach, but that's a can of worms that I don't know if we want to go down.
436 is one example of a problem I've seen multiple times. Maybe we should review all calls to
if()
and ensure they are all fed objects of length 1?This is a likely source of bugs and the fix is fairly straight forward. Nicely, it will force us to review the logic of our programs and decide when we should wrap in
all()
,any()
, useidentical()
, extract the first element, or do something else.We could continue to fix them as we bump into them but likely it's more efficient to fix them all in one go, allowing us to fix a bunch of bugs with relatively little code and time.