End-to-end-provenance / RDataTracker

An R library to collect provenance from R scripts.
http://end-to-end-provenance.github.io/
GNU General Public License v3.0

Data edges for variables created inside a for loop #680

Closed blernermhc closed 1 year ago

blernermhc commented 1 year ago
pet2.type <- c("cat", "cat", "dog", "cat", "dog")
pet2.name <- c("Sterling", "Smuggles", "Snickers", "Katama", "Bode")
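# stringsAsFactors = TRUE is needed in R >= 4.0, where strings are no longer
# converted to factors by default and levels() would return NULL
df3 <- data.frame(pet2.name, pet2.type, stringsAsFactors = TRUE)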
for (pet2.level in levels(df3$pet2.type)) {
  level <- pet2.level
  number <- nrow(df3[df3$pet2.type==level,])
  print(paste("There are ", number, level))
}

level and number are both set and used inside the for loop. They are not set before the loop starts. In rdtLite, where the for loop is treated as a single statement, these should appear as outputs from the for loop. Instead, they appear as inputs.

blernermhc commented 1 year ago

Ideally we would look at the statements inside a control construct to see whether a variable is set before its first use. In that case, it should not appear as an input. But we cannot know that for sure until runtime, and since we do not collect provenance inside control constructs, we can't do that.

Instead, when we create the node, we could check whether the variable has a value before the statement is executed. If it doesn't, then either it will be set inside the construct, or R will produce an error. One caveat: the variable might have a value from a previous execution, which will get captured in the provenance even though that value does not get used.
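
A rough sketch of that check, for illustration only. `candidate_inputs` is a hypothetical helper, not a function in rdtLite or RDataTracker, and the loop is evaluated in a scratch environment rather than through the real provenance collector. It assumes the statement is available as an unevaluated expression and uses base R's `all.vars()` and `exists()`:

```r
# Hypothetical helper (not part of rdtLite): return the variables mentioned
# in an unevaluated statement that already have a value in `env`.
# Only these would be candidates for input edges.
candidate_inputs <- function(stmt, env = parent.frame()) {
  vars <- all.vars(stmt)  # every variable name appearing in the statement
  # keep only the names that are bound in env before the statement runs
  Filter(function(v) exists(v, envir = env, inherits = FALSE), vars)
}

# Scratch environment standing in for the script's workspace
env <- new.env()
env$df3 <- data.frame(
  pet2.name = c("Sterling", "Smuggles", "Snickers", "Katama", "Bode"),
  pet2.type = factor(c("cat", "cat", "dog", "cat", "dog"))
)

loop <- quote(
  for (pet2.level in levels(df3$pet2.type)) {
    level <- pet2.level
    number <- nrow(df3[df3$pet2.type == level, ])
  }
)

# First execution: level and number have no value yet, so only df3 qualifies
before <- candidate_inputs(loop, env)

eval(loop, env)

# Second execution: level and number now hold stale values from the first
# run, so this check alone would report them as inputs
after <- candidate_inputs(loop, env)
```

Here `before` contains only `df3`, while `after` also picks up `pet2.level`, `level`, and `number`, which is the previous-execution caveat described above.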

level and number are correctly showing up as outputs.