Change when precursor information gets returned?

bpbond commented 7 years ago

Currently the driver can query data system chunks for their inputs and outputs, and from the returned information, build pretty dependency graphs and figure out general dependency information--i.e., the chunk call sequence.

This dependency information is however chunk-level, i.e. coarse. To get object dependencies, we have to wait until chunks run and return this information (via the add_precursor mechanism). Then dstrace can then run cool object-level traces...but only after we run the entire data system and save all its output.

It seems more sensible to have the precursor information returned as part of the driver's DECLARE_OUTPUTS message to chunks. This would be instantaneous and obviate the need to run the data system to trace data objects. For example, instead of

  } else if(command == driver.DECLARE_OUTPUTS) {
    return(c("L110.For_ALL_bm3_R_Y"))
  } else if(command == driver.MAKE) {
 ...
      add_precursors("common/iso_GCAM_regID",
                     "L100.FAO_For_Prod_m3",
                     "L100.FAO_For_Imp_m3",
                     "L100.FAO_For_Exp_m3") %>%

the code would be

  } else if(command == driver.DECLARE_OUTPUTS) {
    return(list("L110.For_ALL_bm3_R_Y" = c("common/iso_GCAM_regID",
                     "L100.FAO_For_Prod_m3",
                     "L100.FAO_For_Imp_m3",
                     "L100.FAO_For_Exp_m3")))
  } else if(command == driver.MAKE) {

This would involve changing the driver and all the already-written chunks, obviously, but going forward wouldn't be any extra work for people. The same_precursors_as mechanism might be implemented by writing

  } else if(command == driver.DECLARE_OUTPUTS) {
    return(list("X" = c("1", "2", "3"),
                "Y" = "X")))
  } else if(command == driver.MAKE) {

Thoughts? @pralitp @rplzzz

pralitp commented 7 years ago

If only we thought of that before.. although it would be pretty easy to write a script to do it. Maybe we wait until version one and then just convert them all then?

Also I kind of like keeping the add_precursors and same_precursors_as as basically tags. It doesn't add anything except in my opinion more make more clear what is going on:

  } else if(command == driver.DECLARE_OUTPUTS) {
    return(list("X" %>% add_precursors("1", "2", "3"),
                "Y" %>% same_precursors_as("X"))))
  } else if(command == driver.MAKE) {

And then add_precursors could be as simple as the following (or if we wanted to return some "data structure" to accommodate more features we could do that too).

# actually I'm not sure if the is valid R but something like:
add_precursors <- function(key, ...) {
  return(key = ...)
}

One thing is right now we are tagging XML outputs with return (c( XML = "modeltime.xml")), this will need to change?

Last thought. Do have have a test to check the precursors are actually right? This would be a very long running test and so maybe we would only run it every once in a while but could be implemented by:

Running all chunks
For each chunk perturb each table in it's declare.INPUTS
Check to make sure that only the declare.OUPUTS tables changed which had declared the input as a precursor and the others did not.

rplzzz commented 7 years ago

If two objects share the same precursors, then surely the right way to do it is to say

} else if(command == driver.DECLARE_OUTPUTS) {
    pre <- c('1', '2', '3')
    return(list("X" = pre,
                "Y" = pre)))
}

That saves the rest of the system having to figure out how to parse the indirect reference.

It seems a little weird to me that the output from a declare outputs step will now be, not a list of the outputs, but a list where the outputs are the names attribute of the list, but maybe it's not a huge deal. I guess we ought to wait and see how much of a burden it eventually winds up being, having to run the data system to get this information.

Pralit's idea for validating the precursors is interesting. I would make it a utility rather than a test, and I'd make it run just on user-selected tables, rather than the whole system. The main use I see for it are that in code review for a chunk, a reviewer can check that the precursors have been set correctly. You don't need to check every table in the whole system for that.

bpbond commented 7 years ago

I absolutely agree that right now the declare outputs step is clean and simple, and am not crazy to muck it up.

@pralitp Wow, that's a cool idea about validating precursors. Wow. Agreed, a utility, not a test. Going to open a separate issue for this.

Waiting until after version 1 makes sense, agreed. OK, let's put this on the back burner for now, and return to it later. Thanks to you both.

bpbond commented 6 years ago

Closing, as this has largely been addressed using the internal-data approach of #751 .

JGCRI / gcamdata

Change when precursor information gets returned? #628