arendsee / rmonad

Pipelines you can compute on

Tags not working correctly with branches #21

Open swhalemwo opened 1 year ago

swhalemwo commented 1 year ago

Just found this package, really fascinating stuff! I played with it a bit to see if I can integrate it into my scripts (I often have many steps, and having access to previous steps would be great), and there seems to be a bug when using tags together with branches. In the MWE I perform some operation, save a number derived from that operation in a branch, and then continue the pipeline.

library(dplyr)
library(rmonad)

m <- as_monad(as_tibble(iris)) %>% tag('input') %>>%
        dplyr::filter(Species == "setosa") %>% tag('filter1') %>^%
        nrow %>% tag("nrow") %>>%
        dplyr::filter(Sepal.Width > 3) %>% tag("filter2")

However, I haven't found a way to access the information I saved under the nrow tag: get_value(m, tag = "nrow") and view(m, "nrow") both return the dataset at the filter1 tag. My hunch is that it is related to what get_tag(m) returns, since the third node is an empty list:

[[1]]
[[1]][[1]]
[1] "input"

[[2]]
[[2]][[1]]
[1] "filter1"

[[2]][[2]]
[1] "nrow"

[[3]]
list()

[[4]]
[[4]][[1]]
[1] "filter2"

If I add another branch, I get even more empty lists:

m2 <- as_monad(as_tibble(iris)) %>% tag('input') %>>%
        dplyr::filter(Species == "setosa") %>% tag('filter1') %>^%
        nrow %>% tag("nrow") %>^%
        ncol %>% tag("ncol") %>>%
        dplyr::filter(Sepal.Width > 3) %>% tag("filter2")

get_tag(m2)

[[1]]
[[1]][[1]]
[1] "input"

[[2]]
[[2]][[1]]
[1] "filter1"

[[2]][[2]]
[1] "nrow"

[[2]][[3]]
[1] "ncol"

[[3]]
list()

[[4]]
list()

[[5]]
[[5]][[1]]
[1] "filter2"

Is it possible to fix this?

arendsee commented 1 year ago

@swhalemwo Thanks for the report! I haven't touched this package since grad school. I'm not sure I can fix it.

swhalemwo commented 1 year ago

Thanks for the info! After some digging I found that tagging branches actually works when the tagging happens inside the branch as well. In the previous example, tag("nrow") is actually applied to the "filter1" node, because after finishing the branch opened by %>^% the pipeline resets to the stage before the branch-off point.
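To make the reset behaviour concrete, here is a minimal toy model in base R. It is only a sketch and not rmonad's implementation: new_pipe, chain, branch and add_tag are hypothetical names, and the only assumption is that a tag always attaches to whichever node is currently the head.

```r
# Toy model only: NOT rmonad's implementation. A pipeline is a list of nodes
# plus a `head` index that says which node the next operation acts on.
new_pipe <- function(value) {
  list(nodes = list(list(value = value, tags = character(0))), head = 1L)
}

# Chain: append a node computed from the head's value and move the head to it.
chain <- function(p, f) {
  value <- f(p$nodes[[p$head]]$value)
  p$nodes[[length(p$nodes) + 1L]] <- list(value = value, tags = character(0))
  p$head <- length(p$nodes)
  p
}

# Branch: append a node computed from the head's value, then reset the head.
branch <- function(p, f) {
  old_head <- p$head
  p <- chain(p, f)
  p$head <- old_head
  p
}

# Tag: attach a label to whichever node is currently the head.
add_tag <- function(p, label) {
  p$nodes[[p$head]]$tags <- c(p$nodes[[p$head]]$tags, label)
  p
}

p <- new_pipe(iris)
p <- add_tag(p, "input")
p <- chain(p, function(d) d[d$Species == "setosa", ])
p <- add_tag(p, "filter1")
p <- branch(p, nrow)
p <- add_tag(p, "nrow")   # lands on the filter1 node, not on the branch node
p <- chain(p, function(d) d[d$Sepal.Width > 3, ])
p <- add_tag(p, "filter2")

# Tags per node: node 2 carries both "filter1" and "nrow", node 3 has none.
tags_by_node <- lapply(p$nodes, function(n) n$tags)
```

Under that assumption the toy model reproduces the layout reported by get_tag(m): the branch node holds the row count but no tag, while its parent collects both tags.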

Having figured that out, I managed to write a custom branch-off-and-tag function that applies a function to the head, tags the result with the provided tag, and then resets to the previous head:

tag2 <- function(m, expr, tag) {

    ## get id of head 
    head_id <- which(igraph::vertex_attr(m@graph)$name == m@head)
    head_hash <- m@head

    ## check if head has tag; if not set, use hash
    head_tag <- get_tag(m, index = head_id) %>% unlist
    if (is.null(head_tag)) {
        m <- tag(m, head_hash)
        head_tag <- head_hash
    }

    ## evaluate expression
    cmd <- list(bind, substitute(m), substitute(expr))
    envir <- parent.frame()
    m <- eval(as.call(cmd), envir = envir)

    ## assign tag and revert to old head
    m <- m %>% tag(tag)
    m <- m %>% view(head_tag)
    m

}

Now I can quickly branch off one-off calculations.

m <- as_monad(as_tibble(iris)) %>% tag('input') %>>%
    dplyr::filter(Species == "setosa") %>% tag('filter1') %>%
    tag2(nrow, "nrow") %>>%
    dplyr::filter(Sepal.Width > 3) %>% tag("filter2")

Also, if I may ask, is there another framework that you're now using for managing data pipelines? I'm also in grad school now, and my previous project turned into tens of thousands of LoC of data processing steps which were very hard to keep track of, so before I start the next one I'm looking into different ways of managing the complexity. I also had a look at targets, chronicler and testthat, and I like rmonad the most so far. Any recommendations?

arendsee commented 1 year ago

There are pipeline programs for handling high-performance computing that focus on caching, cloud computing, provenance, and all that. Things like Nextflow. None of them are very pleasant creatures, though. I guess this is why everyone keeps spawning new ones. But these frameworks don't address the complexity problem. They often make the problem worse by adding wrappers around everything and an idiosyncratic control layer on top. It is hard to solve the complexity problem with a framework.

I think you are asking the right questions, though. It is nice to see someone else who isn't happy with normal approaches to dealing with complexity. Most in my field assume this mess is a fact of nature. I disagree though and believe we can write elegant and bug-free code. The trick is to build small functions that we know are correct and compose them into larger programs using operations we know are correct. This view is a bit heretical, though, in most circles.
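As a rough illustration of that idea, here is a minimal base R sketch. The helper names (keep_setosa, keep_wide_sepals, compose) are hypothetical and not part of rmonad or any other package; the point is only that each small function can be checked in isolation and then combined by one generic composition operation.

```r
# Hypothetical helpers for illustration; none of these names come from rmonad.
keep_setosa <- function(d) d[d$Species == "setosa", , drop = FALSE]
keep_wide_sepals <- function(d) d[d$Sepal.Width > 3, , drop = FALSE]

# compose(f, g, ...) returns a function that applies f, then g, and so on.
compose <- function(...) {
  fs <- list(...)
  function(x) Reduce(function(acc, f) f(acc), fs, x)
}

# Each piece can be checked on its own before it is trusted in a pipeline.
stopifnot(all(keep_setosa(iris)$Species == "setosa"))
stopifnot(all(keep_wide_sepals(iris)$Sepal.Width > 3))

# The larger program is just the composition of the checked pieces.
wide_setosa <- compose(keep_setosa, keep_wide_sepals)
result <- wide_setosa(iris)
```

Because compose does nothing except apply its arguments in order, any property that each step preserves also holds for the composed result.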