CovertLab / wcEcoli

Whole Cell Model of E. coli

Compilation of analysis plots #265

Open tahorst opened 6 years ago

tahorst commented 6 years ago

I think a great new feature would be the ability to compile multiple analysis plots together. This is something that the fathom tool will help with, but I think it could be extended further. Based on Eran's analysis plot and my own experience, I think this could improve development and reduce both redundant code and the chance of things getting out of sync across different files.

Specifically for Eran's plot, three of the left-hand plots are also created in other files. If we could drop those in, or save a reference to the plot object (probably too memory intensive), we could reduce code duplication and runtime.

There are also some single analysis plots (centralCarbonMetabolismCorrelationTimeCourse.pdf) that are repeated for multigen analysis (centralCarbonMetabolismCorrelationTimeCourse.pdf). It would be convenient to get the single analysis plots from multigen or compile the single into a multigen but this might be slightly more difficult.

Another way this could be useful is for doing validation. Typically, there are a few analysis plots that are already made that I want to check as I'm trying to add a new feature. It would be great to have all of these compiled into a single file instead of needing to search for a few different files to ensure everything looks fine before and after changes.

I don't know the best way of doing this, but I think something as simple as creating a new firetask that runs after all analysis scripts and specifies coordinates/sizes for already existing plots would be a great start. Wondering if other people think this is a good idea or not.
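To make the idea concrete, here's a rough sketch of what the layout spec for such a firetask could look like. All the names (PlotSpec, compile_layout) and the row/column grid are hypothetical, not existing wcEcoli code:

```python
# Hypothetical layout spec for a compilation firetask that runs after
# all analysis tasks and places already-rendered plots on a grid.
from dataclasses import dataclass

@dataclass
class PlotSpec:
    path: str        # an already-rendered plot file to drop in
    row: int         # grid position
    col: int
    rowspan: int = 1
    colspan: int = 1

def compile_layout(specs):
    """Return the (rows, cols) grid size needed to hold every plot."""
    n_rows = max(s.row + s.rowspan for s in specs)
    n_cols = max(s.col + s.colspan for s in specs)
    return n_rows, n_cols

specs = [
    PlotSpec("massFraction.pdf", row=0, col=0),
    PlotSpec("growthRate.pdf", row=0, col=1),
    PlotSpec("centralCarbonMetabolismCorrelationTimeCourse.pdf",
             row=1, col=0, colspan=2),
]
print(compile_layout(specs))  # grid needed: (2, 2)
```

The actual firetask would then read each referenced file and render it into its grid cell.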

eagmon commented 6 years ago

This would be a useful feature, to just plug existing analyses together for easy comparison. I would hold off on it for now though, until our collaboration with fathom comes to a close. At that point we will want to re-evaluate what analysis needs the fathom tool is not addressing, and focus the analysis scripts on that.

jmason42 commented 6 years ago

I agree with the spirit of this. Duplicated code and duplicated output do us no good. I don't see an easy way to compile plots, unfortunately. The obvious/easy way to save plots in a format that we can glue together is PDF or PNG, but that would require some complicated code and the formatting would look awful. (I'll admit to being curious whether you can pickle a matplotlib figure, but in practice I'm wholly opposed.) That said, if two analysis plots share the same code (or nearly enough), we would benefit from factoring that out. Like @eagmon said, the Fathom tool (does it have a name?) would probably be the best way to do this.
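For the curious: a figure does survive a pickle round trip in recent matplotlib versions. A quick demo, not an endorsement:

```python
# Round-trip a matplotlib figure through pickle, then confirm the
# restored copy still renders. A curiosity demo, not a storage proposal.
import io
import pickle
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], label="toy data")

blob = pickle.dumps(fig)       # serialize the whole figure
restored = pickle.loads(blob)  # rebuild it, e.g. in another process

buf = io.BytesIO()
restored.savefig(buf, format="png")  # the restored figure still renders
```

It works, but the pickles are large and tied to the matplotlib version that wrote them, which is one more argument for gluing data together rather than figure objects.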

This has got me thinking, though - we should do an inventory of analysis plots. I'll open up an issue.

prismofeverything commented 6 years ago

I think this is a great consideration, and I would go a step further and say we should decouple the analyses from the plots entirely. Right now, because they are coupled, we end up with the repetition of plots across analyses that you observe. If each analysis output only data, and each plot read some subset of the data output by the analyses, then they could vary independently.

In a way this is a general pattern of functional decomposition. If we consider each analysis as a function that accepts inputs and produces outputs, then we benefit from reducing the scope of each individual function: the outputs can be applied and analyzed further in ways we can't necessarily predict beforehand. It allows the output of one analysis to be an input to another later, as well as the input to a plot (the leaves of the tree, so to speak). We gain the flexibility to do additional analysis or plotting later, and a particular piece of functionality (as you observe here, specific plotting functions) can live in one place while potentially applying to any inputs of the right form. If it all happens in one pass we are on rails, and are forced to reproduce work and functionality.

There is an art to where you draw the lines between these functions, but one good starting point is this: once the result of each analysis is itself data, and that data is encoded in a small, defined set of shapes/formats, then you can define further analysis/plotting in terms of those formats, which makes more plots applicable to more analysis output. These common formats would provide guidance in how to design both analyses and plots. In a way you get a multiplicative effect of mutual applicability, as opposed to our more linear system now.
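As a toy illustration of what I mean (all names hypothetical, not existing wcEcoli code): analyses return plain data in a shared format, here a dict of named columns, and a single plotting function applies to any of them:

```python
# Toy sketch of decoupling analyses from plots. Each analysis returns a
# column-dict; any plot function accepting that format can render any
# analysis output, and analyses can feed each other.
import math

def growth_analysis(sim_out):
    """Analysis: simulation output in, plain data out. No plotting."""
    return {"time": sim_out["t"], "growth rate": sim_out["rate"]}

def doubling_analysis(growth_data):
    """Analyses can also consume other analyses' output."""
    return {"time": growth_data["time"],
            "doubling time": [math.log(2) / r
                              for r in growth_data["growth rate"]]}

def time_series_plot(data):
    """Plot: works on ANY column-dict that has a 'time' key."""
    series = [k for k in data if k != "time"]
    return "plotted %s vs time" % ", ".join(series)

sim_out = {"t": [0, 1, 2], "rate": [0.5, 0.6, 0.7]}
growth = growth_analysis(sim_out)
doubling = doubling_analysis(growth)
print(time_series_plot(growth))    # -> plotted growth rate vs time
print(time_series_plot(doubling))  # -> plotted doubling time vs time
```

One plotting function, reused across analyses it was never written for; that is the multiplicative effect.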

Granted, the above would take quite a bit of work to apply to our current situation, but it is something to keep in mind when creating new analyses and also when refactoring our current analyses and plotting. I would like to take a survey of our current analyses to see whether there is a handful of common data formats their output can be rendered to, and then a similar survey of the plotting functions to see if these can be generalized as well. The outcome would be that instead of writing plotting code for each new analysis we create, we could likely apply an already existing plotting function to it, saving everyone a lot of work, which I think would be a pretty big win.

jmason42 commented 6 years ago

I'm personally opposed to a dependency tree of analysis plots. I'd like to see a convincing use case; e.g. a situation where factoring out a shared function or saving data to a new Listener isn't the right approach. I could maybe bend on a high-level dependency chain e.g. single -> multigen -> cohort -> variant (I think that's the right order). However, at that point, we should probably separate out data aggregation from plot rendering.

prismofeverything commented 6 years ago

I'm personally opposed to a dependency tree of analysis plots.

You already have one, it is just not very well decoupled ; )

I'd like to see a convincing use case; e.g. a situation where factoring out a shared function or saving data to a new Listener isn't the right approach.

The situation Travis describes is already a convincing use case to me. The approach you describe, refactoring out shared functions, performs the same work, just in a different place. Either way, it sounds like we agree on what needs to happen structurally. I prefer decoupling functions into separate processes to promote parallelization and distribution of work, but however it happens, abstracting duplicated work into a single place is the solution to the problem.

jmason42 commented 6 years ago

Show me a convincing use case.

1fish2 commented 6 years ago

While the Fathom project redesigns the model's analysis and visualization, a little shell script could merge several PDFs into one using Ghostscript or Python pdftools.
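For instance, something along these lines, assuming `gs` (Ghostscript) is on the PATH; the file names are placeholders:

```python
# Sketch of the interim approach: concatenate several analysis PDFs into
# one via Ghostscript's pdfwrite device. Assumes `gs` is installed.
import subprocess

def gs_merge_command(inputs, output):
    """Return the argv list for a Ghostscript PDF merge."""
    return (["gs", "-dBATCH", "-dNOPAUSE", "-q",
             "-sDEVICE=pdfwrite", "-sOutputFile=" + output]
            + list(inputs))

def merge_pdfs(inputs, output):
    """Run the merge; raises CalledProcessError if gs fails."""
    subprocess.check_call(gs_merge_command(inputs, output))

cmd = gs_merge_command(["a.pdf", "b.pdf"], "combined.pdf")
print(" ".join(cmd))
```

The same merge could be done with a pure-Python PDF library instead of shelling out, which would avoid the Ghostscript dependency.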

@tahorst could you teach us what to look for in some plots, esp. the ones that are soft integration tests?

prismofeverything commented 6 years ago

Show me a convincing use case.

It is not about a particular use case, it is a general organizational principle of computational systems. The advantages are increased flexibility, increased parallelization, reduced duplication of work and multiplicative applicability of the results of one process being used as the inputs of another.

You are already using a workflow engine, so to some degree you must already believe in this principle, even if you don't recognize it as such. The question in my mind is not whether this principle is valuable (which is obvious to me) but to what level we decompose the work we have to do into separate processes, to maximize the benefit of the computational resources we have at our disposal. Unless you think we should go back to doing everything (fitter, simulation, analysis/plotting) in a single process, in which case I'm not sure how to save you.

If your argument is that all analyses and plotting should be done in the same process for some reason that is one thing, but based on our discussion yesterday about how we can't even finish running all of the analyses we have right now because we run out of resources part way through, I'm not sure how you justify that position.