End-to-end-provenance / dev

Repository for developer tools.

Project Release Planning #42

Closed: margoseltzer closed this issue 7 years ago

margoseltzer commented 8 years ago

Let's get serious about a CRAN-ready project release. I've put together a list of issues that I had notes on -- please feel free to comment, edit, add, etc.

Here are some topics we need to figure out:

  1. Packaging: Tentatively, it appears that we may need several separate packages:
     A) RDataTracker
     B) Visualizer (one or two of DDGExplorer/CamFlowR)
     C) CPL (but we can probably get by without this for the current release)
     D) Jena
     We need to convert examples-no-instrumentation into a bona fide test suite, and we need to move any other example code into a separate repository.
  2. Output format:

Changing our primary output format after release will be a huge problem, so we need to resolve this now. Do we switch to JSON as our official output format, or do we retain ddg.txt?

Regardless, we have to make sure that our JSON output is complete and correct; there are still some things we need to add.

We currently retain the entire prov graph in memory until execution completes. This inherently limits the size of analyses one can run; how can we get around this?
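
One possible direction (a sketch only, not how RDataTracker currently works): stream each finished node or edge to disk as it is recorded, so memory holds only the structures that are still open. The function and file names below are hypothetical, and a post-processing pass would still be needed to assemble the records into a single PROV-JSON document.

```r
# Sketch: append each completed provenance record to a JSON Lines file
# as soon as it is recorded, instead of accumulating the whole graph
# in memory. All names here are hypothetical.
library(jsonlite)

prov.file <- file("ddg.jsonl", open = "a")

record.node <- function(node) {
  # Serialize and flush immediately; memory use stays bounded by the
  # size of one record rather than the full provenance graph.
  writeLines(toJSON(node, auto_unbox = TRUE), prov.file)
  flush(prov.file)
}

record.node(list(id = "d1", type = "Data", name = "x", value.hash = "ab12"))
record.node(list(id = "p1", type = "Operation", name = "x <- 1"))
close(prov.file)
```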

  3. Granularity of recording:

Should we enable recording of cell-level provenance rather than just dataframe/vector level granularity?

MKLau commented 8 years ago

On a related tack, Thomas and I started an issue (https://github.com/End-to-end-provenance/RDataTracker/issues/168) for reviewing the CRAN policies (https://github.com/End-to-end-provenance/dev/wiki/Submitting-to-CRAN).

blernermhc commented 8 years ago
  1. On packaging: Jena is bundled in the ddg explorer jar file, so it would not need to be provided separately. Jena's license says: "Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution." The copyright notice is included in our Using DDGExplorer documentation.
  2. Output format. It is not obvious to me why changing the primary output format after a release would be a big problem. Could you say a little more about that?

Here are a couple of other issues:

MKLau commented 8 years ago

1) Packaging

It seems like good practice to tend toward modularity. I agree with the breakdown of packages, but I can't weigh in on Jena.

To start a discussion of the visualizers, I drafted a few pros and cons to help us think about which one we should put our main efforts behind:

DDGExplorer

CamFlow

Is there a good model for a "bona fide test suite" that someone could direct me to?

With regard to Barb's suggestion of adding to the regression tests, I would be up for helping to create the regression tests for the JSON-PROV output, but I couldn't take the lead on this.
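
For concreteness, here is a rough sketch of what such a regression test could look like using testthat. The script paths, the prov.json output name, and the choice to drop the agent section (which would hold run-specific environment details) are all assumptions for illustration, not the package's actual layout.

```r
# Sketch: re-run an instrumented example script and compare its
# PROV-JSON output against a stored baseline, ignoring volatile fields.
library(testthat)
library(jsonlite)

strip.volatile <- function(prov) {
  prov$agent <- NULL  # assumed to hold session details that vary per run
  prov
}

test_that("simple script reproduces the expected PROV-JSON", {
  source("examples/simple-script.R")  # assumed to write prov.json
  actual   <- strip.volatile(fromJSON("prov.json"))
  expected <- strip.volatile(fromJSON("tests/expected/simple-script.json"))
  expect_equal(actual, expected)
})
```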

On the tack of removing files, Emery and I are working on pulling out the extraneous files from RDataTracker (https://github.com/End-to-end-provenance/RDataTracker/issues/170).

And, related to this: vignettes. These are out of date, and they create overhead in the package for content that is duplicated in the wikis. Is there any reason to keep them? We can always add them back later.

2) Output Format

It seems to me that, regardless of the trouble of removing things later, we are committed to PROV-JSON because it is the standard provenance format. Given that, it should be the priority; keeping any other format would add complexity and divert development effort from the main project goal of getting people to use the tools.
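
For reference, the top-level structure of PROV-JSON is compact: named maps of entities, activities, and the relations between them. Here is a minimal skeleton built with jsonlite; the identifiers and the rdt namespace URL are made up for illustration.

```r
# Minimal PROV-JSON skeleton: entities, activities, and a
# wasGeneratedBy relation linking them (identifiers are invented).
library(jsonlite)

prov <- list(
  prefix = list(rdt = "http://example.org/rdt#"),
  entity = list(
    "rdt:d1" = list("rdt:name" = "x", "rdt:type" = "Data")
  ),
  activity = list(
    "rdt:p1" = list("rdt:name" = "x <- 1")
  ),
  wasGeneratedBy = list(
    "rdt:g1" = list("prov:entity" = "rdt:d1", "prov:activity" = "rdt:p1")
  )
)

cat(toJSON(prov, auto_unbox = TRUE, pretty = TRUE))
```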

3) scalability and granularity

Retaining the full provenance information in memory is likely to create issues. Several of the scripts that Annie worked on last summer hit the memory cap of R, which is already notorious for its poor handling of memory.

I have less experience with strategies and theory for provenance content, but diffing seems like the way to go.

Perhaps I don't fully understand the granularity question, but if the point of data provenance is knowing what happened to the data, then any change at any scale should have a record. This amounts to viewing each process that "handles" the data as an "owner", or I suppose "parent", of the data. In R, the smallest such scale would be a scalar value.
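
To make that concrete, here is a toy sketch of cell-level recording for a data frame: diff the values before and after an operation and emit one record per changed cell. cell.diff is a hypothetical helper, and NA handling and type coercion are glossed over.

```r
# Sketch: compare a data frame before and after an operation and
# record every changed cell (hypothetical helper; NAs not handled).
cell.diff <- function(before, after, op.label) {
  changed <- which(as.matrix(before) != as.matrix(after), arr.ind = TRUE)
  data.frame(
    row = changed[, "row"],
    col = colnames(after)[changed[, "col"]],
    old = as.matrix(before)[changed],
    new = as.matrix(after)[changed],
    op  = op.label,
    stringsAsFactors = FALSE
  )
}

df1 <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- df1
df2$b[2] <- 99
cell.diff(df1, df2, "b[2] <- 99")
#>   row col old new         op
#> 1   2   b   4  99 b[2] <- 99
```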

MKLau commented 7 years ago

Transcribing this discussion to milestones and issues in RDataTracker.