End-to-end-provenance / dev

Repository for developer tools.

Project Release Planning #42

Closed: margoseltzer closed this issue 7 years ago

margoseltzer commented 8 years ago

Let's get serious about a CRAN-ready project release. I've put together a list of issues that I had notes on -- please feel free to comment, edit, add, etc.

Here are some topics we need to figure out:

  1. Packaging: Tentatively, it appears that we may need several separate packages:
     A) RDataTracker
     B) Visualizer (one or two of DDGExplorer/CamFlowR)
     C) CPL (but we can probably get by without this for the current release)
     D) Jena
     We need to convert examples-no-instrumentation into a bona fide test suite, and we need to move any other example code into a separate repository.
  2. Output format:

Changing our primary output format after release will be a huge problem, so we need to resolve this now. Do we switch to JSON as our official output format, or do we retain ddg.txt?

Regardless, we have to make sure that our JSON output is complete and correct; there are still some things we need to add.

We currently retain the entire prov graph in memory until execution completes. This inherently limits the size of analyses one can run; how can we get around this?
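
One possible direction (a sketch only, not how RDataTracker currently works): stream each finished node or edge to disk as it is recorded, so memory holds only the structures that are still open. The function and file names below are hypothetical, and a post-processing pass would still be needed to assemble the records into a single PROV-JSON document.

```r
# Sketch: append each completed provenance record to a JSON Lines file
# as soon as it is recorded, instead of accumulating the whole graph
# in memory. All names here are hypothetical.
library(jsonlite)

prov.file <- file("ddg.jsonl", open = "a")

record.node <- function(node) {
  # Serialize and flush immediately; memory use stays bounded by the
  # size of one record rather than the full provenance graph.
  writeLines(toJSON(node, auto_unbox = TRUE), prov.file)
  flush(prov.file)
}

record.node(list(id = "d1", type = "Data", name = "x", value.hash = "ab12"))
record.node(list(id = "p1", type = "Operation", name = "x <- 1"))
close(prov.file)
```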

  3. Granularity of recording:

Should we enable recording of cell-level provenance rather than just dataframe/vector level granularity?

MKLau commented 8 years ago

On a related tack, Thomas and I started an issue (https://github.com/End-to-end-provenance/RDataTracker/issues/168) for reviewing the CRAN policies (https://github.com/End-to-end-provenance/dev/wiki/Submitting-to-CRAN).

blernermhc commented 8 years ago
  1. On packaging: Jena is bundled in the ddg explorer jar file, so it would not need to be provided separately. Jena's license says: "Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution." The copyright notice is included in our Using DDGExplorer documentation.
  2. Output format. It is not obvious to me why changing the primary output format after a release would be a big problem. Could you say a little more about that?

Here are a couple of other issues:

MKLau commented 8 years ago

1) Packaging

It seems like good practice to tend toward modularity. I agree with the breakdown of packages, but I can't weigh in on Jena.

To start a discussion of the visualizers, I drafted a few pros and cons to help us think about which one we should put our main efforts behind:

DDGExplorer

CamFlow

Is there a good model for a "bona fide test suite" that someone could direct me to?

With regard to Barb's suggestion of adding to the regression tests, I would be up for helping to create the regression tests for the JSON-PROV output, but I couldn't take the lead on this.
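
For concreteness, here is a rough sketch of what such a regression test could look like using testthat. The script paths, the prov.json output name, and the choice to drop the agent section (which would hold run-specific environment details) are all assumptions for illustration, not the package's actual layout.

```r
# Sketch: re-run an instrumented example script and compare its
# PROV-JSON output against a stored baseline, ignoring volatile fields.
library(testthat)
library(jsonlite)

strip.volatile <- function(prov) {
  prov$agent <- NULL  # assumed to hold session details that vary per run
  prov
}

test_that("simple script reproduces the expected PROV-JSON", {
  source("examples/simple-script.R")  # assumed to write prov.json
  actual   <- strip.volatile(fromJSON("prov.json"))
  expected <- strip.volatile(fromJSON("tests/expected/simple-script.json"))
  expect_equal(actual, expected)
})
```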

On the tack of removing files, Emery and I are working on pulling out the extraneous files from RDataTracker (https://github.com/End-to-end-provenance/RDataTracker/issues/170).

And, related to this: vignettes. These are out of date, and they create overhead in the package for content that is duplicated in the wikis. Is there any reason to keep them? We can always add them back later.

2) Output Format

It seems to me that, regardless of the trouble of removing things later, we are committed to PROV-JSON because it is the standard provenance format. Given that, it should be the priority; keeping any other format would add complexity and divert development effort from the main project goal of getting people to use the tools.
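
For reference, the top-level structure of PROV-JSON is compact: named maps of entities, activities, and the relations between them. Here is a minimal skeleton built with jsonlite; the identifiers and the rdt namespace URL are made up for illustration.

```r
# Minimal PROV-JSON skeleton: entities, activities, and a
# wasGeneratedBy relation linking them (identifiers are invented).
library(jsonlite)

prov <- list(
  prefix = list(rdt = "http://example.org/rdt#"),
  entity = list(
    "rdt:d1" = list("rdt:name" = "x", "rdt:type" = "Data")
  ),
  activity = list(
    "rdt:p1" = list("rdt:name" = "x <- 1")
  ),
  wasGeneratedBy = list(
    "rdt:g1" = list("prov:entity" = "rdt:d1", "prov:activity" = "rdt:p1")
  )
)

cat(toJSON(prov, auto_unbox = TRUE, pretty = TRUE))
```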

3) scalability and granularity

Retaining the full provenance information in memory is likely to create issues. Several of the scripts that Annie worked on last summer hit the memory cap of R, which is already notorious for its poor handling of memory.

I have less experience with strategies and theory for provenance content, but diffing seems like the way to go.

Perhaps I don't fully understand the granularity question, but if the point of data provenance is knowing what happened to the data, then any change at any scale should have a record. This amounts to viewing each process that "handles" the data as an "owner", or I suppose "parent", of the data. In R, the smallest such scale would be a scalar value.
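
To make that concrete, here is a toy sketch of cell-level recording for a data frame: diff the values before and after an operation and emit one record per changed cell. cell.diff is a hypothetical helper, and NA handling and type coercion are glossed over.

```r
# Sketch: compare a data frame before and after an operation and
# record every changed cell (hypothetical helper; NAs not handled).
cell.diff <- function(before, after, op.label) {
  changed <- which(as.matrix(before) != as.matrix(after), arr.ind = TRUE)
  data.frame(
    row = changed[, "row"],
    col = colnames(after)[changed[, "col"]],
    old = as.matrix(before)[changed],
    new = as.matrix(after)[changed],
    op  = op.label,
    stringsAsFactors = FALSE
  )
}

df1 <- data.frame(a = c(1, 2), b = c(3, 4))
df2 <- df1
df2$b[2] <- 99
cell.diff(df1, df2, "b[2] <- 99")
#>   row col old new         op
#> 1   2   b   4  99 b[2] <- 99
```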

MKLau commented 7 years ago

Transcribing this discussion to milestones and issues in RDataTracker.