Closed margoseltzer closed 7 years ago
On a related tack, Thomas and I started an issue (https://github.com/End-to-end-provenance/RDataTracker/issues/168) for reviewing the CRAN policies (https://github.com/End-to-end-provenance/dev/wiki/Submitting-to-CRAN).
Here are a couple of other issues:
1) Packaging
It seems like good practice to tend toward modularity. I agree with the breakdown of packages, but I can't weigh in on Jena.
To open a discussion of the visualizers, I drafted a few pros and cons to help us think about where to put our main efforts:
ddgexplorer
camflow
Is there a good model for a "bonafide test suite" that someone could direct me to?
With regard to Barb's suggestion of adding to the regression tests, I would be up for helping to create the regression test for PROV-JSON, but I couldn't take the lead on this.
On the tack of removing files, Emery and I are working on pulling out the extraneous files from RDataTracker (https://github.com/End-to-end-provenance/RDataTracker/issues/170).
And, related to this: vignettes. These are out of date, and they add overhead to the package for content that is duplicated in the wikis. Any reason to keep them? We can always add them back later.
2) Output Format
Seems to me that, regardless of the trouble of removing things later, we're committed to PROV-JSON because it's the standard provenance format. Based on that, it should be the priority; keeping any other format adds complexity and could divert development effort from the main project goal of getting people to use the tools.
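For anyone who hasn't looked at the format yet, here is a minimal sketch of what a PROV-JSON document looks like, following the W3C PROV-JSON serialization (top-level keys name record types, each mapping identifiers to records). The identifiers `ex:data1`, `ex:step1`, and `_:gen1` are made up for illustration; RDataTracker's actual output would use its own naming scheme.

```python
import json

# Minimal PROV-JSON document: one entity, one activity, and a
# wasGeneratedBy relation linking them. All identifiers are hypothetical.
doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:data1": {"prov:label": "input data frame"}
    },
    "activity": {
        "ex:step1": {"prov:label": "filter rows"}
    },
    "wasGeneratedBy": {
        "_:gen1": {"prov:entity": "ex:data1", "prov:activity": "ex:step1"}
    },
}

# Serializes cleanly and round-trips through a standard JSON parser,
# which is the point of adopting the standard format.
serialized = json.dumps(doc, indent=2)
print(serialized)
```

A regression test for the JSON output could start from exactly this kind of round-trip check, then compare record types and counts against a known-good run.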
3) scalability and granularity
Retaining the full prov information in memory is likely to create issues. Several of the scripts that Annie worked on last summer hit the memory cap of R, which is already notorious for its poor handling of memory.
I have less experience with strategies and theory for provenance content, but diffing seems the way to go.
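To make the diffing idea concrete, here is a hypothetical sketch of recording only the values that changed between two states of a data object, rather than a full snapshot after every operation. The function name `diff` and the dict-based representation are illustrative only, not anything RDataTracker currently does.

```python
def diff(old, new):
    """Return the keys that were added/changed and the keys that were removed,
    so provenance can store deltas instead of whole-object snapshots."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    removed = [k for k in old if k not in new]
    return {"changed": changed, "removed": removed}

before = {"a": 1, "b": 2, "c": 3}
after = {"a": 1, "b": 20, "d": 4}
print(diff(before, after))
# {'changed': {'b': 20, 'd': 4}, 'removed': ['c']}
```

For a script that repeatedly modifies a few cells of a large data frame, storing deltas like this should shrink the retained provenance dramatically compared with snapshotting the whole frame each time.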
Perhaps I don't fully understand the granularity question, but if the goal is data provenance -- knowing what happened to the data -- then any change at any scale should have a record. That amounts to treating each process that "handles" the data as an "owner" (or, I suppose, "parent") of the data. In R, I would think the smallest such scale is a scalar value.
Transcribing this discussion to milestones and issues in RDataTracker.
Let's get serious about a CRAN-ready project release. I've put together a list of issues that I had notes on -- please feel free to comment, edit, add, etc.
Here are some topics we need to figure out:
Changing our primary output format after release will be a huge problem, so we need to resolve this now. Do we switch to JSON as our official output format, or do we retain ddg.txt as the official output format?
Regardless, we have to make sure that our JSON output is complete and correct -- some things that we need to add:
We currently retain the entire prov graph in memory until execution completes. This inherently limits the size of analyses one can run; how can we get around this?
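One way around the memory cap would be to stream provenance records to disk as they are created, keeping only the current working set in memory. The sketch below is a hypothetical illustration (the `ProvWriter` class and JSON-lines layout are made up, not part of RDataTracker); the point is that appending one record per line bounds memory use by the record size, not the graph size.

```python
import json
import os
import tempfile

class ProvWriter:
    """Appends each provenance node to disk as one JSON line,
    so the full graph never has to live in memory at once."""

    def __init__(self, path):
        self.path = path

    def record(self, node):
        with open(self.path, "a") as f:
            f.write(json.dumps(node) + "\n")

# Write three nodes one at a time; memory holds only the current node.
path = os.path.join(tempfile.mkdtemp(), "prov.jsonl")
writer = ProvWriter(path)
for i in range(3):
    writer.record({"id": f"n{i}", "type": "Data"})

with open(path) as f:
    print(len(f.readlines()))  # 3
```

A post-processing step could then assemble the JSON-lines file into a single PROV-JSON document at the end of execution, or the lines could be consumed directly by a visualizer.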
Should we enable recording of cell-level provenance rather than just dataframe/vector level granularity?