Removal strategy - Githubissues

MKLau commented 7 years ago

The best way to remove functions is to remove functionality one piece at a time. If this is done with correct commits and branching, these features should be preserved and allow for further development and inclusion at a future date.

MKLau commented 7 years ago

Identify functions associated with particular features (core, internals, snapshotting, writing)
For each feature not in the core (i.e. dependencies of prov.capture and prov.json), remove the functions in a series of commits with thorough commenting
Remove the functions from previously dependent functions that are in the core
Examine code structure and identify modules and places to re-organize and refactor
Update function input arguments and output
Update documentation

After giving this some though, my preference is to start from the current version of provR, as there has already been been feature updates done here.

Thoughts on any/all of the above?

fong22e commented 7 years ago

I also wondering if perhaps we should also think about re-designing how provR is structured? As in things like how data is represented, which functions live in which files, whether we should have more objects?

MKLau commented 7 years ago

Good suggestions, I’m thinking this could happen after simplifying. What do you think?

fong22e commented 7 years ago

Hmm...well, what I am thinking is that rather than directly working on provR, maybe it is better (and easier) if we say, work on a separate set of files and copy from provR and RDataTracker as needed. I am not sure if I am getting my idea across properly but I tend to find it easier to add rather than remove. In that way, we can restructure at the same time.

MKLau commented 7 years ago

Ok, to clarify a bit, are you suggesting pulling features from rdatatracker and adding them into provR?

fong22e commented 7 years ago

Oh not at all! I'm saying that perhaps we should start from a blank provR and pick and choose what to add into it from the existing provR or RDataTracker. In that way, we can have a clearer picture on how everything is structured and how it all fits together. Does that kind of make sense?

MKLau commented 7 years ago

Oh yeah, that's what I was thinking you were saying.

MKLau commented 7 years ago

Thanks for the helpful discussion on this. Based on Elizabeth's suggestion, I spent some time looking at the code for RDT and provR again, and I also took a look at the work/experimentation that Emery's doing on the RDT@lite branch he just opened up yesterday.

Something that I realized in this is that I neglected to include in the plan that changes to the code should be happening as pull requests, which would allow us to control the feature changes (whether adds or removals). Thus, modularization would happen in the code history via the pull request system, with smaller changes occurring in the commits.

From my vantage, I'm seeing that provR is providing insights that will feed back to help with the refactoring that should happen with RDT by reducing the code base complexity and put forward a rock solid capture tool for R. All features that are getting pruned back will be preserved as they are functioning in RDT and in the commit history.

At this point, my sense is that it's really complicated to dig into the code of even provR let alone RDT given the tangle. Seems like it would prevent a significant loss of time to work from the current provR and continue to prune. Given that the feature changes are not lost code, but preserved in two code bases, I want to move forward with the continued evolution of provR.

However, I want there to be clear proposals of how features are being removed. This way, we can all learn from the process.

Since Emery (@erboose) is doing experimentation in RDT still and Elizabeth (@efong22e) is working on JSON issues (thanks to both of you!), keep running with those things.

Thomas (@tfjmp), go ahead and start with some testing and mapping of the first round of changes you want to make. I'll create a pull request template this afternoon/evening.

Thanks!

@erboose @fong22e @blernermhc @tfjmp

erboose commented 7 years ago

Here are some features currently supported in RDataTracker. Which (if any) of these features should be preserved in provR?

Capture information on computing environment Capture provenance for console sessions Capture provenance for sourced scripts Capture provenance for RMarkdown scripts Capture provenance inside functions Capture provenance inside control constructs

Save simple data values to JSON string Save R error & warning messages to JSON string Save JSON string to file system

Save snapshots of complex data values to file system Save copies of input & output files to file system Save copies of main & sourced scripts to file system Save copies of screen graphics to file system Save hash table of input & output files to file system

Create start & finish nodes for logical sections of code Set & clear breakpoints for script execution Display current provenance graph in DDG Explorer Display RDT debugging information in R console

MKLau commented 7 years ago

Thanks Emery for posting these.

Here's what I think should be done:

KEEP

Capture information on computing environment
Capture provenance for sourced scripts
Capture = "Save" R error & warning messages as JSON
Last, I was not award until recently that there are functions inside of RDT that are used to debug the package, if they cannot be implemented outside of the package, then they should be kept

??? Capture provenance inside control constructs

I have been advocating to remove this, but after talking with Thomas, it sounds as though this is necessary for completely representing a graph

??? Capture provenance for console sessions

If this works it would be great to have, if it currently is not working it should be removed and worked on a fork

REMOVE:

Anything does not relate to a supported analytical application
Since provR is intended only for caputuring provenance, nothing should be written to disk
Again, since provR is only for capture, any feature that does not relate to capture (e.g., calling a the visualizer) should be moved to a new package

So, the following should be removed:

R Capture provenance for RMarkdown scripts R Capture provenance inside functions

R Save simple data values to JSON string R Save JSON string to file system

R Save snapshots of complex data values to file system R Save copies of input & output files to file system R Save copies of main & sourced scripts to file system R Save copies of screen graphics to file system R Save hash table of input & output files to file system

R Set & clear breakpoints for script execution R Create start & finish nodes for logical sections of code

R Display current provenance graph in DDG Explorer R Display RDT debugging information in R console

MKLau commented 7 years ago

@tfjmp Hey Thomas, please move ahead when you have time with removing the above marked with R for REMOVE. I've already just merged pull request #66 to remove the save API.

erboose commented 7 years ago

Note that the current strategy for collecting provenance for sourced scripts and inside control constructs requires modifying the user's script.

Sent from my iPhone

On Nov 6, 2017, at 4:34 PM, MKLau notifications@github.com<mailto:notifications@github.com> wrote:

KEEP

Capture information on computing environment
Capture provenance for sourced scripts
Capture = "Save" R error & warning messages as JSON
Last, I was not award until recently that there are functions inside of RDT that are used to debug the package, if they cannot be implemented outside of the package, then they should be kept

??? Capture provenance inside control constructs

I have been advocating to remove this, but after talking with Thomas, it sounds as though this is necessary for completely representing a graph

??? Capture provenance for console sessions

If this works it would be great to have, if it currently is not working it should be removed and worked on a fork

REMOVE:

Anything does not relate to a supported analytical application
Since provR is intended only for caputuring provenance, nothing should be written to disk
Again, since provR is only for capture, any feature that does not relate to capture (e.g., calling a the visualizer) should be moved to a new package

So, the following should be removed:

R Capture provenance for RMarkdown scripts R Capture provenance inside functions

R Save simple data values to JSON string R Save JSON string to file system

R Save snapshots of complex data values to file system R Save copies of input & output files to file system R Save copies of main & sourced scripts to file system R Save copies of screen graphics to file system R Save hash table of input & output files to file system

R Set & clear breakpoints for script execution R Create start & finish nodes for logical sections of code

R Display current provenance graph in DDG Explorer R Display RDT debugging information in R console

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ProvTools_provR_issues_59-23issuecomment-2D342294759&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=ag7jIu6cXZRMQOvOev7l3B6_QsTS8XUZS_fW0in0S58&s=qAKHTJ3nCugtSWh1g_0xV9lmEKDGn0V0R-VfgzcSBAg&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHHAKzXR9HnDzlz1i2sDRi4vaDqLPKH8ks5sz3tcgaJpZM4QLclG&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=ag7jIu6cXZRMQOvOev7l3B6_QsTS8XUZS_fW0in0S58&s=EqO7vavJzm1MnwAtcLcyTUz4aPQ_at1Vxc5CvFAl_88&e=.

MKLau commented 7 years ago

Does this mean that both of these features require writing to disk as they're currently implemented?

erboose commented 7 years ago

No. But they do require executing a modified version of the original script.

Sent from my iPhone

On Nov 6, 2017, at 6:32 PM, MKLau notifications@github.com<mailto:notifications@github.com> wrote:

Does this mean that both of these features require writing to disk as they're currently implemented?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ProvTools_provR_issues_59-23issuecomment-2D342323928&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=6CXCwjJEOodFC3J3AGKYHFRLoASxQARazmxCAxVW3TI&s=60-uNyMQqQRz9XdO316y5itotyr-Ww9Ej8RnNJ78Sfs&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHHAK9yOt8Sq01nkPfqKULIoa4BGb4Njks5sz5cSgaJpZM4QLclG&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=6CXCwjJEOodFC3J3AGKYHFRLoASxQARazmxCAxVW3TI&s=zooUGPESZDZ9x-FtVvRw2rHEONsTlSidYijbFNzGHpg&e=.

erboose commented 7 years ago

Sourced scripts are currently handled by replacing "source" with "ddg.source". But there may be other solutions that would avoid the script modification problem.

MKLau commented 7 years ago

Is the modified script passed to source in memory?

erboose commented 7 years ago

The original script or sourced script is read into memory and parsed. For each parsed statement a DDG statement object is created that contains both the original parsed statement and a modified (annotated) version of the parsed statement. It is the annotated version that is actually executed. All of this happens in memory.

MKLau commented 7 years ago

Ok great, then that feature should be fine to keep. Thanks for clarifying.

ProvTools / provR

Removal strategy #59