Closed MKLau closed 6 years ago
After giving this some though, my preference is to start from the current version of provR, as there has already been been feature updates done here.
Thoughts on any/all of the above?
I also wondering if perhaps we should also think about re-designing how provR is structured? As in things like how data is represented, which functions live in which files, whether we should have more objects?
Good suggestions, I’m thinking this could happen after simplifying. What do you think?
Hmm...well, what I am thinking is that rather than directly working on provR, maybe it is better (and easier) if we say, work on a separate set of files and copy from provR and RDataTracker as needed. I am not sure if I am getting my idea across properly but I tend to find it easier to add rather than remove. In that way, we can restructure at the same time.
Ok, to clarify a bit, are you suggesting pulling features from rdatatracker and adding them into provR?
Oh not at all! I'm saying that perhaps we should start from a blank provR and pick and choose what to add into it from the existing provR or RDataTracker. In that way, we can have a clearer picture on how everything is structured and how it all fits together. Does that kind of make sense?
Oh yeah, that's what I was thinking you were saying.
Thanks for the helpful discussion on this. Based on Elizabeth's suggestion, I spent some time looking at the code for RDT and provR again, and I also took a look at the work/experimentation that Emery's doing on the RDT@lite branch he just opened up yesterday.
Something that I realized in this is that I neglected to include in the plan that changes to the code should be happening as pull requests, which would allow us to control the feature changes (whether adds or removals). Thus, modularization would happen in the code history via the pull request system, with smaller changes occurring in the commits.
From my vantage, I'm seeing that provR is providing insights that will feed back to help with the refactoring that should happen with RDT by reducing the code base complexity and put forward a rock solid capture tool for R. All features that are getting pruned back will be preserved as they are functioning in RDT and in the commit history.
At this point, my sense is that it's really complicated to dig into the code of even provR let alone RDT given the tangle. Seems like it would prevent a significant loss of time to work from the current provR and continue to prune. Given that the feature changes are not lost code, but preserved in two code bases, I want to move forward with the continued evolution of provR.
However, I want there to be clear proposals of how features are being removed. This way, we can all learn from the process.
Since Emery (@erboose) is doing experimentation in RDT still and Elizabeth (@efong22e) is working on JSON issues (thanks to both of you!), keep running with those things.
Thomas (@tfjmp), go ahead and start with some testing and mapping of the first round of changes you want to make. I'll create a pull request template this afternoon/evening.
Thanks!
@erboose @fong22e @blernermhc @tfjmp
Here are some features currently supported in RDataTracker. Which (if any) of these features should be preserved in provR?
Capture information on computing environment Capture provenance for console sessions Capture provenance for sourced scripts Capture provenance for RMarkdown scripts Capture provenance inside functions Capture provenance inside control constructs
Save simple data values to JSON string Save R error & warning messages to JSON string Save JSON string to file system
Save snapshots of complex data values to file system Save copies of input & output files to file system Save copies of main & sourced scripts to file system Save copies of screen graphics to file system Save hash table of input & output files to file system
Create start & finish nodes for logical sections of code Set & clear breakpoints for script execution Display current provenance graph in DDG Explorer Display RDT debugging information in R console
Thanks Emery for posting these.
Here's what I think should be done:
KEEP
??? Capture provenance inside control constructs
??? Capture provenance for console sessions
REMOVE:
So, the following should be removed:
R Capture provenance for RMarkdown scripts R Capture provenance inside functions
R Save simple data values to JSON string R Save JSON string to file system
R Save snapshots of complex data values to file system R Save copies of input & output files to file system R Save copies of main & sourced scripts to file system R Save copies of screen graphics to file system R Save hash table of input & output files to file system
R Set & clear breakpoints for script execution R Create start & finish nodes for logical sections of code
R Display current provenance graph in DDG Explorer R Display RDT debugging information in R console
@tfjmp Hey Thomas, please move ahead when you have time with removing the above marked with R for REMOVE. I've already just merged pull request #66 to remove the save API.
Note that the current strategy for collecting provenance for sourced scripts and inside control constructs requires modifying the user's script.
Sent from my iPhone
On Nov 6, 2017, at 4:34 PM, MKLau notifications@github.com<mailto:notifications@github.com> wrote:
KEEP
??? Capture provenance inside control constructs
??? Capture provenance for console sessions
REMOVE:
So, the following should be removed:
R Capture provenance for RMarkdown scripts R Capture provenance inside functions
R Save simple data values to JSON string R Save JSON string to file system
R Save snapshots of complex data values to file system R Save copies of input & output files to file system R Save copies of main & sourced scripts to file system R Save copies of screen graphics to file system R Save hash table of input & output files to file system
R Set & clear breakpoints for script execution R Create start & finish nodes for logical sections of code
R Display current provenance graph in DDG Explorer R Display RDT debugging information in R console
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ProvTools_provR_issues_59-23issuecomment-2D342294759&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=ag7jIu6cXZRMQOvOev7l3B6_QsTS8XUZS_fW0in0S58&s=qAKHTJ3nCugtSWh1g_0xV9lmEKDGn0V0R-VfgzcSBAg&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHHAKzXR9HnDzlz1i2sDRi4vaDqLPKH8ks5sz3tcgaJpZM4QLclG&d=DwMFaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=ag7jIu6cXZRMQOvOev7l3B6_QsTS8XUZS_fW0in0S58&s=EqO7vavJzm1MnwAtcLcyTUz4aPQ_at1Vxc5CvFAl_88&e=.
Does this mean that both of these features require writing to disk as they're currently implemented?
No. But they do require executing a modified version of the original script.
Sent from my iPhone
On Nov 6, 2017, at 6:32 PM, MKLau notifications@github.com<mailto:notifications@github.com> wrote:
Does this mean that both of these features require writing to disk as they're currently implemented?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ProvTools_provR_issues_59-23issuecomment-2D342323928&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=6CXCwjJEOodFC3J3AGKYHFRLoASxQARazmxCAxVW3TI&s=60-uNyMQqQRz9XdO316y5itotyr-Ww9Ej8RnNJ78Sfs&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHHAK9yOt8Sq01nkPfqKULIoa4BGb4Njks5sz5cSgaJpZM4QLclG&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=kbSzXk5uuINmNoo4MekzCE-8oe-IGwHYbug8SQuMN3U&m=6CXCwjJEOodFC3J3AGKYHFRLoASxQARazmxCAxVW3TI&s=zooUGPESZDZ9x-FtVvRw2rHEONsTlSidYijbFNzGHpg&e=.
Sourced scripts are currently handled by replacing "source" with "ddg.source". But there may be other solutions that would avoid the script modification problem.
Is the modified script passed to source in memory?
The original script or sourced script is read into memory and parsed. For each parsed statement a DDG statement object is created that contains both the original parsed statement and a modified (annotated) version of the parsed statement. It is the annotated version that is actually executed. All of this happens in memory.
Ok great, then that feature should be fine to keep. Thanks for clarifying.
The best way to remove functions is to remove functionality one piece at a time. If this is done with correct commits and branching, these features should be preserved and allow for further development and inclusion at a future date.