DARPA-ASKE / info-and-links


Another question on the Revised Sketch: extraction from executable code #2

Open crapo opened 5 years ago

crapo commented 5 years ago

The revised sketch (pg 4) apparently makes no distinction between executable code in the center column, which I'm kind of seeing as the ASKE knowledge base, and other code from which one might do extraction. I'm assuming that we don't see code as a "semi-structured source". I see two significant differences between these two categories of code.

  1. The code in the center column has been curated whereas other code from which we wish to do extraction is a potential but as yet uncurated source of scientific knowledge.
  2. The code in the center column is known to be executable whereas arbitrary source code from which we may do extraction may not be because, for example, we do not have access to all of the imported libraries.

How am I seeing the sketch incorrectly?

pmjoshua commented 5 years ago

For context, I'm attempting to post the sketch here. Let's see if it works.

It is our intention that an application of the ASKE tools could start from scratch, formulating a model, choosing frameworks, etc. or could start from an existing model already expressed as executable source code. A fundamental premise of the program is that there is a vast amount of useful scientific knowledge that is "locked away" in existing code. This includes codes that are poorly designed and may not be easily parsed, but we are hoping we can develop sufficiently robust tools for many or most cases. The Arizona group is working on this explicitly and the Siemens group is doing code extraction for the ML domain.

Currently we are assuming that the code is executable, that is, that we have access to all of the libraries it depends on. This is potentially important because the libraries give us additional information about the modeling framework and implementation choices.

For a project that starts with an existing model code (or perhaps even a family of models that purport to study the same system), we want to be able to move "up the stack", extracting the knowledge to a structured model format and then abstracting from there to the domain formulations. Hopefully a domain expert will be able to work with ASKE tools to accomplish this and then, once that is done, will be able to take advantage of all the awesomeness to build better, stronger, faster models and use them in better, stronger, faster ways.

Does this help?

[image: sketch]

crapo commented 5 years ago

That certainly clarifies what the sketch represents. I have been thinking that some code will be easily enough executed to support this view, and can contribute knowledge through extraction and/or through execution in its original form. However, I have thought that when it comes to mining all available sources there will be a lot of legacy code that may be quite difficult to execute but is still potentially useful from an extraction point of view. In other words, I thought when it came to arbitrary legacy source code, the emphasis was on extraction of scientific knowledge from, but not necessarily execution of, the code.
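To make the extraction-without-execution case concrete, here is a minimal sketch in Python. It assumes a made-up legacy snippet (`LEGACY_SOURCE`) that imports a library we do not have; the names `some_unavailable_library`, `kinetic_energy`, `extract_functions`, and `can_execute` are all hypothetical and are not part of any ASKE tool. The point is only that parsing can recover function signatures and return expressions even when running the code fails. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast

# Hypothetical legacy snippet: it imports a library that is not installed,
# so it cannot be executed, but it still parses.
LEGACY_SOURCE = """
import some_unavailable_library as sul

def kinetic_energy(m, v):
    return 0.5 * m * v ** 2
"""


def extract_functions(source):
    """Statically collect each function's argument names and return expressions."""
    tree = ast.parse(source)
    found = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            returns = [
                ast.unparse(n.value)
                for n in ast.walk(node)
                if isinstance(n, ast.Return) and n.value is not None
            ]
            found[node.name] = {"args": args, "returns": returns}
    return found


def can_execute(source):
    """Report whether the source runs; a missing import makes this False."""
    try:
        exec(source, {})
        return True
    except ImportError:
        return False


extracted = extract_functions(LEGACY_SOURCE)
executable = can_execute(LEGACY_SOURCE)
```

Here `extracted` still describes `kinetic_energy` and its arguments, while `executable` is false because the import cannot be resolved. Real legacy code would of course need far more robust handling than this.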

pmjoshua commented 5 years ago

Don't get me wrong, I very much hope that we can harvest the knowledge from legacy code independent of whether it can be executed in its given form. In fact, one of my fantasies is that we would eventually have systems robust enough to extract the knowledge from what is probably the worst-written behemoth of a model that has ever been conceived: the DOE EIA National Energy Modeling System (NEMS; https://www.eia.gov/outlooks/aeo/info_nems_archive.php). This model has a really interesting history and is mandated by law to be run every year or so. This, combined with some very bad design decisions in the early 90s, has led to a behemoth Frankenstein code that is a million or more lines of Fortran and only runs on one specific machine running Windows 95 server or some such (the site actually claims it's now compatible with Windows Server 2012). To my knowledge, no one outside this one office in EIA has ever actually gotten it to run, even though all of it (except for some lame CGE economic model component created by a consulting firm) is open source. Over the years members of Congress have forced them to add representations of whatever pet energy project they have in their district, and the result has been an onion with so many layers that no one dares to try to peel it.

It's my white whale...

jpfairbanks commented 5 years ago

In terms of using code that you can execute, there are definitely pieces of information that are really hard to get out of code with static analysis, but easy to get at with dynamic analysis. So I think that having the code in a form you can execute is important. This is probably more important for fully dynamic languages like Python and R: there are a lot of Python libraries that do some variation of run-time metaprogramming using eval or decorators. Code in fully static languages like C and Fortran should be mostly analyzable with static techniques.
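As a rough illustration of that gap, the sketch below compares what a static pass and a dynamic pass see in the same small Python snippet. Everything in it is made up for the example (`SOURCE`, the `REGISTRY` dictionary, the `register` decorator, and the two helper functions); it is not taken from any particular library. The static pass lists function definitions from the syntax tree, while the dynamic pass runs the code and reads the registry that the decorator and `eval` populate.

```python
import ast

# Hypothetical model code: names are registered through a decorator and
# through eval, so they only exist once the code has run.
SOURCE = """
REGISTRY = {}

def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("growth")
def f(x):
    return 0.5 * x

for k in (1, 2):
    REGISTRY["scaled_%d" % k] = eval("lambda x: %d * x" % k)
"""


def static_function_names(source):
    """Static view: names of all 'def' statements found in the syntax tree."""
    tree = ast.parse(source)
    return sorted(
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    )


def dynamic_registry(source):
    """Dynamic view: execute the code and return the populated registry."""
    namespace = {}
    exec(source, namespace)
    return namespace["REGISTRY"]


static_names = static_function_names(SOURCE)
registry = dynamic_registry(SOURCE)
dynamic_names = sorted(registry)
```

The static pass finds only the literal definitions (`f`, `register`, `wrap`), whereas the dynamic pass finds the names the model actually exposes (`growth`, `scaled_1`, `scaled_2`). A smarter static analysis could recover the decorator argument, but the `eval`-generated entries are much harder to recover without running the code.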

crapo commented 5 years ago

Given that both are relevant in particular contexts, we will frame the problem to include: 1) code that is executable within the framework, as both an extraction source and an output, and 2) non-executable code as a structured extraction source existing outside the framework.