NCEAS / metadig-engine

MetaDig Engine: multi-dialect metadata assessment engine

ensure check engine can use full suite of libraries #12

Closed mbjones closed 8 years ago

mbjones commented 8 years ago

In the check engine implementation, we need to ensure that all commonly used libraries and packages can be used by the check author. For example, within R, all of the common data manipulation packages must be accessible to check authors, as well as common data manipulation and visualization packages.

Some examples in R that need to be accessible:

Some example Python libraries that need to be accessible:

Let's discuss a complete list of requirements so that we can ensure the execution engine will be able to meet our check-writing needs.

gothub commented 8 years ago

This can be added as a junit test.

leinfelder commented 8 years ago

Well, Renjin does not do any graphics, so that's out if this is a hard requirement. I don't know if Jython allows graphical output using the ScriptEngine interface. If it did, perhaps we could compromise and have Python do the graphical stuff, and R could do other computations.

I'm curious...if a check produced graphical output, how do you envision it being used? To me, that's a very "INFO" level check that just summarizes some aspect of the package for human viewing. And perhaps we could defer this to a 2.0 release of the engine?

amoeba commented 8 years ago

As we discussed a while ago, Renjin only packages a minimal subset of the packages on CRAN. Unavailable packages struck through:

At the time of that discussion I felt we had agreed this was okay and that we would instruct check writers to use base R facilities wherever possible. If this isn't sufficient, we need to either expose a real copy of R on the same host the MDQ engine is installed on, or do it in a containerized fashion. There are probably other options.

mbjones commented 8 years ago

@mecum, I don't think I was in on that discussion about what packages were needed. Realistically, I don't think base R is sufficient for the range of checks I envision, or at least it would make it a lot harder for check writers. Especially for the INFO checks that will be critical for data checks.

I guess I don't see why we even need an engine like Renjin, when we could be starting up an R environment in a separate process and loading it with the needed data and metadata. Based on our prior experience with trying to execute R code from within Kepler, I think it's a mistake to embed that execution inside of Java if it means we don't have access to the full R environment.

leinfelder commented 8 years ago

I'm still waiting to see what these "range of checks" are that are so darn complex.

Kepler's initial RExpression actor (which is still the default in use today) is pretty crazy. It uses stdin, stdout, and stderr to pass data to and from the R process and does a ton of string parsing to translate between data serializations. It's almost as bad as screen-scraping. On top of that, it requires that R be installed separately from the engine. That's probably not a huge hurdle for us, but it was for Kepler's users.

If you want to go this direction it's very different than the (quite neat and useful) ScriptEngine mechanism we are taking advantage of currently.

Long story short: I would like to see an example of a check that needs a library that is not supported by Renjin, Jython, JavaScript, or Java.

amoeba commented 8 years ago

On one hand, Renjin makes it considerably easier for us to communicate between the Engine and the check's language-specific execution environment. There's no IPC here, it just works. On the other hand, Renjin is limited and these limitations may lead to frustration and confusion for check writers. Does the ease and power of using Renjin give us enough benefit to be okay with this frustration?
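For what it's worth, the in-process model amounts to binding selector results into the script's environment, evaluating the check body, and reading a result back out. A rough sketch (Python standing in for the Java ScriptEngine bindings; `run_check` and the variable names are illustrative, not engine code):

```python
# Hypothetical sketch of the in-process check model. In the real engine
# this is javax.script bindings (Renjin/Jython); here exec() with a shared
# namespace plays the role of ScriptEngine bindings -- no IPC involved.

def run_check(check_code, selector_vars):
    """Bind selector results into the script's namespace, evaluate the
    check body, and read the result back out of the same namespace."""
    env = dict(selector_vars)   # engine -> script: variable bindings
    exec(check_code, env)       # evaluate the check body in-process
    return env.get("result")    # script -> engine: read the result

# Example: a toy check over a selector named "urls"
check = "result = len(urls) > 0"
print(run_check(check, {"urls": ["https://example.org"]}))  # True
```

The appeal is exactly what's described above: the engine and the check share one process and one namespace, so there's no serialization step at all.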

@leinfelder: I think where Matt's going is that R users are all about the extra packages. Sure, they can do their work with base R, but they may not even know how. For even a very simple analysis, scientists load tens of packages (explicit+implicit) beyond base R just to view some data and calculate a statistic. If a new check writer sits down to write a check and the first thing they experience is not being able to write R the way they like to write R, that could be a negative enough experience to make them not want to use our software.

Not having access to dplyr and tidyr, two of the most common packages in basic data analysis, might be a deal-breaker for some users.

I see this thread bringing us toward a decision point. Stick with Renjin or no? If we move away from Renjin to a full-blooded R process, we need to define a mechanism and format for IPC, yeah? Maybe something less hokey than what Kepler does?

gothub commented 8 years ago

Are the kind of checks we are talking about supporting with full R pure data checks (vs metadata or congruency checks)? How much analysis is needed of a dataset (vs just metadata) before a researcher can determine the data is usable?

I think having the answers to these may help us make a better informed decision regarding Renjin vs official R.

amoeba commented 8 years ago

How much analysis is needed of a dataset (vs just metadata) before a researcher can determine the data is usable?

I get that we want to answer this but I don't think this is answerable. We're building a system that supports use cases we haven't seen yet. Building a system that successfully runs the LTER 32-check suite doesn't necessarily get us a system that other people would find useful.

Here's a check that would give us trouble if we stuck with Renjin:

<check>
    <name>all URLs can be resolved</name>
    <selector>
        <name>urls</name>
        <xpath>//gmd:URL</xpath>
    </selector>
    <expected>TRUE</expected>
    <environment>r</environment>
    <code>
        <![CDATA[
      library(httr)
      all(vapply(urls, function(u) status_code(HEAD(u)), numeric(1)) == 200)
    ]]>
    </code>
</check>

This uses the httr library, which can't be used with Renjin, to check that the HTTP HEAD response for every gmd:URL in the document returns status 200. I think it's a pretty reasonable thing to expect someone to write. I don't know if there's a way to work with HTTP requests in base R other than reading and writing to OS sockets (fun!).
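For comparison, since the thread notes the Python checks use urllib, roughly the same check can be written against the Python standard library alone (the `all_urls_ok` helper is mine, not from the engine):

```python
# Hypothetical Python equivalent of the httr check above, using only the
# standard library: every URL must answer an HTTP HEAD with status 200.
from urllib.request import Request, urlopen

def all_urls_ok(urls, timeout=10):
    """Return True iff every URL answers an HTTP HEAD with status 200."""
    for url in urls:
        try:
            resp = urlopen(Request(url, method="HEAD"), timeout=timeout)
            if resp.status != 200:
                return False
        except Exception:   # unreachable host, bad scheme, non-2xx, ...
            return False
    return True
```

Which is part of the asymmetry being discussed: the Python side gets a usable HTTP client for free, while the Renjin side does not.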

gothub commented 8 years ago

I was kinda referring to the earlier statement that we need to see some of these TBD use cases and advanced checks before we can state whether or not Renjin is sufficient. Once we see a use case or check that can't be supported by the current engine, then we have a data point to assist in making a decision, and it looks like your above example is a data point for full R.

amoeba commented 8 years ago

I take your point, and yes, my check cannot be done with Renjin and likely never will be. I could write out a hundred data points, but they might all be straw men. Maybe we need to involve outside users, like you say.

How are we going to get data to the R checks? I see in your Python checks you use urllib. What do we have in R that will work with Renjin?

leinfelder commented 8 years ago

Based on the clamor, I've drafted an alternative R dispatcher. It calls out to Rscript with a script file (the check code with some pre- and post- bits to make sure the variables are available), an input file (variables serialized as JSON), and an output file (the result serialized as JSON). This allows us to use any R installation and whatever packages are installed with it (no Renjin at all).

It seems promising so far; the only thing left to add to this RDispatcher is the ability to chain it before other checks and have the variable bindings transfer over. It can inherit variables from other checks; it just doesn't export them currently.
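The hand-off described here can be sketched roughly as follows (Python standing in for the Java dispatcher; the wrapper layout, the use of jsonlite, and all names are my guesses at the shape, not the actual RDispatcher code):

```python
# Hypothetical sketch of the Rscript hand-off: write the variables to a
# JSON input file, wrap the check code in pre-/post- R code, invoke
# Rscript, and read the check's result back as JSON.
import json
import os
import subprocess
import tempfile

def build_wrapper(check_code, input_json, output_json):
    """Wrap check code so variables load from JSON and the result is written back."""
    return "\n".join([
        "library(jsonlite)",  # assumption: jsonlite handles the JSON (de)serialization
        # pre-: deserialize the variable bindings into the R environment
        f'vars <- fromJSON("{input_json}")',
        "for (name in names(vars)) assign(name, vars[[name]])",
        # the check author's code, verbatim
        check_code,
        # post-: serialize the result for the engine to read back
        f'write(toJSON(result), "{output_json}")',
    ])

def dispatch(check_code, variables, rscript="Rscript"):
    """Run a check in a full R process; return the parsed JSON result."""
    workdir = tempfile.mkdtemp()
    inp, outp, script = (os.path.join(workdir, f)
                         for f in ("in.json", "out.json", "run.R"))
    with open(inp, "w") as f:
        json.dump(variables, f)
    with open(script, "w") as f:
        f.write(build_wrapper(check_code, inp, outp))
    subprocess.run([rscript, script], check=True)
    with open(outp) as f:
        return json.load(f)
```

The trade against the ScriptEngine route is plain in the sketch: every dispatch pays for serialization, a process launch, and temp-file plumbing, but in exchange the check runs against a real R installation with whatever packages it has.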

NB: It's been interesting trying to get Travis CI to pass the tests since there always seem to be differences in R across OS, versions and libs installed...

I hope this satisfies everyone's need for complex R analyses!