datopian / datapipes

Data Pipes for CSV
https://datapipes.okfnlabs.org/
MIT License

Arbitrary map/filter functions #21

Open rossjones opened 11 years ago

rossjones commented 11 years ago

(was: Provide JS sandbox for user-specified filter functions)

It would be great if users could provide a filter function to be executed on each row.

This would be more powerful than grep, as it could take into account values in other cells. Something similar could also map a new column onto the table using a user-specified function (for example).
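To make the idea concrete, here is a minimal sketch of a row filter that looks at more than one cell, plus a map that adds a derived column. The sample data, column layout, and the function names `keepLargeGbp` and `addAmountInPence` are all made up for illustration; nothing here is an existing datapipes API.

```javascript
// Hypothetical rows as arrays of cell values; the header is kept separately.
const header = ["name", "amount", "currency"];
const rows = [
  ["Acme", "1200", "GBP"],
  ["Bolt", "80", "GBP"],
  ["Cog", "450", "USD"],
];

// A filter that inspects two cells at once, which a per-line grep
// cannot express cleanly.
function keepLargeGbp(row) {
  return row[2] === "GBP" && Number(row[1]) > 100;
}

// A map that appends a new column derived from an existing cell.
function addAmountInPence(row) {
  return [...row, String(Number(row[1]) * 100)];
}

const filtered = rows.filter(keepLargeGbp);
const mapped = filtered.map(addAmountInPence);
const outHeader = [...header, "amount_pence"];

console.log(outHeader.join(","));
mapped.forEach((row) => console.log(row.join(",")));
```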

Something like http://gf3.github.io/sandbox/ looks like a reasonably good solution for JS. This particular one would be in-process, but I can imagine other languages being allowed to run code over the rows in a different type of sandbox.

rufuspollock commented 11 years ago

Really like this, and this was, in fact, the original idea (arbitrary transforms). How do we pass these in? My guess would be to allow users to point to a JS file on the web (e.g. a gist) which contains the code to run.

davidmiller commented 11 years ago

So an api along the lines of

http://localhost:5000/csv/head/html/transform http://google.com/some.js/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

Something like requiring the file to define a transform function in CommonJS[1] module style, to which we will pass two arguments: the row and the index.

Returning null will exclude the row from further transformations, and move on to the next row in the stream.

[1] http://www.commonjs.org/
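As a rough sketch of that contract: the remote file exports `transform(row, index)`, and the host drops any row for which it returns null. The `applyTransform` helper is hypothetical and only stands in for whatever the pipeline would actually do; sandboxing and fetching the file are left out.

```javascript
// What a user-supplied file might look like (contents are illustrative).
function transform(row, index) {
  if (index === 0) return row;           // pass the header row through untouched
  if (row[1] === "") return null;        // null excludes the row from later steps
  return row.map((cell) => cell.trim()); // otherwise tidy up every cell
}

// CommonJS-style export, guarded so the snippet also runs where `module` is absent.
if (typeof module !== "undefined") module.exports = { transform };

// Hypothetical host-side helper: call the function on each row in order
// and skip rows for which it returns null.
function applyTransform(rows, fn) {
  const out = [];
  rows.forEach((row, index) => {
    const result = fn(row, index);
    if (result !== null) out.push(result);
  });
  return out;
}
```

For example, `applyTransform([["a", "b"], [" x ", ""], [" y ", " 2 "]], transform)` keeps the header, drops the second row, and trims the third.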

rossjones commented 11 years ago

It would be even more awesome if I could post the script to an endpoint ( /install perhaps) with

{
    "language": "javascript",
    "name": "transform",
    "code": "....."
}

so that I can then do

http://localhost:5000/csv/head/html/transform/?url=http://static.london.gov.uk/gla/expenditure/docs/2012-13-P12-250.csv

This'd let me 'install' code for re-use by myself and others.
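A very rough sketch of what the server side of such an install step might do with that payload, assuming an in-memory store and a single global namespace. The `install` and `lookup` functions, the name pattern, and the validation rules are all assumptions for illustration; none of this exists in datapipes.

```javascript
// Hypothetical in-memory registry mapping an installed name to its script.
const registry = new Map();

// Validate a payload of the shape { language, name, code } and store it.
function install(payload) {
  const { language, name, code } = payload;
  if (language !== "javascript") {
    throw new Error(`unsupported language: ${language}`);
  }
  // Assumed rule: names must be safe to use as a URL path segment.
  if (typeof name !== "string" || !/^[a-z][a-z0-9_-]*$/.test(name)) {
    throw new Error(`invalid name: ${name}`);
  }
  if (typeof code !== "string" || code.length === 0) {
    throw new Error("missing code");
  }
  if (registry.has(name)) {
    throw new Error(`name already taken: ${name}`);
  }
  registry.set(name, { language, code });
  return name;
}

// Resolve a path segment such as "transform" to an installed script, or null.
function lookup(name) {
  return registry.get(name) ?? null;
}
```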

Agree that having a function signature to be implemented like @davidmiller suggests would be good. Suspect it'll also need a cookie/user var to allow scripts to maintain state without globals (for instance to store the tail buffer). Perhaps something like ...


// Called before the first row is sent, expected to return some indication
// that it wants to continue (or perhaps might be skipped).  Cookie provided
// here for state, will be passed to all other funcs.  Should also be passed 
// url args and then store them in the cookie.
function start(cookie, args) {}

// Called on each row
function transform( cookie, idx, row ) {}

// Called after all rows finished.  In some cases (tail perhaps) this is where 
// the actual data will come from, but would expect normally for result of 
// transform to be the thing that is piped across to the next function.
function end(cookie) {}
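
For instance, a tail op could be written against those three hooks roughly like this. The `run` driver and the `n` argument are assumptions made up for the sketch; it only shows how the cookie would carry the buffer between calls.

```javascript
// Keep the last n rows in the cookie instead of in globals.
function start(cookie, args) {
  cookie.n = Number(args.n) || 10; // assumed url arg, defaulting to 10
  cookie.buffer = [];
  return true; // signal that we want to continue
}

// Called on each row: remember it, but emit nothing yet.
function transform(cookie, idx, row) {
  cookie.buffer.push(row);
  if (cookie.buffer.length > cookie.n) cookie.buffer.shift();
  return null;
}

// Called after all rows: this is where tail's output comes from.
function end(cookie) {
  return cookie.buffer;
}

// Hypothetical driver showing the order in which the hooks would be called.
function run(rows, args) {
  const cookie = {};
  const out = [];
  if (!start(cookie, args)) return out;
  rows.forEach((row, idx) => {
    const result = transform(cookie, idx, row);
    if (result !== null && result !== undefined) out.push(result);
  });
  const trailing = end(cookie);
  if (Array.isArray(trailing)) out.push(...trailing);
  return out;
}
```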

davidmiller commented 11 years ago

> It would be even more awesome if I could post the script to an endpoint

Oh, you mean scraperwiki 1.5 ? ;)

That increases complexity by an order of magnitude (have to manage namespaces/accounts/global function registry) while increasing utility only a bit, i.e. it's a nicer API. (Which it totally is)

OTOH running from a URL in a sandbox becomes significantly easier to implement, and we can figure out if people really use the feature.

POSTing to a gist/pastebin/(your publish-text-on-the-internet service here) sounds like a simplest-thing-that-could-work halfway house to me.

rufuspollock commented 11 years ago

@rossjones I also thought about the install stuff ;-) Issue is we start having login and storage somewhere but not that difficult (I'd do my usual thing at the moment and do github login + storing in a gist).

However, my concerns were similar to @davidmiller's, namely the increase in complexity compared to the increase in benefit. Given the KISS principle, a first pass would, I think, be to not allow storing scripts, i.e. it's up to the user to store them somewhere.

This seems pretty straightforward to implement and would be pretty awesome ;-)

davidmiller commented 11 years ago

> a cookie/user var to allow scripts to maintain state without globals

So AMD gives you closured globals if/when you need 'em, but there is some manual taking care you'd have to do...

Passing around (and us keeping track of) each function's state/scope object (hereafter known as "The Angular.js Pattern") is, you know, a bit of a faff, with the only real benefit being the dependency injection benefits for your unit tests.

And we all expect that unit tests are going to be ubiquitous for this kind of thing rite? ;)

One alternative recipe would be to require the exported transform to be an object containing the methods

(and we force the scope of this for them ) (hereafter known as "The Backbone.js Pattern")

The User then gets to do their own state management in a constructor, and I no longer have to care/know about it/can't interfere :)

Other patterns are available :) Although that'd be my preference right now
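
A minimal sketch of that object-with-methods recipe, assuming the user exports a constructor and the host instantiates it and calls each method with `this` forced to the instance. The `RowCounter` class, the optional `end` method, and the `runWithInstance` helper are all illustrative names, not an agreed API.

```javascript
// User side: state lives on the instance, set up in the constructor.
class RowCounter {
  constructor() {
    this.seen = 0;
  }

  // Append a running count to every row.
  transform(row, index) {
    this.seen += 1;
    return [...row, String(this.seen)];
  }

  // Optionally emit trailing rows once the stream is finished.
  end() {
    return [["rows", String(this.seen)]];
  }
}

// Host side (hypothetical): create the object once, then call its methods
// with `this` explicitly bound, so the host never touches the user's state.
function runWithInstance(rows, Transformer) {
  const instance = new Transformer();
  const out = [];
  rows.forEach((row, index) => {
    const result = instance.transform.call(instance, row, index);
    if (result !== null) out.push(result);
  });
  if (typeof instance.end === "function") {
    out.push(...instance.end.call(instance));
  }
  return out;
}
```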

rossjones commented 11 years ago

I keep forgetting most JS stuff doesn't need to be re-entrant.

Maybe worth just having a 'gist' op that takes the ID as a parameter?
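
Roughly, such an op could resolve the ID via the GitHub gists API (`GET https://api.github.com/gists/{gist_id}`, whose response lists files with their `content`). The helper names below and the choice of "first `.js` file wins" are assumptions for the sketch, and the actual fetch is left out so the snippet runs offline.

```javascript
// Build the API URL for a gist ID taken from the pipeline path.
function gistApiUrl(id) {
  // Assumed check: gist IDs are hex (or older numeric) strings.
  if (!/^[0-9a-f]+$/i.test(id)) {
    throw new Error(`invalid gist id: ${id}`);
  }
  return `https://api.github.com/gists/${id}`;
}

// Pick the code to run out of a parsed gists API response.
// Assumption: the first file whose name ends in .js is the transform.
function firstJsFile(gist) {
  const files = Object.values(gist.files || {});
  const js = files.find((file) => file.filename.endsWith(".js"));
  return js ? js.content : null;
}

// Hand-written sample in the shape of an API response, for illustration only.
const sampleGist = {
  files: {
    "README.md": { filename: "README.md", content: "# notes" },
    "transform.js": { filename: "transform.js", content: "exports.transform = (row) => row;" },
  },
};

console.log(gistApiUrl("abc123"));
console.log(firstJsFile(sampleGist));
```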

rufuspollock commented 11 years ago

@rossjones huge +1 for the simple gist op with id as parameter (super clean ...).

@andylolz this might be the most fun thing to implement and it's super cool ;-)

andylolz commented 11 years ago

V cool indeed! Will make a start on this one next. soon!