ggaughan / pipe2py

A project to compile Yahoo! Pipes into Python (see it hosted on Google App Engine: http://pipes-engine.appspot.com)
http://wiki.github.com/ggaughan/pipe2py
GNU General Public License v2.0

pipe2py running in scraperwiki context #16

Open psychemedia opened 12 years ago

psychemedia commented 12 years ago

I have a minimal example of pipe2py running in process on Scraperwiki at: https://scraperwiki.com/scrapers/pipe2py_test/edit/

I'm not sure what the best way of thinking about pipes on Scraperwiki is. My first thought is that a scraper db could be used to hold a pipe definition (or a set of pipe definitions, where pipes are embedded in other pipes); this would require a one-off 'configuration scrape'/pipe-export phase to get hold of the data from Yahoo! Pipes. The scraper could then be scheduled to compile and run the pipe, dropping the output into a second db table.
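As a minimal sketch of the workflow I have in mind (assuming the classic Scraperwiki Python library's scraperwiki.sqlite.save/select, and with purely illustrative table and column names):

```python
# Sketch only: store a pipe's json definition in a scraper db table
# during a one-off 'configuration scrape', then load it back on a
# scheduled run. Table/column names are illustrative.
import json
import scraperwiki

def store_pipe_definition(pipe_id, pipe_json):
    scraperwiki.sqlite.save(
        unique_keys=['id'],
        data={'id': pipe_id, 'pipe_def': json.dumps(pipe_json)},
        table_name='pipe_definitions')

def load_pipe_definition(pipe_id):
    rows = scraperwiki.sqlite.select(
        "pipe_def from pipe_definitions where id = ?", [pipe_id])
    return json.loads(rows[0]['pipe_def']) if rows else None
```

The scheduled run would then compile the loaded definition with pipe2py and save whatever the pipe emits into a second table.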

ggaughan commented 12 years ago

A few initial thoughts - possibly a combination of all of them is needed.

The pipe2py compile.py build_pipe method (the pipe interpreter) could recursively call parse_and_build_pipe for each embedded pipe, and a corresponding modification should probably be made to the write_pipe and parse_and_write_pipe methods (the pipe compiler). This would allow pipe2py to automatically discover and build (or write) and then import all embedded pipes. This would be useful by itself, but I think such a modified build_pipe method could also be used in Scraperwiki to load and run the full pipe definition directly from Yahoo! Pipes as and when needed.
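Roughly the shape I'm imagining (a sketch only: it assumes embedded pipes appear in the json as modules whose type looks like 'pipe:<id>' and that the definition keeps its modules under a 'modules' key; fetch_pipe_definition is a hypothetical loader, while parse_and_build_pipe is the existing entry point in pipe2py's compile module):

```python
# Sketch: build any embedded pipes a definition references before
# building the pipe itself, caching sub-builds along the way.
from pipe2py.compile import parse_and_build_pipe  # assumed import path

def fetch_pipe_definition(pipe_id):
    # Hypothetical: return the json definition for pipe_id
    # (from Yahoo! Pipes, a db, wherever).
    raise NotImplementedError

def build_pipe_with_subpipes(context, pipe_def, built=None):
    built = built if built is not None else {}
    for module in pipe_def.get('modules', []):
        mtype = module.get('type', '')
        if mtype.startswith('pipe:'):  # assumed marker for embedded pipes
            sub_id = mtype[len('pipe:'):]
            if sub_id not in built:
                sub_def = fetch_pipe_definition(sub_id)
                built[sub_id] = build_pipe_with_subpipes(context, sub_def, built)
    return parse_and_build_pipe(context, pipe_def)
```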

Such a dynamic import would still rely on Yahoo! Pipes responding with the pipe definitions each time, so the next thing would be, as you say, to have a (Scraperwiki?) module that could store and retrieve the pipe json definitions. These could then be stored once, and loaded from Scraperwiki's database and interpreted each time from there. Python has a neat way (of course) of handling such module loading from unusual places - PEP 302. I used such a method in pipes-engine.
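For reference, the shape of such a PEP 302 hook is roughly this (a sketch: the in-memory dict stands in for whatever backend actually holds the generated pipe_* module source):

```python
# Sketch of a PEP 302 meta-path hook serving pipe_* modules from an
# external store; PIPE_SOURCE_STORE stands in for a real backend
# (Scraperwiki db, GAE datastore, S3, ...).
import sys
import types

PIPE_SOURCE_STORE = {}  # module name -> Python source text

class PipeImporter(object):
    def find_module(self, fullname, path=None):
        if fullname.startswith('pipe_') and fullname in PIPE_SOURCE_STORE:
            return self
        return None

    def load_module(self, fullname):
        if fullname in sys.modules:
            return sys.modules[fullname]
        module = types.ModuleType(fullname)
        module.__loader__ = self
        sys.modules[fullname] = module
        exec(PIPE_SOURCE_STORE[fullname], module.__dict__)
        return module

sys.meta_path.append(PipeImporter())
```

Once the hook is registered, a plain import of a pipe_ module (including one triggered from inside another compiled pipe) is satisfied from the store.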

So perhaps what's needed is some kind of plug-in to pipe2py that can load pipe json definitions from a variety of external systems (the various GAE databases, Scraperwiki database, S3, etc.) and also save them there. The existing "load from stdio/file/YQL" and "write to stdout/file" would then become instances of such plug-ins.
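Something with roughly this shape, say (class and method names here are illustrative, not existing pipe2py code; the file store just shows how the current load-from-file behaviour would fit the interface):

```python
# Illustrative plug-in interface for loading/saving pipe json
# definitions, with a file-based store as one instance.
import json
import os

class PipeStore(object):
    def load(self, pipe_id):
        raise NotImplementedError
    def save(self, pipe_id, pipe_def):
        raise NotImplementedError

class FilePipeStore(PipeStore):
    def __init__(self, directory='.'):
        self.directory = directory
    def _path(self, pipe_id):
        return os.path.join(self.directory, 'pipe_%s.json' % pipe_id)
    def load(self, pipe_id):
        with open(self._path(pipe_id)) as f:
            return json.load(f)
    def save(self, pipe_id, pipe_def):
        with open(self._path(pipe_id), 'w') as f:
            json.dump(pipe_def, f)
```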

Output from the pipe could be passed through a final pipe2py output-module, one for each format (currently only json). Though a method for configuring the target of such output-modules would need to be devised - probably via augmenting the calling mechanism.
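For the json case that could be as small as this (sketch only; the writer target would be whatever the calling mechanism configures):

```python
# Sketch of a json output-module: serialise the items a compiled pipe
# yields and hand them to a configurable writer (stdout, a file, a db
# inserter, ...).
import json
import sys

def output_json(pipe, writer=sys.stdout.write):
    # A compiled pipe behaves as an iterator of item dicts.
    for item in pipe:
        writer(json.dumps(item) + '\n')
```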

I can smell meta-pipelines...

psychemedia commented 12 years ago

I started exploring popping the "top level" pipe descriptions for a particular user's public/published pipes into a scraperwiki db table here: https://scraperwiki.com/scrapers/pipe2py_test/ - but it doesn't take nested pipes into account. (An example of running a pipe whose description is stored in a scraperwiki db is here: https://views.scraperwiki.com/run/pipe2py_test_view/ )

I guess, from a sham-developer perspective, it would be handy to be able to set some sort of config parameter along the lines of "pipe_definition_source" that allows a handler to take in a pipe id and return the JSON description from that source. Ideally, if a required pipe description wasn't found, it would be grabbed from Yahoo! Pipes and stored in the specified repository.
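Sketched out, the fallback behaviour I'm after looks something like this (pipe_definition_source being any object with load/save methods, and fetch_from_yahoo a hypothetical stand-in for whatever actually hits the Yahoo! Pipes export):

```python
# Sketch: look in the configured repository first; on a miss, grab the
# definition from Yahoo! Pipes and cache it back into the repository.
def fetch_from_yahoo(pipe_id):
    # Hypothetical: pull the json definition from Yahoo! Pipes.
    raise NotImplementedError

def get_pipe_definition(pipe_id, pipe_definition_source):
    try:
        return pipe_definition_source.load(pipe_id)
    except (KeyError, IOError):  # whatever 'not found' looks like here
        pipe_def = fetch_from_yahoo(pipe_id)
        pipe_definition_source.save(pipe_id, pipe_def)
        return pipe_def
```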

One nice thing about the Scraperwiki environment is that the output of a pipe can be popped into a Scraperwiki table. Scraperwiki also allows for daily/weekly/monthly scheduled runs of a scraper, which means a collection mechanism designed in Yahoo! Pipes can be run locally and regularly on Scraperwiki over an extended period. In this sort of scenario, Yahoo! Pipes is essentially being used as a graphical design studio, with all the actual processing happening on Scraperwiki.

The idea of having a read/write plugin mechanism sounds really sensible. Although it's outside the scope of Yahoo! Pipes itself, providing a complementary mechanism for storing the output of a pipe when it's executed might also be handy? (I guess when handling pipe output there are a couple of things going on: 1) what representation the output is published in (JSON, XML, etc.); 2) where it's published to (e.g. stdout, file, db table).)

Taking a step back to the wider system, I guess there are also different calling strategies: e.g. calling the pipe URL to run it, scheduling it, etc. Again, these are outside the scope of a literal reinterpretation of Yahoo! Pipes, but they are part of its wider use context.

psychemedia commented 12 years ago

In the Scraperwiki view https://views.scraperwiki.com/run/pipe2py_test_view/?pipeid=e6b016246bed2b99958867e85bf1a390 I try to run a pipe with the specified id by grabbing the pipe description from a scraperwiki db (if it can't find the id in the db, I guess it could try to pull it from Yahoo! Pipes?). The example fails because the pipe has a nested pipe: ImportError: No module named pipe_975789b47f17690a21e89b10a702bcbd. What would be handy would be the ability to specify an arbitrary getPipeDescriptionFromID(id) handler function that returns a JSON description of the pipe from wherever.

ggaughan commented 12 years ago

I think I've cracked it: https://gist.github.com/2858288

Put this pipeloadYQL.py module in the pipe2py directory. See the example of how to hook it into Python's module import machinery. Then any import of a pipe_ module will try the disk first and then fall back to importing it from Yahoo! via YQL. This also covers any nested pipe modules - magic!

If this approach works, we can use it as a base for loading pre-saved json modules from SQL databases and anywhere else. Then I'll add it to the pipe2py package.

psychemedia commented 12 years ago

Thanks - I'll give it a go as soon as my machine comes back from the repair shop (using a clunky just-works spare m/c atm). Any updates to the library on Scraperwiki have to go via a request to Francis Irving/@frabcus I think (I can get on to that). The scraperwiki route would be pulling pipes from a scraperwiki db rather than via YQL (e.g. http://blog.ouseful.info/2012/04/10/exporting-yahoo-pipe-definitions-compiling-them-to-python-and-running-them-in-scraperwiki/ ), but I guess I may be able to figure out how to do that from your example...