collective / transmogrify.webcrawler

transmogrifier source blueprints for crawling html

Could not adapt addinfourl to plone.app.blob.interfaces.IBlobbable #7

Open · datakurre opened this issue 9 years ago

datakurre commented 9 years ago

I tried out the funnelweb.ttw pipeline without any customizations for my Blogger-based blog and got

TypeError: ('Could not adapt', <addinfourl at 4458510240 whose fp = <transmogrify.webcrawler.staticcreator.OpenOnRead instance at 0x109bf7128>>, <InterfaceClass plone.app.blob.interfaces.IBlobbable>)

Yet, because that was caused by the pipeline trying to create files from RSS feed files, I could get past this by simply ignoring RSS URIs (crawler:ignore=feeds).

So I'm not sure if that's a real issue, but if someone else hits it, it could be fixed by adding a proper adapter from addinfourl to IBlobbable (there are a lot of examples in plone.app.blob).
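A minimal sketch of such an adapter might look like the following (untested, modelled on the adapters in plone.app.blob.adapters; the class name is made up and it would still need to be registered for the addinfourl class, e.g. in ZCML):

```python
from urllib import addinfourl  # Python 2, as used by the crawler

from zope.component import adapter
from zope.interface import implementer
from plone.app.blob.interfaces import IBlobbable


@adapter(addinfourl)
@implementer(IBlobbable)
class BlobbableUrlResponse(object):
    """Feed the body of a crawled HTTP response into a blob."""

    def __init__(self, context):
        self.context = context

    def feed(self, blob):
        # Write the response body into the blob's file
        blobfile = blob.open('w')
        try:
            blobfile.write(self.context.read())
        finally:
            blobfile.close()

    def mimetype(self):
        # addinfourl exposes the response headers via info()
        return self.context.info().gettype()

    def filename(self):
        # No obvious filename on the raw response
        return None
```

With something like that registered, the blueprint creating File content could feed crawled responses straight into blob fields instead of failing to adapt.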

djay commented 9 years ago

I suspect funnelweb.ttw is way out of date. I have an unreleased version of transmogrify.webcrawler which gets rid of staticcreator and its own cache and replaces them with requests and a ZODB-based cache.

djay commented 9 years ago

BTW, funnelweb.ttw was originally supposed to be used with the part of mr.migrator which created a UI for transmogrifier inside plone. The problem it faced was that the plone ZODB was used as the cache for all the crawled content, which resulted in a lot of extra data that never got deleted, and also in memory issues since it's all a single transaction. I'm not sure how this new code, which uses a standalone ZODB, is going to interact with a zope server's transactions.

I did have a different idea on how to use funnelweb with standard plone transmogrifier blueprints: a kind of client/server blueprint. A blueprint which opens a connection up to zope and sends zope a pipeline definition (which has to include only blueprints whose code is already in zope). The part outside zope then takes all the data it receives, serializes it into json or pickles, and streams it to zope, which then executes it against the uploaded pipeline definition. I'm not sure about the security of it, but it would remove the need to install and then run complex stuff inside zope just to upload content into plone. And it removes all the complex code of transmogrify.ploneremote without losing any flexibility.
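Roughly, the client half of that idea might look something like this (purely illustrative, nothing here is released code; the option names and the target view are assumptions):

```python
import json

import requests
from zope.interface import implementer, provider
from collective.transmogrifier.interfaces import ISection, ISectionBlueprint


@provider(ISectionBlueprint)
@implementer(ISection)
class RemotePipelineSection(object):
    """Stream every item it sees to a remote Zope view as newline-delimited
    JSON, preceded by the pipeline definition the server should run."""

    def __init__(self, transmogrifier, name, options, previous):
        self.previous = previous
        # URL of the (hypothetical) upload view on the target site
        self.target = options['target']
        # The remainder of the pipeline, as text, to be executed server side
        self.remote_pipeline = options.get('remote-pipeline', '')

    def __iter__(self):
        def body():
            # First line carries the pipeline definition, the rest are items
            yield json.dumps({'pipeline': self.remote_pipeline}) + '\n'
            for item in self.previous:
                yield json.dumps(item, default=str) + '\n'
        # Passing a generator makes requests send a chunked, streaming POST
        requests.post(self.target, data=body())
        # Terminal section: everything has been shipped to the server
        return iter(())
```

The matching view on the zope side would then read the body line by line, register the uploaded pipeline definition and feed the decoded items into it as its source.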

datakurre commented 9 years ago

@djay My intent with using transmogrifier via bin/instance -Osite run is more of a "run once" or "schedule with cron" approach. It might take a lot of memory, and be slower because it starts with an empty ZODB cache, but since it's a completely dedicated process, it will definitely release all memory at the end. It's simple, because all transmogrifier-related code needs to be available only for the dedicated instance script.
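Roughly the kind of run-once script I mean, started with bin/instance run import.py (the site id, the system user and the funnelweb.ttw pipeline id are only there to make the sketch concrete):

```python
# import.py -- run with: bin/instance run import.py
# ``app`` is injected into the script's globals by the instance run command
import transaction
from AccessControl.SecurityManagement import newSecurityManager
from AccessControl.SpecialUsers import system
from zope.component.hooks import setSite
from collective.transmogrifier.transmogrifier import Transmogrifier

newSecurityManager(None, system)   # run as the internal system user
portal = app.Plone                 # assumed site id
setSite(portal)                    # activate the site's component registry

# Run the registered pipeline against the site, then persist the result
Transmogrifier(portal)(u'funnelweb.ttw')
transaction.commit()
```

Since the process exits when the script is done, whatever memory the crawl cache needed is released no matter how big the site was.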

It sounds like you have had more complex use cases. More synchronization than migration.

In the future I'd like to try out transmogrifier with RabbitMQ, so that the first source blueprint/iterator is actually a long-running AMQP consumer and yields a new message from the queue every now and then, depending on the final use case (distributed publishing, content synchronization, etc.).
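As a very rough sketch of what I mean (pika, the queue name and JSON payloads are arbitrary choices for illustration):

```python
import json

import pika
from zope.interface import implementer, provider
from collective.transmogrifier.interfaces import ISection, ISectionBlueprint


@provider(ISectionBlueprint)
@implementer(ISection)
class AMQPSourceSection(object):
    """Source blueprint whose iterator is a long-running AMQP consumer."""

    def __init__(self, transmogrifier, name, options, previous):
        self.previous = previous
        self.queue = options.get('queue', 'transmogrifier')
        self.host = options.get('host', 'localhost')

    def __iter__(self):
        # Pass through anything produced by earlier sections first
        for item in self.previous:
            yield item
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host=self.host))
        channel = connection.channel()
        channel.queue_declare(queue=self.queue, durable=True)
        # Block on the queue and turn each message into a pipeline item
        for method, properties, body in channel.consume(self.queue):
            yield json.loads(body)
            channel.basic_ack(method.delivery_tag)
```

The rest of the pipeline would then run once per message, which is what makes the distributed publishing and synchronization cases interesting.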

djay commented 9 years ago

I run shared zope instances, so I don't like to install site-specific software. Sometimes the beginning of a pipeline involves changing blueprints, but the end is almost always the same. It also helps the case where someone doesn't have access to buildout, or doesn't want to use it. It's not great to have to rebuild the production server just to allow a new import.

djay commented 9 years ago

Being able to run in an instance is nice but doesn't help with most of my cases. It would be good to see this used on plone.org to sync the plone sphinx documentation back into plone so it can be searched on plone.org. As for the client/server model, I will have a go at implementing it on my next content conversion job. Currently ploneremote doesn't handle dexterity content types (because dexterity, in its infinite wisdom, chose an API that is not compatible with either XMLRPC or restricted python, making it impossible to use outside of zope). So I will write a client blueprint that connects, sends the remainder of the pipeline it's a part of, and streams the data, and a zope view that accepts the pipeline and runs transmogrifier with the streamed data as input. Should be reasonably easy, except that I'm not sure zope can accept POST requests as a stream :(

datakurre commented 9 years ago

@djay Sounds complex. Why not just implement an @@xmlrpc view with generic getters and setters (or a generic update method) for dexterity content? It's not as convenient as having XMLRPC setters directly on the object, but it would work. That's the price for Dexterity choosing "pythonic" attribute access over separate setter and getter methods.
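A hedged sketch of what I have in mind (iterSchemata and getFields are real APIs, but the view name, the XML-RPC wiring and any validation or security handling are hand-waved):

```python
from Products.Five.browser import BrowserView
from plone.dexterity.utils import iterSchemata
from zope.schema import getFields


class GenericUpdateView(BrowserView):
    """A (hypothetical) @@dx-update view, callable over XMLRPC with a
    mapping of field name -> value."""

    def update(self, values):
        for schema in iterSchemata(self.context):
            # Adapting to the schema also covers behavior-provided fields
            adapted = schema(self.context)
            for name in getFields(schema):
                if name in values:
                    setattr(adapted, name, values[name])
        self.context.reindexObject()
        return True

    def get(self, names):
        result = {}
        for schema in iterSchemata(self.context):
            adapted = schema(self.context)
            for name in getFields(schema):
                if name in names:
                    result[name] = getattr(adapted, name, None)
        return result
```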

(It would also be possible to fix restricted python support for Dexterity schema-provided fields (I'm sure that at least @jensens would know how to implement it properly), but not for behavior-provided fields :( But as you've mentioned, that would not bring us any more XMLRPC setters.)

djay commented 9 years ago

Yes those things need to be done.

but there are other problems with the ploneremote approach.

datakurre commented 9 years ago

Then DX missing XMLRPC and restricted python support is not blocking you after all. If you need to optimize transactions, you need to do it in instance code in Python.
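For example, a small pass-through section along these lines (all names made up) would commit every N items instead of holding everything in one giant transaction:

```python
import transaction
from zope.interface import implementer, provider
from collective.transmogrifier.interfaces import ISection, ISectionBlueprint


@provider(ISectionBlueprint)
@implementer(ISection)
class CommitEverySection(object):
    """Pass items through unchanged, committing after every N of them."""

    def __init__(self, transmogrifier, name, options, previous):
        self.previous = previous
        self.every = int(options.get('every', 100))

    def __iter__(self):
        for count, item in enumerate(self.previous, start=1):
            yield item
            if count % self.every == 0:
                # Flush the work done so far instead of one huge transaction
                transaction.commit()
```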

Dylan Jay wrote on Fri, 21 November 2014, 11:30:59 GMT+0200:

Yes those things need to be done.

but there are other problems with the ploneremote approach.

  • It's slow. It requires a lot of requests back and forth, especially when working out whether something exists or not, and it requires a lot of transactions.
  • It now has a lot of code which does not have feature parity with the other plone transmogrifier blueprints, such as only updating when modification dates change, or only updating certain fields. This would allow me to use the standard plone blueprints and improve them instead.
