Alternative backends? - Githubissues

ariya commented 12 years ago

Is it possible to have an alternative backend for PhantomJS instead of QtWebKit?

Specifically, it would be great to have Chromium's WebKit integrated, since it seems to be more feature-rich and stable (QtWebKit renders some pages incorrectly, while Chromium renders the same pages well).

Disclaimer: This issue was migrated on 2013-03-15 from the project's former issue tracker on Google Code, Issue #209. :star2: 7 people had starred this issue at the time of migration.

ariya commented 12 years ago

ariya.hi...@gmail.com commented:

All I can say is that the effort would be non-trivial. However, we should definitely keep an eye (or two) on this.

In any case, we need to increase the team size first before embarking on this adventure. Even once the backend is there, maintaining it carries a significant ongoing cost.

Metadata Updates

Label(s) removed:
- Type-Defect
Label(s) added:
- Type-Enhancement
Milestone updated: FutureRelease (was: ---)

ariya commented 12 years ago

ariya.hi...@gmail.com commented:

There are projects like Chromium Embedded, Berkelium, and Awesomium (closed source) which can be another source of ideas and inspiration.

mralexgray commented 12 years ago

a...@mrgray.com commented:

My 2 cents: I think a good plan of attack would be to build on the Webkit2PNG foundations and use the native Cocoa (or similar) APIs to get the same access to the DOM that is currently achieved through Qt4. A lot of people don't have Qt4, and don't particularly want it, lol. And WebKit has SO MANY native APIs to get at the DOM, a la shouldChangeSelectedDOMRange etc.

I don't spend too much time on this board, so this might be child's play, but the attached script does quite a bit on top of webkit2png. It saves the thumbnails, blah blah blah, but it also creates a nice little XML page and form map, like ...

Small dependencies... try: import Foundation import WebKit import AppKit import objc import urllib

posting it here in case it's of use....

ariya commented 12 years ago

ariya.hi...@gmail.com commented:

The Qt problem is a moot point, especially with the static build script (see issue 197 and issue 142); see also issue 226.

The use of the Mac WebKit library via Cocoa does not help much for non-Mac users. We certainly don't want to end up implementing two versions of the PhantomJS API: one for Mac and one for the rest of the world.

thomasbachem commented 10 years ago

New development: http://blog.qt.digia.com/blog/2013/09/12/introducing-the-qt-webengine/

JamesMGreene commented 10 years ago

@thomasbachem: We concur, this seems like the way to go! We've already been chatting about it on Twitter, and Ariya just posted a new message to the mailing list.

brodock commented 10 years ago

http://blog.qt.digia.com/blog/2014/01/23/qt-webengine-technology-preview-available/

jokeyrhyme commented 10 years ago

GitHub's Atom might offer an easier way to harness Chromium: https://github.com/atom/atom-shell

ariya commented 10 years ago

@jokeyrhyme Hardly changes anything. PhantomJS needs something deeper, i.e. the already-available Chromium Content Shell.

jokeyrhyme commented 10 years ago

@ariya there's the Atom announcement post here: http://blog.atom.io/2014/05/06/atom-is-now-open-source.html

Finally, we're just as excited to be open-sourcing Atom Shell as we are about Atom itself. Over its 2.5 years of development, Atom has been something of a hermit crab, beginning its life in a Cocoa WebView, then migrating to the Chromium Embedded Framework, and finally making its permanent home inside Atom Shell. We experimented briefly with Node-Webkit, but decided instead to hire @zcbenz to build the exact framework we were imagining.

We've taken great care to integrate Chromium and Node in a clean, maintainable way, including sponsoring the addition of multi-context support in Node. We also created brightray and libchromiumcontent, which make it easier to embed Chromium into native applications as a shared library.

Are brightray and/or libchromiumcontent of any use in this endeavour?

ariya commented 10 years ago

@jokeyrhyme Maybe yes, maybe not (too early to say). While I appreciate the information, what is useful for this adventure is not a list of all Chromium-related projects out there. A thorough technical analysis is going to be more valuable.

masi commented 9 years ago

Is QtWebEngine an option? PhantomJS 2.0 builds on top of Qt 5.3, but I had the impression that Qt 5.4 is the latest stable release.

zackw commented 9 years ago

Not being so tied to a particular, patched version of Qt(/Webkit) seems like a necessary prerequisite for being able to swap out the rendering engine more easily.

For what I do with PhantomJS, being able to match the rendering engine of a particular release version of Safari or Chrome would be quite valuable, but that's an even bigger can of worms.

hadim commented 8 years ago

Have you decided yet which backend you'll use in the future? QtWebEngine, Electron, or another?

vitallium commented 8 years ago

@hadim not yet. We don't have time for that.

milianw commented 8 years ago

Hey all,

I just investigated Qt WebEngine [QWE] for the PhantomJS use-case a bit. Here are my notes:

a) QWE cannot be statically built. Since it relies on a multi-process architecture, it would be hard or impossible to get a single phantomjs binary that can be deployed to servers. --single-process mode may help, but this can make things complicated.

b) There is currently no real printing support. It should be simple to get hold of a PDF generated by Chromium and forward that to PhantomJS, though. No idea about PNG etc. screenshots.

c) QWE depends on Qt Quick & QML for the scene graph. This is a big dependency, and drags in OpenGL etc. It could be quite cool to get rid of many PhantomJS parts by just reusing QML (it is a JavaScript runtime after all). But the OpenGL dependency is unfortunate for running PhantomJS in the cloud. This might be remedied via either Mesa software rendering or the commercial 2D painter.

I will extend this once I know more.

Cheers

vitallium commented 8 years ago

@milianw I tried to investigate this too (and I'm still at it). I've found that QWE (and Chromium) doesn't support a full headless mode; can you confirm that? The only way to run it headless is to run it under Xvfb.

Thanks.

milianw commented 8 years ago

@Vitallium yep, headless is out of the question as it depends on OpenGL which in turn depends on XCB etc.

I'll have a look at CEF (Chromium Embedded Framework) now. I have the feeling that it's a better choice for the future of PhantomJS. It's just a minimal wrapper (or so I hope) around Chromium. We don't need most of Qt for PhantomJS; just Chromium should be enough.

Modules like the web browser or file system in PhantomJS can/should probably be replaced/removed. Node.js and others already fill that hole well enough, imo.

The ideal outcome, imo, would be a minimal remote-controlled browser that integrates well with node.js. Do you agree?

vitallium commented 8 years ago

@milianw yes, I do. I think the same way, actually. We don't need Qt at all. Chromium's code base should be enough to handle all our requirements and needs. Ideally, I want to make it like Electron, but with our API and other stuff.

zackw commented 8 years ago

Lemme note down some things about what I use PJS for, what's hard now and what needs to not break. For reference, this is my controller script: https://github.com/zackw/tbbscraper/blob/master/collector/scripts/pj-trace-redir.js (The name is no longer meaningful.)

Things that need to not break:

- I need to be able to log every HTTP request and response during a page load, including full headers, and ideally also data bodies (in both directions). I'm currently using a log format that I made up, but switching to (extended) HAR is on the todo list. I'm not doing it now, but I might in the future want to selectively filter or modify requests. Qt WebEngine, last I looked at it, didn't offer that level of access.
- --load-images=no absolutely must keep working; I'm poking a lot of very sketchy websites and do not want my database seized for copyright violation or whatever.
- I wrote and landed a patch that makes it look like the browser has never even heard of file: URLs; that also needs to keep working.
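
A minimal sketch of the request/response logging described in the first item, assuming PhantomJS's existing `page.onResourceRequested` and `page.onResourceReceived` callbacks. The `attachNetworkLog` helper and the record shape are hypothetical, and as far as I know these callbacks expose headers but not message bodies, so this covers only part of the requirement.

```javascript
// Hypothetical helper: records every request and response seen by a
// PhantomJS-style page object. `sink` is anything with a push() method.
function attachNetworkLog(page, sink) {
    page.onResourceRequested = function (requestData) {
        sink.push({
            kind: 'request',
            id: requestData.id,
            method: requestData.method,
            url: requestData.url,
            headers: requestData.headers
        });
    };
    page.onResourceReceived = function (response) {
        // PhantomJS reports each response in stages ('start', then 'end').
        sink.push({
            kind: 'response',
            id: response.id,
            stage: response.stage,
            status: response.status,
            url: response.url,
            headers: response.headers
        });
    };
    return sink;
}
```

Keeping the sink separate from the page means the same records could later be serialized to HAR or to a custom format without touching the hooks.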

Things that are hard or impossible now:

- I want to be able to log every getaddrinfo() operation, or even better, detailed DNS packet decodes. This currently isn't possible; I spent two days once and couldn't even find where Qt calls getaddrinfo.
- page.onLongRunningJavascript - gosh, it would be nice if that worked.
- It would be nice if the controller script got a notification not only at approximately onload time for the page, but at a point when JavaScript is "done executing" in the page (for some value of "done executing" that isn't the Halting Problem in disguise). Right now I have a hardwired timeout.
- I want to make my scraper as indistinguishable as possible from some concrete released version of a web browser that is actually used by humans. In addition to munging User-Agent, I would ideally like to be using the exact same network library as a real browser, and have no detectable way for page JS to call back into the controller script.
- In order to ensure that every page load is completely isolated from every other page load, in the face of arbitrarily malicious pages -- up to and including WebKit zero-day remote-code-execution attacks -- each PhantomJS process runs under its own Unix uid with no write access to anything but its own home directory, exits after one page load, and a watchdog erases the home directory. (The watchdog will also kill the process if it runs too long.) In order for this to not suck, startup+teardown time for PhantomJS needs to be kept small. As much as I want Chromium-style content process isolation, I'm worried that it will be unacceptably costly.
- Code evaluated in page context by the controller script is a little wonky; I would like it to behave more like code injected into pages by Chrome or Firefox extensions. In particular, it should be insulated from the page's own code screwing with the JS environment, and it should be exempt from cross-origin restrictions.
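
One possible approximation of the "done executing" notification, sketched under the assumption that network quiescence is an acceptable stand-in: fire a callback once no request has been in flight for a quiet period. The `createIdleWatcher` name and its `timers` parameter are hypothetical, and the heuristic cannot see pure computation or page timers that never touch the network.

```javascript
// Hypothetical helper: calls onIdle() once no request has been pending
// for quietMs milliseconds. `timers` is injectable so the logic can be
// exercised without real clocks.
function createIdleWatcher(quietMs, onIdle, timers) {
    timers = timers || {
        set: function (fn, ms) { return setTimeout(fn, ms); },
        clear: function (handle) { clearTimeout(handle); }
    };
    var pending = 0;
    var handle = null;

    function rearm() {
        if (handle !== null) {
            timers.clear(handle);
            handle = null;
        }
        // Only start the quiet-period countdown when nothing is in flight.
        if (pending === 0) {
            handle = timers.set(function () {
                handle = null;
                onIdle();
            }, quietMs);
        }
    }

    return {
        requestStarted: function () { pending += 1; rearm(); },
        requestFinished: function () { pending = Math.max(0, pending - 1); rearm(); }
    };
}
```

In a controller script the two methods would be called from the resource-requested and resource-received hooks; the quiet period replaces a single hardwired timeout with one that restarts whenever the page shows new activity.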

mellon85 commented 8 years ago

I have a use case similar to @zackw's. Besides the DNS queries and load-images (I want to fetch/check those too), I need deep access to the networking system and all the other features, which I have more or less implemented in a long JavaScript scraper.

zackw commented 8 years ago

@Vitallium As soon as you have something that I can help with, please do let me know.

vitallium commented 8 years ago

@zackw well, at the moment I'm just playing with Chromium and WebKit. Each engine has its pros and cons. With WebKit we can guarantee headlessness for users, but it doesn't have OS-specific features like file system or image handling. With Chromium we have everything that we need, but Chromium is insanely huge and complex. Sometimes I have no idea what I'm doing. And Chromium doesn't support headless mode. This is a very important thing.

jokeyrhyme commented 8 years ago

Is it necessary to support environments that do not have Xvfb? I understand having fewer dependencies is preferred, but how is requiring Xvfb for Chromium better / worse than the complexity of batteries-not-included WebKit?

If we can identify an important use case where Xvfb is infeasible, then that seems like a way to exit early from the Chromium approach.

milianw commented 8 years ago

FYI, I'm pushing my WIP to https://github.com/KDAB/phantomjs-cef - of course it is currently not functional at all. But from what I've seen so far, it looks good. There is some sort of offscreen rendering, which I haven't implemented yet. The settings are very extensive, and disabling SSL error checks, image downloading, web security, etc. pp. should work just fine.

Tomorrow, I'll try to get PDF printing done, which I haven't figured out yet. Then, I'll tackle the JavaScript bindings to get the good old PhantomJS behavior up and running again. @Vitallium, or anyone else: if you want to chime in, i.e. add patches - you are more than welcome!

I think getting a first proof of concept done in a scratch repo would be a good idea, then we can think about how to integrate it with the upstream repo.

Some issues I've had so far:

- Building CEF from sources: I tried many times, and it never worked.
- Similarly, I have no idea how to create a static build, or if it is even possible. https://bitbucket.org/chromiumembedded/cef/wiki/LinkingDifferentRunTimeLibraries sounds as if it should be possible.
- There are quite a few dependencies, e.g. on X11. I'm not sure whether this can be disabled when building CEF from sources. Potentially, we might need to depend on Xvfb.

vitallium commented 8 years ago

@milianw You rock! :+1:

I think, since we are going to use Chromium, we don't need to focus on static builds. Let's try with shared first. I'll start playing with CEF from now on.

PS: After a few tweaks we can run it on Windows :-)

JamesMGreene commented 8 years ago

Hell yeah, gentlemen! :clap:

Long live, Phantomium! :ghost: :crown:

vitallium commented 8 years ago

@milianw FYI: I'm working on Windows branch here: https://github.com/Vitallium/phantomjs-cef/tree/windows

milianw commented 8 years ago

@Vitallium: I'm playing around with the JavaScript bindings now, i.e. bootstrap.js and require() etc. I'm realizing that I'd really like to have a cross-platform resource system. The cefclient example does something, but the resources are only embedded on Windows, not on Linux.

That, and the cefclient example using GTK for its OSR rendering, makes me wonder whether we should integrate QtBase with CEF. We all have experience with Qt, and using an unpatched Qt 5 base as an additional dependency alongside CEF for a cross-platform resource system and painting sounds like a good idea to me. I've found https://github.com/joinAero/qtcefclient, which is outdated and apparently Windows-only, but it shows that CEF + Qt is possible.

The advantage over Qt WebEngine is that this does not include Qt Declarative (i.e. Qt Quick + QML) as a dependency.

What do you guys say?

vitallium commented 8 years ago

@milianw I'm playing with it too and I came to the same thing. We really need a cross platform resource system.

I'm thinking of the same system that Node.js has: generate headers that contain all the included modules (bootstrap, require, etc.).
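
A rough sketch of that idea, assuming a Node.js build step: convert a JavaScript source string into a C header that defines it as a byte array, which the C++ side can then compile in. The `toCHeader` function and the generated symbol names are hypothetical and are not taken from Node.js or from any PhantomJS code.

```javascript
// Hypothetical generator: turns JavaScript source text into a C header
// that defines the source as a byte array plus its length.
// Note: an empty source would produce an empty initializer, which a real
// build step would need to special-case.
function toCHeader(symbol, source) {
    var bytes = Buffer.from(source, 'utf8');
    var body = Array.from(bytes).join(', ');
    return [
        '// Generated file, do not edit.',
        'static const unsigned char ' + symbol + '[] = { ' + body + ' };',
        'static const unsigned int ' + symbol + '_len = ' + bytes.length + ';',
        ''
    ].join('\n');
}
```

A build script would read bootstrap.js and friends from disk, call this once per module, and write the results next to the C++ sources.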

About QtBase: if I understand you correctly, you want to add a dependency on QtBase to achieve the following goals:

- a cross-platform resource system
- OSR rendering (painting and other stuff which we need)

Is this correct? If so, I don't mind. But that means we have to integrate and handle an additional message loop that comes with QApplication. We can use the qtcefclient project as a starting point to implement it.

milianw commented 8 years ago

Yes, that is exactly what I have in mind. I'll start playing with QtBase + CEF now, and see how I can integrate the message loops.

vitallium commented 8 years ago

Great. Then I'll start with... Err... I'll find something!

milianw commented 8 years ago

Using RCC is simple, and it should be similarly trivial to use QPainter or Qt OpenGL abstractions where needed. What we don't get, though, is a nice integration between the event loops. For my use case, that isn't really required yet, so I simply don't run Qt's event loop for now.

The big next task will be to get the bootstrap.js and webpage.js to work...

JamesMGreene commented 8 years ago

Is there no equivalent means of achieving the same as what QtBase provides in Node.js? I'd really love to see Phantomium be closer to pure Node.js + CEF with a merged V8 event loop so that we can enable all consumers to use standard Node.js modules and paradigms rather than having to learn the quirks of a Qt environment... but, admittedly, I don't fully understand what QtBase is providing us.

milianw commented 8 years ago

Potentially, one could investigate how to wrap CEF in a node module. Instantiating CEF and spawning its subprocesses from a thread may work.

But right now, I just want to get something done, and as quickly as possible. Having to learn node.js internals would hold me up more. I use QtBase currently for:

- a cross-platform resource system, i.e. compiling JavaScript code into the binary
- some containers, since I'm seeing very odd crashes when using STL: http://www.magpcss.org/ceforum/viewtopic.php?f=6&t=13543
- QUrl to create a proper file:// URL for the script argument path
- QJson to parse JSON, required for the custom IPC

All of the above may, or may not, work with node. Esp. if it's using the STL (which it hopefully does!), then it may lead to the odd crash I note above...

In the future, I will also use Qt to implement the offscreen rendering, and I doubt node has anything to offer in that regard.

milianw commented 8 years ago

Now, having answered the above, here is a quick status update:

I finally got webpage.open, close, evaluate implemented! A big caveat is that the synchronous API to evaluate JavaScript in a webpage is not supported anymore. Chromium, like WebKit2, and thus also CEF, is using a multiprocess architecture for stability and performance reasons. IPC is inherently asynchronous, and I'm reluctant to add blocking API like page.evaluate.

Instead, I opted for a completely async approach, i.e. stuff like

page.evaluate(
    function() { return window.location.hostname; },
    function(ret) { console.log(ret); },
    function(errorCode, errorMessage) { console.log(errorCode, errorMessage); }
);

You guys are all better JS developers than me. So: What's the current best practice to design async API in JavaScript? Is the above good enough? Or should one rather apply some continuation pattern with .then()? Should it be two callbacks for success/error, or one to handle both?

By slightly adapting the followers.js example, I could already run it with the cef-phantomjs, which is pretty neat I think.

jokeyrhyme commented 8 years ago

Promises are in ECMAScript and Node.js now, and many upcoming improvements to W3C Web Platform APIs like fetch() and getUserMedia() are Promise-based.

That said, the CommonJS pattern where a single callback is passed, with an Error object as the first argument in case of error, is a pretty expected pattern within the Node.js community. E.g.

page.evaluate(
    function() { return window.location.hostname; },
    function(err, ret) {
        if (err) { /* TODO: handle error */ return; }
        console.log(ret);
    }
);

If possible, it'd be terrific to support both. I personally prefer Promises, but as the intended use of this is within the Node.js community, I think the error-first callback is probably the mandatory pattern here. It is possible to author functions in a way that both return a Promise and accept a callback. And I think there are even utility libraries that facilitate this style.
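
A minimal sketch of a function that supports both styles, assuming the underlying work is already available as a Promise-returning function. The `promiseOrCallback` name is hypothetical and not part of any PhantomJS API.

```javascript
// Hypothetical wrapper: `task` is any function that returns a Promise.
// If the caller passes a callback, it is invoked error-first;
// otherwise the Promise is returned for chaining.
function promiseOrCallback(task, callback) {
    var promise = task();
    if (typeof callback !== 'function') {
        return promise;
    }
    promise.then(
        function (value) { callback(null, value); },
        function (err) { callback(err); }
    );
    return undefined;
}
```

An API method would call this with its own Promise-producing work and the caller's optional last argument. One limitation of this simple form: an exception thrown inside the callback surfaces as an unhandled rejection rather than a synchronous error.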

milianw commented 8 years ago

Thanks for the hint, @jokeyrhyme! Promises work a treat:

https://github.com/KDAB/phantomjs-cef/blob/master/examples/load_promise.js

milianw commented 8 years ago

Just a heads up: The last week was pretty productive in phantomjs-cef land and most important features have landed: render, renderBase64, sendEvent, evaluate, injectJs, ...

I especially like how well PhantomJS(-CEF) works with a Promise driven API:

https://github.com/KDAB/phantomjs-cef/blob/master/examples/tui.js

Note how there are no explicit timeouts, rather DOM polling is wrapped in a Promise via https://github.com/KDAB/phantomjs-cef/blob/master/examples/libs/waitForDomElement.js and the new page.waitForLoaded() also uses a promise to wait until a page has finished loading after submitting the form.
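
For readers who don't want to open the link, the general pattern of wrapping polling in a Promise can be sketched as follows. This is not the contents of waitForDomElement.js; `pollUntil` and its parameters are hypothetical, and the attempt limit is an assumption added so the sketch cannot loop forever.

```javascript
// Hypothetical helper: calls check() every intervalMs until it yields a
// truthy value, then resolves with that value. check() may return a plain
// value or a Promise. Rejects after maxAttempts unsuccessful checks.
function pollUntil(check, intervalMs, maxAttempts) {
    return new Promise(function (resolve, reject) {
        var attempts = 0;
        function tick() {
            attempts += 1;
            Promise.resolve()
                .then(check)
                .then(function (result) {
                    if (result) {
                        resolve(result);
                    } else if (attempts >= maxAttempts) {
                        reject(new Error('gave up after ' + attempts + ' attempts'));
                    } else {
                        setTimeout(tick, intervalMs);
                    }
                })
                .catch(reject);
        }
        tick();
    });
}
```

Because `check` may itself return a Promise, the same helper works when the DOM query has to go through an asynchronous evaluate call.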

I'll probably spend a bit of time on Windows support over the next few days. In general, I think this is a very promising result already, and I invite more people to join the effort.

milianw commented 8 years ago

If anyone wants to test PhantomJS-CEF on Windows, I just pushed a first build: https://github.com/KDAB/phantomjs-cef/releases/tag/v0.1.0-alpha

Use at your own risk, of course ;-) But I'd appreciate any feedback.

vitallium commented 8 years ago

Hey! Good job! I have a Windows build too. But, hell, that was a really busy week for me. Now I'll help as much as I can. Cheers!

But one question: what about an OS X build? I'm not an expert in it.

milianw commented 8 years ago

@Vitallium you have a lot of Windows experience, right? Could you have a look at the debug build of PhantomJS-CEF on Windows? See: http://www.magpcss.org/ceforum/viewtopic.php?f=6&t=13578&p=28331#p28331

I built it against the FOSS Qt 5.5.0 release (msvc2013_64) using

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug -GNinja ..
ninja
phantomjs-cef\phantomjs ..\examples\load_promise.js

and it will assert (see forum message). Can you reproduce that issue? Do you know what's going on there? Maybe we'll need to build CEF from sources or something?

vitallium commented 8 years ago

@milianw so far I use a debug version, and I don't see any assertions, except the one on the exit. But let me try a fresh copy of your repository.

milianw commented 8 years ago

Build was updated to use a static Qt and MSVC runtime.

ariya commented 6 years ago

Given the limited resources, looks like we're stuck with QtWebKit for the foreseeable future.

ariya / phantomjs

Alternative backends? #10209