Crystalix007 / CutyCapt

A Qt WebKit Web Page Rendering Capture Utility
http://cutycapt.sourceforge.net/

Cutycapt QtWebengine #2

Open Crystalix007 opened 3 years ago

Crystalix007 commented 3 years ago

Issue by RvdHout Friday Jun 29, 2018 at 13:05 GMT Originally opened as https://github.com/hoehrmann/CutyCapt/issues/25


Are there any plans to convert CutyCapt to use (the new) QtWebEngine instead of QtWebKit, or has anyone actually tried?

Crystalix007 commented 3 years ago

Comment by Crystalix007 Monday Jun 22, 2020 at 13:23 GMT


See Crystalix007/CutyCapt for a version using the new QtWebEngine. Enough features needed to be removed / changed that I don't feel comfortable submitting a pull request; however, it does function for what I've tested it with.

Crystalix007 commented 3 years ago

Comment by RvdHout Monday Jun 22, 2020 at 18:24 GMT


I have built your version with Qt Creator using msvc2017 (64-bit). Most options seem to work; the only thing I noticed is that I can't output a PDF. Can you confirm --out-format=pdf doesn't work?

Crystalix007 commented 3 years ago

Comment by Crystalix007 Monday Jun 22, 2020 at 20:28 GMT


Please see the latest commit. I was quitting too early (before the PDF ever had a chance to render), because I wasn't aware that the PDF rendering was asynchronous.

Crystalix007 commented 3 years ago

Comment by RvdHout Tuesday Jun 23, 2020 at 15:51 GMT


Yes, that seems to work. I noticed you switched to printToPdf. Can the old method not be used anymore? I'm asking because of (optional/additional) paper size, orientation, quality and margin params.

FYI, the --user-agent param seems to be ignored.

Crystalix007 commented 3 years ago

Comment by Crystalix007 Tuesday Jun 23, 2020 at 15:56 GMT


It would definitely work with the previous print method, but printToPdf also has a lot of configurability (see Qt docs). I just haven't added CLI options for it. Since I don't personally need them, I'm not likely to implement them.

I don't see why useragent is being ignored. The processing is still there. Potentially I mucked something up when configuring the WebEngine object however.

Crystalix007 commented 3 years ago

Comment by RvdHout Tuesday Jun 23, 2020 at 15:59 GMT


I think --user-agent needs QWebEngineProfile, doesn't it? https://doc.qt.io/qt-5/qwebengineprofile.html#httpUserAgent

Crystalix007 commented 3 years ago

Comment by Crystalix007 Tuesday Jun 23, 2020 at 16:22 GMT


Yep. Just fixed it so that a user agent is set in more than just the member variable of CutyPage.

Crystalix007 commented 3 years ago

Comment by RvdHout Wednesday Jul 29, 2020 at 10:59 GMT


Hi @Crystalix007, me again. I noticed some issues with the way you made PDF rendering work again. It seems you use the --max-wait parameter for this, right? Now let's say I set the --max-wait parameter like this: --max-wait=10000

Every PDF generated then takes 10 seconds to complete, even if it is created much faster (a very simple PDF made from a webpage with a small table). Now I can simply lower the --max-wait parameter, but that might interfere with other timeouts within CutyCapt. Isn't there a better way to check that PDF generation is complete?

Crystalix007 commented 3 years ago

Comment by RvdHout Wednesday Jul 29, 2020 at 13:02 GMT


--max-wait=<ms> Don't wait more than (default: 90000, inf: 0)

Btw, this also interferes with --max-wait=0 (infinite)

Crystalix007 commented 3 years ago

For the --max-wait parameter, it doesn't currently detect when the full page has finished the layout.

In the port to QtWebEngine, the signal emitted when the page has finished laying out its elements no longer exists. The solution, supposedly (from what I've read), is to introduce websockets on the WebEngine side and then inject JS into the page that notifies the WebEngine once layout is complete.

This is a rather large change, and not one I plan to do in the near future.

The max-wait of zero here currently causes it to wait forever, I believe, but this can't really be fixed without the websockets. The page just renders as blank if you try and render it before the elements have been laid out.

RvdHout commented 3 years ago

QWebEngineView::loadFinished()?

The loadFinished() signal is emitted when the view has been loaded completely. Its argument, either true or false, indicates whether loading was successful or failed.

But that isn't really the issue I encounter (I think), as png, jpg and other formats do get captured quick(er) as opposed to pdf. PDF generation seems to use the --max-wait parameter as its trigger. How do png and jpg know the DOM is loaded?

Crystalix007 commented 3 years ago

AFAIK, loadFinished merely signals that the document has finished downloading all its remote resources; it does not guarantee that they have been rendered.

On master, I see no difference between how quickly pdf vs png & jpg render, because the file formats should not differ in how they handle the process until they begin saving (i.e. after the timeout has expired). Perhaps you simply see a difference in processing / saving speed?

If you want to explore the dangerous version which doesn't wait for the elements to all render (only for QWebEngineView::loadFinished()), you can try the no-render-delay branch. This may work for you for PDFs, but for images, it tries to render way too early. I personally wouldn't risk it if you need to make sure all the content rendered.

RvdHout commented 3 years ago

I'm (almost) certain the delay is caused somewhere, and I expect it to be somewhere around where pdfPrintingFinished() is called. I tried generating the same PDF with various --max-wait values, which resulted in the same rendered PDF, only with different render times (and the render time is around whatever is set as --max-wait).

When printToPdf is called --max-wait seems to be --min-wait (maybe that makes it clearer?)
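A rough way to check this hypothesis is to time single runs at several --max-wait values and see whether the elapsed time tracks the value. The following is only a sketch: elapsed_s is a hypothetical helper, the binary path ./CutyCapt and the page file:///tmp/CC.html are assumptions, and the timing has whole-second resolution only.

```shell
# Hypothetical helper: print the whole seconds a command took to run.
elapsed_s() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  end=$(date +%s)
  echo $(( end - start ))
}

# Assumed binary location; override with CUTYCAPT=/path/to/CutyCapt.
CUTYCAPT=${CUTYCAPT:-./CutyCapt}

# If --max-wait really behaved like a minimum wait, the elapsed time
# would grow with the value (roughly 1 s, 5 s, 10 s).
for wait_ms in 1000 5000 10000; do
  secs=$(elapsed_s "$CUTYCAPT" --url='file:///tmp/CC.html' --out=render.pdf --max-wait="$wait_ms")
  echo "--max-wait=$wait_ms: ${secs}s"
done
```

If the three timings come out roughly equal, the wait is not being driven by --max-wait.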

I will give your other branch a shot

RvdHout commented 3 years ago

I see no difference in performance using the no-render-delay branch; it still seems to use --max-wait as --min-wait.

EDIT: if I omit --max-wait it seems faster, which is strange, as the default --max-wait value is way higher than the one I am using. This is getting weirder and weirder.

Crystalix007 commented 3 years ago

On the no-render-delay branch, where /tmp/CC.html is this issue page:

Running time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.png >/dev/null 2>&1 ; done ), I get 75.01s user 14.86s system 84% cpu 1:46.97 total.

On the same branch with --max-wait=10000:

time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.png --max-wait=10000 >/dev/null 2>&1; done ): I get 74.86s user 14.66s system 83% cpu 1:47.53 total.

Trying the same for pdf output:

time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.pdf --max-wait=10000 >/dev/null 2>&1; done ): I get 80.27s user 14.92s system 69% cpu 2:17.72 total.

I then tried the same on the master branch:

time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.png --max-wait=10000 >/dev/null 2>&1; done ): I get 73.96s user 14.53s system 83% cpu 1:45.40 total

Trying on master with pdf:

time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.pdf --max-wait=10000 >/dev/null 2>&1; done ): I get 83.84s user 14.71s system 68% cpu 2:23.67 total.

time (for i in {1..100}; do ./CutyCapt --url='file:///tmp/CC.html' --out=render.pdf --max-wait=1000 >/dev/null 2>&1; done ): I get 83.82s user 14.66s system 68% cpu 2:23.76 total.

So the PDF is generally slower at saving I guess, but doesn't really seem to wait the full time. Also, given that PDF and PNG times are so close, I don't think I can agree about the max-wait delay being different for different file sizes.

Theoretically, with 100 iterations each taking up to 10 seconds, that could take up to ~ 1000 seconds, which rounds to 17 minutes, so I don't think it's waiting the max time for any of them.
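The arithmetic can be spelled out using the numbers reported above (100 iterations, --max-wait=10000, and the 2:23.67 wall-clock total of the PDF loop on master); the variable names are only for illustration.

```shell
# Worst case: every run waits the full --max-wait before saving.
iterations=100
max_wait_ms=10000
worst_case_s=$(( iterations * max_wait_ms / 1000 ))

# Observed wall-clock total for the PDF loop on master: 2 min 23.67 s.
observed_s=$(awk 'BEGIN { printf "%.2f", 2 * 60 + 23.67 }')

# Average wall-clock time per CutyCapt invocation.
per_run_s=$(awk -v t="$observed_s" -v n="$iterations" 'BEGIN { printf "%.2f", t / n }')

echo "worst case: ${worst_case_s}s, observed: ${observed_s}s, per run: ${per_run_s}s"
```

That works out to about 1.44 s per run against a 10 s worst case, so each invocation finishes well before the timeout.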