CollaboraOnline / online

Collabora Online is a collaborative online office suite based on LibreOffice technology. This is also the source for the Collabora Office apps for iOS and Android.
https://collaboraonline.com
Other
1.85k stars 702 forks source link

Regression: Opening a frame takes a long time #10042

Closed hfiguiere closed 1 month ago

hfiguiere commented 1 month ago

Describe the Bug

Using the drupal integration (I don't have any other working).

The COOL iframe takes exactly 2 minutes to open (stuck in "load" not displaying anything), vs instantly.

I tracked it down to change I83c09e0a8101931efece42e42e5a8e3abd7e696f that is PR #9777

Steps to Reproduce

  1. Open a COOL frame

Expected Behavior

Loads instantly.

Actual Behavior

It takes minutes to open. The browser (Firefox) stay in a "loading" state.

Screenshots

n/a

Desktop

(Please complete the following information)

Smartphone

(Please complete the following information)

Additional Context

The drupal server is on localhost:8443 and run inside a Docker container. COOL runs with make run on localhost:9980

hfiguiere commented 1 month ago

~I cannot reproduce with the Node sdk example running locally.~

hfiguiere commented 1 month ago

Strike that last, I was testing with the wrong commit (one that was marked good)

hfiguiere commented 1 month ago

(with the original setup) I timed it at 2:00 minutes between the start of the load and the iframe content being displayed.

hfiguiere commented 1 month ago

Seems I can't reproduce with the node sdk example.

sgothel commented 1 month ago

Worked here as well with nodejs .. using https://github.com/CollaboraOnline/collabora-online-sdk-examples/tree/master/webapp/nodejs

.. adding more tests.

sgothel commented 1 month ago

The python / django example seems to reproduce the issue .. .. not working .. unknown reason, see below https://github.com/CollaboraOnline/collabora-online-sdk-examples/tree/master/webapp/python

sgothel commented 1 month ago

The build-in framed doc works though .. http://localhost:9980/browser/dist/framed.doc.html?file_path=/usr/local/projects/collabora/LibreOffice/online/test/samples/writer-edit.fodt

sgothel commented 1 month ago

The python / django example seems to reproduce the issue .. https://github.com/CollaboraOnline/collabora-online-sdk-examples/tree/master/webapp/python

ah .. the python example gives me a WOPISrc validation error, the used /wopi/files/1 n/a ..

wsd-1653059-1653085 2024-09-13 09:52:28.545688 +0200 [ websrv_poll ] INF  #19: parseHeader: Client HTTP Request: POST, uri `/browser/163b4b80/cool.html?WOPISrc=http://0.0.0.0:8000/wopi/files/1` HTTP/1.1, sz[header 679, content 185], offset 679, chunked false, Host: localhost:9980 / User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0 / Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/png,image/svg+xml,*/*;q=0.8 / Accept-Language: en-US,en;q=0.5 / Accept-Encoding: gzip, deflate, br, zstd / Content-Type: multipart/form-data; boundary=---------------------------313513991435228868652782832917 / Content-Length: 185 / Origin: null / DNT: 1 / Connection: keep-alive / Upgrade-Insecure-Requests: 1 / Sec-Fetch-Dest: iframe / Sec-Fetch-Mode: navigate / Sec-Fetch-Site: cross-site / Priority: u=4| net/Socket.cpp:1632
wsd-1653059-1653085 2024-09-13 09:52:28.545696 +0200 [ websrv_poll ] DBG  #19: Handling request: /browser/163b4b80/cool.html?WOPISrc=http://0.0.0.0:8000/wopi/files/1| wsd/ClientRequestDispatcher.cpp:670
wsd-1653059-1653085 2024-09-13 09:52:28.545710 +0200 [ websrv_poll ] WRN  WOPISrc validation error: unencoded WOPISrc [http://0.0.0.0:8000/wopi/files/1] in URL: /browser/163b4b80/cool.html?WOPISrc=http://0.0.0.0:80
caolanm commented 1 month ago

So am I right that in master the nodejs example works?, the python didn't work anyway out of the box before the bisected commit, so what are the steps to reproduce?

sgothel commented 1 month ago

Yes, I can confirm that the SDK nodejs + nodejs-esm works, python not (but also didn't before mentioned patches). The build in iframe test is also working. Current master 9f6782dafff06e6f9229bf6455ca03f9ce92384d And I used same FF 130.0 on GNU/Linux.

I can't reproduce the issue either, please advise @hfiguiere

hfiguiere commented 1 month ago

The Python sample is possibly likely broken, I have never touched/tested it.

The drupal connector is https://github.com/CollaboraOnline/collabora-drupal/

sgothel commented 1 month ago

From your log file .. ERR:

issue-10042$ grep ERR coolwsd.log
wsd-3223553-3223553 2024-09-13 09:49:43.969691 -0400 [ coolwsd ] ERR  Log level is set very high to 'trace' this will have a significant performance impact. Do not use this in production.| wsd/COOLWSD.cpp:4459
wsd-3223553-3223624 2024-09-13 09:49:43.975092 -0400 [ docbroker_001 ] ERR  Failed to stat filesystem [/home/hub/Documents/Collabora/cool-dev/online/jails/3223553-04d88ac2/tmp/cool-uu24BbZsD3k7eC3V/user/docs/0pMi3kARScd5YPxP/hello-world.odp] (ENOENT: No such file or directory)| common/FileUtil.cpp:544
kit-3223619-3223619 2024-09-13 09:49:43.994278 -0400 [ kitbroker_001 ] ERR  non-async dialog triggered| kit/Kit.cpp:2713
kit-3223619-3223619 2024-09-13 09:49:44.035768 -0400 [ kitbroker_001 ] ERR  non-async dialog triggered| kit/Kit.cpp:2713
wsd-3223553-3223622 2024-09-13 09:51:51.982533 -0400 [ websrv_poll ] TRC  Field [checkfileinfo_override] for var [%CHECK_FILE_INFO_OVERRIDE%] = []| wsd/FileServer.cpp:1085
wsd-3223553-3223622 2024-09-13 09:51:52.467312 -0400 [ websrv_poll ] ERR  FileServerRequestHandler: File not found: Invalid URI request (hash): [/browser/44cbb47116/branding.css].| wsd/FileServer.cpp:751
wsd-3223553-3223622 2024-09-13 09:51:52.467350 -0400 [ websrv_poll ] ERR  #37: Attempted to remove: 396 which is > size: 0 clamped to 0| net/Socket.hpp:1264
wsd-3223553-3223592 2024-09-13 09:51:54.674191 -0400 [ prisoner_poll ] ERR  Forkit waitpid failed (ECHILD: No child processes)| wsd/COOLWSD.cpp:3396

.. so right after the timeout, the ERR FileServerRequestHandler: File not found: Invalid URI request (hash) occurred .. FileServer.cpp:751

.. and even before the timeout, a filestat error of an actual (local) file URI (not WOPI)

mmeeks commented 1 month ago

wsd-3223553-3223622 2024-09-13 09:51:52.467350 -0400 [ websrv_poll ] ERR #37: Attempted to remove: 396 which is > size: 0 clamped to 0| net/Socket.hpp:1264

Looks superficially problematic too I guess (?) should be able to see how we got our arithmetic tangled up there ?

caolanm commented 1 month ago

I guess that eraseFirstInputBytes is the call of:

//remove the request body from our input buffer socket->eraseFirstInputBytes(map._messageSize - map._headerSize);

Where map._messageSize is map._headerSize + contentLength where contentLength comes from request.getContentLength which does stoll on the "Content-Length" line, so one possibility is a incorrect/lying Content-Length?

Or if we've already consumed some of the content that we're trying to skip

sgothel commented 1 month ago

Right, emphasized by @caolanm as well.

Probably coming from ClientRequestDispatcher::handleIncomingMessage -> Socket::eraseFirstInputBytes, our known 'friendly issue' (removed shutdown on POST file req ..)

sgothel commented 1 month ago

now you beat me to it @caolanm :) right .. have to check whether content-length in parseHeader is properly validated .. AFAIK it bails-out if having less bytes available. Edit2: StreamSocket::parseHeader code checks available < contentLength and sets map._messageSize += contentLength.

still like to reproduce the issue first, will check with the mentioned drupal setup.

sgothel commented 1 month ago

so the sequence from parseHeader .. seems to match (somehow buggy: header size 396 and content-length -1 (invalid)):

wsd-3223553-3223622 2024-09-13 09:51:52.467268 -0400 [ websrv_poll ] INF  #37: parseHeader: Client HTTP Request: GET, uri `/browser/44cbb47116/branding.css` HTTP/1.1, sz[header 396, content -1], offset 396, chunked false, Host: 172.18.2.14:9980 / User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0 / Accept: text/css,*/*;q=0.1 / Accept-Language: en-CA,en-US;q=0.7,en;q=0.3 / Accept-Encoding: gzip, deflate, br, zstd / DNT: 1 / Connection: keep-alive / Sec-Fetch-Dest: style / Sec-Fetch-Mode: no-cors / Sec-Fetch-Site: same-origin / Priority: u=2| net/Socket.cpp:1340
wsd-3223553-3223622 2024-09-13 09:51:52.467274 -0400 [ websrv_poll ] DBG  #37: Handling request: /browser/44cbb47116/branding.css| wsd/ClientRequestDispatcher.cpp:670
wsd-3223553-3223622 2024-09-13 09:51:52.467282 -0400 [ websrv_poll ] TRC  Fileserver request: /browser/44cbb47116/branding.css| wsd/FileServer.cpp:522
wsd-3223553-3223622 2024-09-13 09:51:52.467312 -0400 [ websrv_poll ] ERR  FileServerRequestHandler: File not found: Invalid URI request (hash): [/browser/44cbb47116/branding.css].| wsd/FileServer.cpp:751
wsd-3223553-3223622 2024-09-13 09:51:52.467335 -0400 [ websrv_poll ] TRC  #37: Wrote 323 bytes of 323 buffered data| net/Socket.hpp:1490
wsd-3223553-3223622 2024-09-13 09:51:52.467350 -0400 [ websrv_poll ] ERR  #37: Attempted to remove: 396 which is > size: 0 clamped to 0| net/Socket.hpp:1264

Edit: So this eraseFirstInputBytes() probably comes from

    // If we succeeded - remove the request header from our input buffer
    socket->eraseFirstInputBytes(map._headerSize);

since the latter one map._messageSize - map._headerSize should be zero and not show such a message?

Edit3: Yes, b/c parseHeader sets map._messageSize = map._headerSize initially and since contentLength == -1 it won't get added to map._messageSize. So 0 == map._messageSize - map._headerSize. Hence the 1st socket->eraseFirstInputBytes(map._headerSize) call triggered.

Edit2:

Edit3: brings me back to need to reproduce this issue first .. will continue.

sgothel commented 1 month ago

The StreamScoket::eraseFirstInputBytes error log occurs as described above due do the already cleared _inBuffer. The latter happened due to FileServerRequestHandler::handlerRequest() line 751 (File not found /browser/6e7c0cb589/branding.css) which can be reproduced in the standalone make run.

Here HttpHelper::sendErrorAndShutdown gets called which signals socket for async shutdown and ignores input, also clearing the _inBuffer. Either we could silence this misleading log or have the caller ClientRequestDispatcher::handleIncomingMessage determine whether the socket is already about to shutdown and/or input-ignored. The latter is not so clear in current Socket implementation.

However, the above does not explain the 2min lag, which I also could not reproduce with given tests.

sgothel commented 1 month ago

Tried to test with https://github.com/CollaboraOnline/collabora-drupal/

sgothel commented 1 month ago
wsd-3578939-3578966 2024-09-16 12:30:09.020552 +0200 [ websrv_poll ] DBG  #19: Handling request: //hosting/discovery| wsd/ClientRequestDispatcher.cpp:670
wsd-3578939-3578966 2024-09-16 12:30:09.020582 +0200 [ websrv_poll ] DBG  #19: Wopi discovery request: //hosting/discovery| wsd/ClientRequestDispatcher.cpp:1023
wsd-3578939-3578966 2024-09-16 12:30:09.020650 +0200 [ websrv_poll ] TRC  #19: Sending back discovery.xml: <wopi-discovery>
    <net-zone name="external-http">
....
</wopi-discovery>| wsd/ClientRequestDispatcher.cpp:1043
wsd-3578939-3578966 2024-09-16 12:30:09.020730 +0200 [ websrv_poll ] INF  #19: Sent discovery.xml successfully.| wsd/ClientRequestDispatcher.cpp:1045
wsd-3578939-3578966 2024-09-16 12:32:09.138724 +0200 [ websrv_poll ] DBG  #19: Closed after reading| net/Socket.hpp:1375
wsd-3578939-3578966 2024-09-16 12:32:09.138754 +0200 [ websrv_poll ] TRC  #19: Closed. Firing onDisconnect.| net/Socket.hpp:1447

@hfiguiere .. looks like this is your 2 minute lag .. yes, drupal seems to load the discovery from COOL but for some reason 2 minutes are used (timeout) between successfully sent the data to closing the connection. Could it be that the drupal code reads until close instead of using content-length (as described earlier)? Our code change keeps the connection open, hence content-length should be utilized.

hfiguiere commented 1 month ago

Looks like the root cause might be caused by https://github.com/CollaboraOnline/collabora-drupal/pull/29/files#diff-bb749b39143d95341efb10d136792f64797f016b12158902a9ec1dddcf562cddR28

hfiguiere commented 1 month ago

Applying the change from the above patch on the Drupal module side no longer trigger the problem.

sgothel commented 1 month ago

Looks like the root cause might be caused by https://github.com/CollaboraOnline/collabora-drupal/pull/29/files#diff-bb749b39143d95341efb10d136792f64797f016b12158902a9ec1dddcf562cddR28

They wrote ..

it can happen that file_get_contents() will hang at the end of a stream, expecting more data. With curl, this does not happen.

indicating that our keep alive connections change indeed were the cause, i.e. not expecting more data but we didn't close and the drupal client also didn't close after receiving the data. With curl this doesn't happen...

Great .. we might find more client issues with this behavioral change.

IMHO this issue has been fully understood now and we can close it? OK?

hfiguiere commented 1 month ago

IMHO this issue has been fully understood now and we can close it? OK?

We need to make sure it's not triggered.

BTW file_get_contents() is not Drupal, it's PHP.

sgothel commented 1 month ago

BTW file_get_contents() is not Drupal, it's PHP.

i.e. file_get_contents() .. thx.

We identified the request as an HTTP/1.1 GET

parseHeader: Client HTTP Request: GET, uri //hosting/discovery HTTP/1.1, sz[header 77, content -1], offset 77, chunked false, Host: localhost:9980 / Connection: close| net/Socket.cpp:1340

and indeed it looks like we do not respect Connection: close, i.e. we do not close it.

The current workaround is functional, but as you hinted, expectation is different with these APIs (not using content-length but a closed connection). Will provide a patch, thank you!

sgothel commented 1 month ago

here my pull request .. resolves the issue for me here. https://github.com/CollaboraOnline/online/pull/10054