SockJS issue through many layers of proxying (debugging)

hexylena commented 7 years ago

I'm hitting this when trying to deploy to rancher. Opening this issue mostly to track my debugging.

My setup looks like:

User makes request to https://fqdn/apollo-dev/
Apache on external machine routes to http://localhost:9999/apollo-dev/ (lb)
lb on localhost is actually in a container, this proxies to http://:8080/apollo-dev/

I'm currently not sure whether websockets are functional at all, or how I'd test that; the request fails before it gets that far. I wonder if I can "get away" with X-Forwarded-For type headers and whether that will fix this?
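On the "how would I test websockets" question: a minimal sketch of a handshake-only check, using nothing but the Python standard library. It sends the HTTP Upgrade request defined in RFC 6455 and reports whether the server answers `101` with the matching `Sec-WebSocket-Accept` value. The function names are made up for illustration, and the path in the usage note is only a guess modelled on the URLs in this thread; it does not speak STOMP or SockJS framing, it only shows whether the upgrade survives the proxy chain.

```python
import base64
import hashlib
import os
import socket

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Return the Sec-WebSocket-Accept value a compliant server sends for `key`."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_handshake(host_header: str, path: str, key: str) -> bytes:
    """Build the raw HTTP/1.1 Upgrade request for a WebSocket handshake."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def check_websocket(host: str, port: int, path: str, timeout: float = 5.0) -> bool:
    """Return True if host:port answers the handshake with 101 and the right accept key."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_handshake(f"{host}:{port}", path, key))
        response = sock.recv(4096).decode("latin-1")
    status_line, _, rest = response.partition("\r\n")
    print(status_line)  # e.g. the 400 or 500 a proxy hands back instead of 101
    return status_line.split(" ")[1:2] == ["101"] and expected_accept(key) in rest
```

Running something like `check_websocket("localhost", 8080, "/apollo-dev/stomp/084/yf3adjwl/websocket")` from each hop (apollo container, lb container, Apache host) would show at which layer the upgrade stops working. This only covers plain `ws://`; a `wss://` endpoint would need the socket wrapped with `ssl` first.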

apollo container: can curl OK
- curl 'http://localhost:8080/apollo-dev/stomp/084/yf3adjwl/xhr' -X POST -H 'Accept: */*' -H 'Cookie: JSESSIONID=......;' -i | head
- 200/OK, 'o' in body.
lb container: curl FAIL.
- curl 'http://10.42.182.231:8080/apollo-dev/stomp/084/yf3adjwl/xhr' -X POST -H 'Accept: */*' -H 'Cookie: JSESSIONID=....;' -i
- 500

12/2/2016 12:31:47 AM2016-12-02 00:31:47,383 [http-apr-8080-exec-6] ERROR errors.GrailsExceptionResolver  - IllegalArgumentException occurred when processing request: [POST] /apollo-dev/stomp/987/g0noem66/xhr_streaming
12/2/2016 12:31:47 AMhostname can't be null. Stacktrace follows:
12/2/2016 12:31:47 AMorg.springframework.web.socket.sockjs.SockJsException: Uncaught failure in SockJS request, uri=http://localhost:9999/apollo-dev/stomp/987/g0noem66/xhr_streaming; nested exception is org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/987/g0noem66/xhr_streaming; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
12/2/2016 12:31:47 AM   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
12/2/2016 12:31:47 AM   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
12/2/2016 12:31:47 AM   at java.lang.Thread.run(Thread.java:745)
12/2/2016 12:31:47 AMCaused by: org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/987/g0noem66/xhr_streaming; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   ... 10 more
12/2/2016 12:31:47 AMCaused by: java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   at java.net.InetSocketAddress.checkHost(InetSocketAddress.java:149)
12/2/2016 12:31:47 AM   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:216)
12/2/2016 12:31:47 AM   ... 10 more
12/2/2016 12:31:47 AM2016-12-02 00:31:47,497 [http-apr-8080-exec-10] ERROR errors.GrailsExceptionResolver  - IllegalArgumentException occurred when processing request: [POST] /apollo-dev/stomp/987/colp8zq1/xhr
12/2/2016 12:31:47 AMhostname can't be null. Stacktrace follows:
12/2/2016 12:31:47 AMorg.springframework.web.socket.sockjs.SockJsException: Uncaught failure in SockJS request, uri=http://localhost:9999/apollo-dev/stomp/987/colp8zq1/xhr; nested exception is org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/987/colp8zq1/xhr; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
12/2/2016 12:31:47 AM   at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
12/2/2016 12:31:47 AM   at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
12/2/2016 12:31:47 AM   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
12/2/2016 12:31:47 AM   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
12/2/2016 12:31:47 AM   at java.lang.Thread.run(Thread.java:745)
12/2/2016 12:31:47 AMCaused by: org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/987/colp8zq1/xhr; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   ... 10 more
12/2/2016 12:31:47 AMCaused by: java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 12:31:47 AM   at java.net.InetSocketAddress.checkHost(InetSocketAddress.java:149)
12/2/2016 12:31:47 AM   at java.net.InetSocketAddress.<init>(InetSocketAddress.java:216)
12/2/2016 12:31:47 AM   ... 10 more

nathandunn commented 7 years ago

So, if you are using tomcat7, you have to use the absolute most recent version, and the default Ubuntu 12 / 14 tomcat7 is typically not recent enough to support websockets. Sorry, should have thought of that sooner. You can drop a jar in, or use the tomcat8 container, which should work.

That being said, if you don't have it installed it will fall back to regular long-polling. The web-services will be fine with or without web sockets. You can tell because in the browser you'll see something like this:

[screenshot, 2016-12-01 5:00:07 pm]

Also, have you looked at https://dockstore.org/? .. not sure what the scope of your project is.

hexylena commented 7 years ago

We're using tomcat 8
The problem doesn't seem to be fallback, it seems to be that websockets fail, and then all of the fallback methods trigger a 500 when SockJS thinks the hostname is invalid.

[screenshot: utvalg_025]

Dockstore is something quite different: it's just for packaging up bio-software in containers (there are dozens of projects of that type: "bioboxes", "biodckr", mulled), which is not something my org cares about. We're focused on deploying complex services like apollo / the full gmod suite.
Rancher is for orchestration. Think kubernetes / docker swarm + web ui.

[screenshots: utvalg_026, utvalg_027]

nathandunn commented 7 years ago

If you’re getting 500 errors, it must be failing on the server-side. The first 400 sometimes happens as the interface is refreshing itself.

Look in the catalina.out logs and see if anything is showing up.

Nathan

hexylena commented 7 years ago

Catalina logs. I'm not seeing the 400 bad request that's specific to the websockets, but I do see the 500s.

12/2/2016 1:12:52 AMhostname can't be null. Stacktrace follows:
12/2/2016 1:12:52 AMorg.springframework.web.socket.sockjs.SockJsException: Uncaught failure in SockJS request, uri=http://localhost:9999/apollo-dev/stomp/294/_pjn08x7/xhr_streaming; nested exception is org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/294/_pjn08x7/xhr_streaming; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 1:12:52 AM    at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
12/2/2016 1:12:52 AM    at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
12/2/2016 1:12:52 AM    at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
12/2/2016 1:12:52 AM    at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
12/2/2016 1:12:52 AM    at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
12/2/2016 1:12:52 AM    at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
12/2/2016 1:12:52 AM    at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
12/2/2016 1:12:52 AM    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
12/2/2016 1:12:52 AM    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
12/2/2016 1:12:52 AM    at java.lang.Thread.run(Thread.java:745)
12/2/2016 1:12:52 AMCaused by: org.springframework.web.socket.sockjs.SockJsException: Uncaught failure for request http://localhost:9999/apollo-dev/stomp/294/_pjn08x7/xhr_streaming; nested exception is java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 1:12:52 AM    ... 10 more
12/2/2016 1:12:52 AMCaused by: java.lang.IllegalArgumentException: hostname can't be null
12/2/2016 1:12:52 AM    at java.net.InetSocketAddress.checkHost(InetSocketAddress.java:149)
12/2/2016 1:12:52 AM    at java.net.InetSocketAddress.<init>(InetSocketAddress.java:216)
12/2/2016 1:12:52 AM    ... 10 more
12/2/2016 1:12:52 AM2016-12-02 01:12:52,521 [http-apr-8080-exec-1] ERROR errors.GrailsExceptionResolver  - IllegalArgumentException occurred when processing request: [POST] /apollo-dev/stomp/294/4dc58men/xhr

nathandunn commented 7 years ago

https://forums.mulesoft.com/questions/4456/hostname_cant_be_null_java_lang_illegalargumentexception.html

nathandunn commented 7 years ago

https://github.com/brianfrankcooper/YCSB/issues/105

I don't think it's an Apollo issue per se.

hexylena commented 7 years ago

Yeah, both of those came up in my searches, but neither yielded useful solutions.

As mentioned in first comment, I can make the request from localhost (i.e. apollo container), but as soon as I get one container away, I cannot and I'm not sure why.

hexylena commented 7 years ago

This likely isn't an apollo issue; it's likely a configuration problem somewhere in the stack, e.g. a specific hostname needs to be provided somewhere, or some magic flag needs to be set on one of the proxies. I opened the issue to track debugging, and so that if/when I fix this there is documentation other people can make use of.

hexylena commented 7 years ago

Hmm, from localhost, it works if it goes to 127.0.0.1. But if the request is made to a different IP address associated with the same machine, it fails. (apollo.apollo resolves to same server as localhost)

root@cea19e6e30ee:/opt# curl 'http://apollo.apollo:8080/apollo-dev/stomp/084/yf3adjwl/xhr' -X POST  -H 'Accept: */*' -H ';Cookie: JSESSIONID=2FAF>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2510    0  2510    0     0   237k      0 --:--:-- --:--:-- --:--:--  245k
HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=UTF-8
Content-Language: en
Transfer-Encoding: chunked
Date: Fri, 02 Dec 2016 17:20:08 GMT
Connection: close

<!DOCTYPE html>
<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6"> <![endif]-->
root@cea19e6e30ee:/opt# curl 'http://localhost:8080/apollo-dev/stomp/084/yf3adjwl/xhr' -X POST  -H 'Accept: */*' -H ';Cookie: JSESSIONID=2FAF1C37>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0    385      0 --:--:-- --:--:-- --:--:--   400
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Cache-Control: no-store, no-cache, must-revalidate, max-age=0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Content-Type: application/javascript;charset=UTF-8
Content-Length: 2
Date: Fri, 02 Dec 2016 17:20:15 GMT

o
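The two curl runs above differ in both the address being connected to and the `Host` header that curl derives from the URL. A minimal sketch, assuming the `Host` header is one of the suspects, that holds the connect address fixed and varies only the header; `probe` and `compare_hosts` are hypothetical helper names, and the path and cookie are whatever was used in the curl calls.

```python
import http.client


def probe(connect_host, port, path, host_header, cookie=None, timeout=5.0):
    """POST to `path` on connect_host:port, sending `host_header` as the Host header.

    Returns (status code, first 64 bytes of the body).
    """
    headers = {"Host": host_header, "Accept": "*/*"}
    if cookie:
        headers["Cookie"] = cookie
    conn = http.client.HTTPConnection(connect_host, port, timeout=timeout)
    try:
        # http.client skips its automatic Host header when one is supplied here.
        conn.request("POST", path, headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.read(64)
    finally:
        conn.close()


def compare_hosts(connect_host, port, path, host_headers, cookie=None):
    """Map each candidate Host header to the status code the server returns for it."""
    return {h: probe(connect_host, port, path, h, cookie)[0] for h in host_headers}
```

For example, `compare_hosts("127.0.0.1", 8080, "/apollo-dev/stomp/084/yf3adjwl/xhr", ["localhost:8080", "apollo.apollo:8080"])` run inside the apollo container. If the statuses differ, the header is what the server is reacting to; if both come back 200, the difference lies in the address the connection arrives on.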

hexylena commented 7 years ago

Adding

RewriteRule ^/apollo-dev/stomp/(.*)/websocket$ wss://localhost:9999/apollo-dev/stomp/$1/websocket [P,L]

in my apache conf helps. I'm really struggling to understand why it's wss when nothing outside of our proxy server has access to SSL certs. Surely anything sent over that protocol would fail because there's no access to certs? Ah well, it still fails:

[screenshot: utvalg_032]

but it's lies

[screenshot: utvalg_033]

and the XHR fallback still fails.

hexylena commented 7 years ago

Also http://stackoverflow.com/questions/28385798/host-name-cant-be-null-using-grails-spring-websocket-plugin seems exactly right but that answer is beyond useless. I've tried setting the hostname to all manner of things and nothing good comes of it.

hexylena commented 7 years ago

Update: Spent a long time wiring up JMX and mucking with that. No joy there either.

abretaud commented 7 years ago

I just saw this issue, do you still have proxy/websocket problems? I have a working setup that looks like this:

internet ---> apache 2.4 proxy ---> nginx proxy ---> another nginx proxy --> apollo

I can share my config if it can help

hexylena commented 7 years ago

@abretaud haven't tested with the latest image which is now based on tomcat:8, so that might be an improvement. I'll do a test deployment here soon.

For us the chain is: internet → apache → haproxy → (docker networking) → apollo

And intuition says it might be haproxy but I don't have any way to test websockets easily. (I really wish people would post demo client / servers for testing these stupid new protocols).

Thanks for the offer, I'll ping you / this issue if I'm still experiencing it. I really, really want apollo deployed on rancher so it isn't separately managed, since that's painful for me.

abretaud commented 7 years ago

Ok, no problem (though I've never used haproxy). +100 for the painful websocket testing!

nathandunn commented 7 years ago

@abretaud / @erasche I'm sure you've seen this, but wanted to repost. The default Ubuntu 14 version of tomcat 7 would not have worked (the more recent stable versions do). The long-polling fallback will work fine, though it's not ideal (though I don't think your users would notice unless you have weird firewall rules).

Configuring with haproxy for websockets looks like it may have been tricky. Have you tried removing haproxy from the equation?

I guess you saw my comment above about how to confirm that they are working. You just have to watch the "frames" tab.

hexylena commented 7 years ago

@nathandunn. We don't use ubuntu14. We (used to) run tomcat:7 (which defaults to jre7), we don't use the ubuntu images as they make for gigantic docker images.

> Configuring with haproxy for websockets looks like it may have been tricky. Have you tried removing haproxy from the equation?

Not possible. It's heavily tied to rancher. However haproxy can trivially proxy even mysql connections, so I'm somewhat dubious that it's really to blame and not SockJS + java.

> The long-polling fallback will work fine, though it's not ideal (though I don't think your users would notice unless you have weird firewall rules).

As you can see from the screenshot in https://github.com/GMOD/Apollo/issues/1358#issuecomment-264345007, that's not quite the case. We were having internal server errors during the fallback. Hence me thinking it was sockjs/java. Maybe the proxy was stripping a header / adding a header that caused issues.
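To check the "proxy stripping or adding a header" theory directly, one option is to put a throwaway echo server where apollo normally sits and look at what actually arrives after Apache and haproxy. A minimal sketch with the standard library; `EchoHeaders` and `make_server` are illustrative names, and port 8080 is assumed only because that is the port apollo listens on in this thread.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Reply to any GET or POST with the request line and headers as JSON."""

    def _echo(self):
        payload = {
            "method": self.command,
            "path": self.path,
            # Repeated header names collapse to the last value in this dict.
            "headers": dict(self.headers.items()),
        }
        body = json.dumps(payload, indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_GET = do_POST = _echo

    def log_message(self, fmt, *args):
        pass  # keep the console quiet; the response body carries the information


def make_server(port=8080, bind="0.0.0.0"):
    """Create the echo server; call .serve_forever() on the result to run it."""
    return HTTPServer((bind, port), EchoHeaders)
```

Running `make_server().serve_forever()` in a container behind the same lb, then requesting `https://fqdn/apollo-dev/stomp/...` from outside, shows the `Host`, `X-Forwarded-*`, `Upgrade` and `Connection` headers as the backend sees them, which can be compared against a direct request to the container.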

Hopefully the update to tomcat:8-jre8 fixes this, I should know later today.

hexylena commented 7 years ago

After the upgrade (still testing my image, has a bunch of other local changes), this is solved! :joy:

Websockets still aren't proxying right, but hey, who cares, fallback works, I can move apollo off my frontend machine and on to compute infrastructure, and I can have my remote user stuff working now.

Thanks for debugging input everyone.

nathandunn commented 7 years ago

Weird. Did you explicitly set up an apache or nginx proxy (we have some docs on this if not), or are you going straight through tomcat? Or do you think it's still haproxy?

Anyway, glad it worked one way or another.

hexylena commented 7 years ago

It wasn't any explicit changes, just the change to 1) more recent version of apollo (we were on 2.0.3? 4?) + 2) tomcat8/jre8

We hit apache and haproxy: internet → apache → haproxy → (docker SDN) → apollo.

We actually have nginx in the route as well, but that's just for a special case of API access. There the route goes internet → apache → haproxy → (docker SDN) → nginx to rewrite paths → apollo, because debugging direct network connections wasn't fun enough ;)

The apache proxy hasn't changed at all; it's still using pretty standard proxying rules (proxypass and wstunnel, quite similar to what your docs describe). That works perfectly with the direct connection, so I'll make an effort some other time to figure out why it doesn't work over my proxy setup. Just happy to have made progress.

GMOD / Apollo

SockJS issue through many layers of proxying (debugging) #1358