change-metrics / monocle

Monocle helps teams and individuals better organize daily duties and detect anomalies in the way changes are produced and reviewed.
https://changemetrics.io
GNU Affero General Public License v3.0

Errors when running crawler behind corporate proxy #1094

Closed azat-alimov-db closed 11 months ago

azat-alimov-db commented 11 months ago

Hello,

Thank you for helping out with my question about proxy settings for the crawler. Now, when I try to run a test indexing, I get the following error:

2023-12-12 21:01:48 WARNING Macroscope.Main:317: Skipping due to an unexpected exception {"index":"test","crawler":"coder","err":"Decoding of CommitInfoRequest {commitInfoRequestIndex = \"test\", commitInfoRequestCrawler = \"coder\", commitInfoRequestEntity = Enumerated {enumerated = Right EntityTypeENTITY_TYPE_ORGANIZATION}, commitInfoRequestOffset = 0} failed with: \"Error in $: Failed reading: not a valid json value at '<!DOCTYPEhtmlPUBLIC-W3CDTDXHTML1.0TransitionalENhttp:www.w3.orgTRxhtml1DTDxhtm'\"\nCallStack (from HasCallStack):\n error, called at src/Relude/Debug.hs:289:11 in relude-1.2.0.0-Jiwa4gfuZvkK1snRof3V:Relude.Debug\n error, called at src/Monocle/Client.hs:107:17 in monocle-0.1.10.0-1juCsBb4vJ35WvYo0D138g:Monocle.Client"}

Here is a config: workspaces:

Any idea what that means? I'd appreciate any hints.

TristanCacqueray commented 11 months ago

It seems like this is happening when the crawler uses the proxy to connect to the api. We probably need a different variable name for that case.

azat-alimov-db commented 11 months ago

That's what we set as the proxy setting:

TristanCacqueray commented 11 months ago

Could you try removing the HTTP_PROXY variable? It should be the one used for the connections from the crawler to the API.
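
To illustrate why removing HTTP_PROXY can help: by common convention, HTTP_PROXY applies to plain http:// requests and HTTPS_PROXY applies to https:// requests. The sketch below is a simplified Python model of that convention, not Monocle's actual code. The function name `proxy_for`, the proxy address, and the internal API URL `http://api:8080` are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def proxy_for(url, environ):
    """Return the proxy a client would use for `url`, or None for a direct connection.

    Simplified model: only the upper-case variables are consulted and
    NO_PROXY is ignored.
    """
    scheme = urlsplit(url).scheme.lower()
    variable = {"http": "HTTP_PROXY", "https": "HTTPS_PROXY"}.get(scheme)
    return environ.get(variable) if variable else None


# Hypothetical values, for illustration only.
PROXY = "http://proxy.example:8080"
with_http_proxy = {"HTTP_PROXY": PROXY, "HTTPS_PROXY": PROXY}
without_http_proxy = {"HTTPS_PROXY": PROXY}

# With HTTP_PROXY set, the internal API call is sent to the proxy.
print(proxy_for("http://api:8080/api/2/about", with_http_proxy))
# With HTTP_PROXY removed, the internal call is direct, while GitHub is still proxied.
print(proxy_for("http://api:8080/api/2/about", without_http_proxy))
print(proxy_for("https://api.github.com/graphql", without_http_proxy))
```

Under this model, a proxy that cannot reach the internal API would answer with its own HTML error page, which is consistent with the "not a valid json value" decoding error reported above.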

azat-alimov-db commented 11 months ago

OK, I tried that. It looks like the connection can be established now, but I'm getting SSL errors:

2023-12-12 21:55:59 WARNING Monocle.Effects:526: network error {"index":"test","crawler":"coder","stream":"Projects","count":7,"limit":7,"loc":"api.github.com:443/graphql","failed":"InternalException ProtocolError \"error:0A000086:SSL routines::certificate verify failed\""}

Any hints on configuring SSL certificates for the crawler? When traffic goes through the proxy, the SSL certificate is replaced with one signed by our organization. Alternatively, is there a way to run the crawler in insecure mode?

TristanCacqueray commented 11 months ago

Alright, thanks.

SSL is implemented by openssl, so setting SSL_CERT_FILE should work.
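
One way to double-check which environment variable an OpenSSL build consults for its CA bundle is to ask an OpenSSL-backed runtime. The sketch below uses Python's standard `ssl` module for that. It only reflects the OpenSSL library linked into Python, not necessarily the one used by the crawler, and the helper name `ca_file_override` and the example path are hypothetical.

```python
import os
import ssl


def ca_file_override(environ):
    """Return (variable name, configured value) for the OpenSSL CA file override."""
    paths = ssl.get_default_verify_paths()
    # Name of the environment variable OpenSSL reads for a CA bundle file.
    variable = paths.openssl_cafile_env
    return variable, environ.get(variable)


# Inspect the current process environment.
print(ca_file_override(os.environ))
# Inspect a hypothetical environment without touching the real one.
print(ca_file_override({"SSL_CERT_FILE": "/path/to/ca-bundle.pem"}))
```

If the second element is `None`, no override is set and the library falls back to its compiled-in default locations.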

azat-alimov-db commented 11 months ago

Gotcha, thank you. I'll work on that tomorrow, since I'll need to update the deployment YAML and mount the SSL certs somewhere as a secret.

morucci commented 11 months ago

The related change is merged. New container image should be published soon. https://github.com/change-metrics/monocle/actions/runs/7199715334

azat-alimov-db commented 11 months ago

Hello,

I added a certificate to a deployment and set the env var to:

- name: SSL_CERT_FILE
  value: /etc/pki/tls/certs/db-server-ca-6.cer

Then I tested with curl, and the connection works fine via the proxy:

bash-4.2$ curl -v -o /dev/null https://api.github.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to proxy *** port 8080 (#0)
*   Trying 10.245.32.5...
* Connected to *** (10.245.32.5) port 8080 (#0)
* Establish HTTP proxy tunnel to api.github.com:443
> CONNECT api.github.com:443 HTTP/1.1
> Host: api.github.com:443
> User-Agent: curl/7.29.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.0 200 Connection established
< 
* Proxy replied OK to CONNECT request
* Initializing NSS with certpath: sql:/etc/pki/nssdb
*   CAfile: /etc/pki/tls/certs/db-server-ca-6.cer
  CApath: none
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
***
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: api.github.com
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 13 Dec 2023 19:44:29 GMT
< ETag: W/"4f825cc84e1c733059d46e76e6df9db557ae5254f9625dfe8e1b09499c449438"
< Vary: Accept, Accept-Encoding, Accept, X-Requested-With
< Server: GitHub.com
< Connection: Keep-Alive
< Content-Type: application/json; charset=utf-8
< Accept-Ranges: bytes
< Cache-Control: public, max-age=60, s-maxage=60
< Content-Length: 2262
< Referrer-Policy: origin-when-cross-origin, strict-origin-when-cross-origin
< X-Frame-Options: deny
< X-RateLimit-Used: 1
< X-XSS-Protection: 0
< X-RateLimit-Limit: 60
< X-RateLimit-Reset: 1702500276
< X-GitHub-Media-Type: github.v3; format=json
< X-GitHub-Request-Id: 5F6A:3D26CA:1FEADE:204AFE:657A09A4
< X-RateLimit-Resource: core
< X-RateLimit-Remaining: 59
< X-Content-Type-Options: nosniff
< Content-Security-Policy: default-src 'none'
< Strict-Transport-Security: max-age=31536000; includeSubdomains; preload
< Access-Control-Allow-Origin: *
< Access-Control-Expose-Headers: ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset
< x-github-api-version-selected: 2022-11-28
< 
{ [data not shown]
100  2262  100  2262    0     0   5835      0 --:--:-- --:--:-- --:--:--  5844
* Connection #0 to host *** left intact

But the crawler still receives the following: "2023-12-13 18:46:49 WARNING Macroscope.Main:317: Skipping due to an unexpected exception {"index":"test","crawler":"coder","err":"HttpExceptionRequest Request {\n host = \"api.github.com\"\n port = 443\n secure = True\n requestHeaders = [(\"Authorization\",\"<REDACTED>\"),(\"User-Agent\",\"change-metrics/monocle\"),(\"Content-Type\",\"application/json\")]\n path = \"/graphql\"\n queryString = \"\"\n method = \"POST\"\n proxy = Nothing\n rawBody = False\n redirectCount = 10\n responseTimeout = ResponseTimeoutDefault\n requestVersion = HTTP/1.1\n proxySecureMode = ProxySecureWithConnect\n}\n (InternalException ProtocolError \"error:0A000086:SSL routines::certificate verify failed\")"}"

Is it possible to set it to insecure?

I'd appreciate any further suggestions.

TristanCacqueray commented 11 months ago

Perhaps you can try setting TLS_NO_VERIFY to 1

azat-alimov-db commented 11 months ago

Ah, it looks like it is the TLS_NO_VERIFY variable, as per: https://github.com/change-metrics/monocle/blob/659e4c319b3b6c37777ae692952c7250448e7319/src/Monocle/Client.hs#L47C28-L47C41
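
For readers unfamiliar with this kind of switch, here is a minimal sketch of the general pattern: an environment variable that turns off certificate verification when set to 1. It is written in Python with the standard `ssl` module and is not Monocle's Haskell implementation. The function name `make_tls_context` is hypothetical, and the assumption that only the exact value "1" disables verification is mine.

```python
import os
import ssl


def make_tls_context(environ):
    """Build a TLS context, skipping certificate checks when TLS_NO_VERIFY is "1"."""
    context = ssl.create_default_context()
    if environ.get("TLS_NO_VERIFY") == "1":
        # Hostname checking must be disabled before verify_mode can be relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


secure = make_tls_context({})
insecure = make_tls_context({"TLS_NO_VERIFY": "1"})
print(secure.verify_mode, insecure.verify_mode)

# Typical use: build the context from the real process environment.
context = make_tls_context(os.environ)
```

Disabling verification removes protection against interception, so it is usually best kept to test setups while the CA bundle issue is sorted out.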

azat-alimov-db commented 11 months ago

Any idea why I get a "Network error" from the web UI (api) when trying to access it via a browser?

The logs of the api service don't show any suspicious errors; moreover, they show that the request returned 200: [13/Dec/2023:20:17:46 +0000] "GET / HTTP/1.1" 200 - "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36""

I've exposed the service via Cloud Load Balancer on GCP GKE, with LoadBalancer service type:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: api
    app.kubernetes.io/component: api
    app.kubernetes.io/part-of: monocle
  name: api-external
  annotations:
    networking.gke.io/internal-load-balancer-allow-global-access: "true"
    networking.gke.io/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  ports:
    - name: http-rest-api
      port: 8080
      targetPort: 8080
  selector:
    app.kubernetes.io/name: api
status:
  loadBalancer: {}

TristanCacqueray commented 11 months ago

Have you tried setting COMPOSE_MONOCLE_PUBLIC_URL?

azat-alimov-db commented 11 months ago

Yep, I set that for the api and crawler, but I'm still getting the same "Network error" message.

TristanCacqueray commented 11 months ago

Oops, I meant MONOCLE_PUBLIC_URL. This should be the URL you use to access the web UI, and it is only needed for the api container. It defaults to localhost, so if you look at the network tab of your browser's developer tools, you should see that the network error happens because the client tries to connect to localhost.
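
The check described here can be sketched as a small Python helper that compares the address the page was loaded from with the address the web client sends API requests to. The function name `api_points_elsewhere` is hypothetical, and the example values mirror the Origin and Request URL reported in this thread.

```python
from urllib.parse import urlsplit


def api_points_elsewhere(page_url, api_request_url):
    """Return True when the API request targets a different host or port than the page."""
    page = urlsplit(page_url)
    api = urlsplit(api_request_url)
    return (page.hostname, page.port) != (api.hostname, api.port)


# The page is served from the load balancer address, but the client calls localhost.
print(api_points_elsewhere("http://100.88.10.138:8080", "http://localhost:8080/api/2/about"))
# Once the public URL matches the address used in the browser, both agree.
print(api_points_elsewhere("http://100.88.10.138:8080", "http://100.88.10.138:8080/api/2/about"))
```

A mismatch like the first case suggests the public URL setting does not match the address used to open the web UI.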

azat-alimov-db commented 11 months ago

Sorry, I can't give you a screenshot, but in Chrome developer tools I see the following for the "about" request:

General:

Request URL:
http://localhost:8080/api/2/about
Referrer Policy:
strict-origin-when-cross-origin

Request Headers:

Accept:
*/*
Access-Control-Request-Headers:
content-type
Access-Control-Request-Method:
POST
Origin:
http://100.88.10.138:8080
Sec-Fetch-Mode:
cors
User-Agent:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36

azat-alimov-db commented 11 months ago

Awesome, that did the trick. Thank you very much @TristanCacqueray ! Let me play around with that great tool.

Feel free to close this issue.

TristanCacqueray commented 11 months ago

You're welcome, have fun!