bytedance / g3

Enterprise-oriented Generic Proxy Solutions
Apache License 2.0
444 stars 34 forks source link

Issue getting upstream cert: no fake cert generated: failed to get fake upstream certificate #282

Closed mspublic closed 1 month ago

mspublic commented 1 month ago

I have noticed for certain sites there are issues getting the upstream cert/the browser does not connect. If I let it sit for a minute then refresh it works.

I am not sure the exact reason for this. But a potential workaround would be to fallback to generating a basic cert just with the SNI.

Aug 15 09:26:13.185 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=33238, task_id=537cc6ff5b1211ef92dec87d33fb60f1, upstream=login.microsoftonline.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 15 09:26:13.185 INFO c_rd_bytes=670, c_wr_bytes=0, client_addr=127.0.0.1:62715, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:62716, next_peer_addr=40.126.29.5:443, pid=33238, r_rd_bytes=4404, r_wr_bytes=974, ready_time=16.918ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T14:26:13.124661Z, task_id=537cc6ff5b1211ef92dec87d33fb60f1, task_type=TcpConnect, tcp_connect_spend=15.827ms, tcp_connect_tries=1, total_time=60.551ms, upstream=login.microsoftonline.com:443, wait_time=111.667µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate
zh-jq commented 1 month ago

Do you get any error message from the output of g3fcgen?

mspublic commented 1 month ago

No errors in g3fcgen stdout/err.

zh-jq commented 1 month ago

I did a local test with curl -x 127.0.0.1:8080 https://login.microsoftonline.com --insecure -v and didn't found any problems.

mspublic commented 1 month ago

It's random on when it happens. But it will typically happens somewhere in the flow logging into office365/word online

mspublic commented 1 month ago

It looks like its failing out before g3fcgen is able to get the certificate. The error timestamp in g3proxy is earlier than the debug output of g3fcgen.

Aug 15 09:50:12.050 DEBG Worker#0 got certificate for host usc-word-edit.officeapps.live.com <g3fcgen::backend(mod.rs:166)>
Aug 15 09:50:35.087 DEBG Worker#0 got certificate for host usc-word-edit.officeapps.live.com <g3fcgen::backend(mod.rs:166)>
Aug 15 09:50:12.023 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=33986, task_id=ac9a29ff5b1511ef8e87ef3b91191f4d, upstream=usc-word-edit.officeapps.live.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 15 09:50:12.023 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=33986, task_id=ac9a29ff5b1511ef8e88ef3b91191f4d, upstream=usc-word-edit.officeapps.live.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 15 09:50:12.023 INFO c_rd_bytes=678, c_wr_bytes=0, client_addr=127.0.0.1:63759, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:63763, next_peer_addr=52.108.9.12:443, pid=33986, r_rd_bytes=6892, r_wr_bytes=675, ready_time=27.504ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T14:50:11.558343Z, task_id=ac9a29ff5b1511ef8e87ef3b91191f4d, task_type=TcpConnect, tcp_connect_spend=12.218ms, tcp_connect_tries=1, total_time=464.848ms, upstream=usc-word-edit.officeapps.live.com:443, wait_time=121.000µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate
Aug 15 09:50:12.023 INFO c_rd_bytes=678, c_wr_bytes=0, client_addr=127.0.0.1:63760, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:63762, next_peer_addr=52.108.9.12:443, pid=33986, r_rd_bytes=6892, r_wr_bytes=675, ready_time=11.138ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T14:50:11.572303Z, task_id=ac9a29ff5b1511ef8e88ef3b91191f4d, task_type=TcpConnect, tcp_connect_spend=9.988ms, tcp_connect_tries=1, total_time=450.889ms, upstream=usc-word-edit.officeapps.live.com:443, wait_time=102.291µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate
Aug 15 09:50:12.023 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=33986, task_id=ac9a29ff5b1511ef8e86ef3b91191f4d, upstream=usc-word-edit.officeapps.live.com:443, no fake cert generated: failed to get fake upstream certificate
zh-jq commented 1 month ago

Can you try by setting query_wait_timeout: 4s in tls_cert_generator config?

zh-jq commented 1 month ago

Or you can add more working threads to g3fcgen by setting worker: {} in it's main.yaml.

mspublic commented 1 month ago

I set both and it still fails immediately. Doesn't seem to wait for timeout.

Aug 15 10:38:35.650 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=34813, task_id=6eeccf7f5b1c11ef9d648a8e6879accb, upstream=www.google.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 15 10:38:35.650 INFO c_rd_bytes=659, c_wr_bytes=0, client_addr=127.0.0.1:65345, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:65346, next_peer_addr=142.250.114.99:443, pid=34813, r_rd_bytes=4104, r_wr_bytes=581, ready_time=23.381ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T15:38:34.806586Z, task_id=6eeccf7f5b1c11ef9d648a8e6879accb, task_type=TcpConnect, tcp_connect_spend=9.081ms, tcp_connect_tries=1, total_time=844.278ms, upstream=www.google.com:443, wait_time=308.625µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate
mspublic commented 1 month ago

Also as an aside - should I be using worker or runtime to define the thread numbers? In the example it had runtime. But seems that may not be right. I tester with both without success.

runtime:
  thread_number: 2

I also tested with:

worker:
  thread_number: 2
zh-jq commented 1 month ago

I set both and it still fails immediately. Doesn't seem to wait for timeout.

Aug 15 10:38:35.650 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=34813, task_id=6eeccf7f5b1c11ef9d648a8e6879accb, upstream=www.google.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 15 10:38:35.650 INFO c_rd_bytes=659, c_wr_bytes=0, client_addr=127.0.0.1:65345, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:65346, next_peer_addr=142.250.114.99:443, pid=34813, r_rd_bytes=4104, r_wr_bytes=581, ready_time=23.381ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T15:38:34.806586Z, task_id=6eeccf7f5b1c11ef9d648a8e6879accb, task_type=TcpConnect, tcp_connect_spend=9.081ms, tcp_connect_tries=1, total_time=844.278ms, upstream=www.google.com:443, wait_time=308.625µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate

You can capture the traffic on udp port 2999 to check the response latency

zh-jq commented 1 month ago

Also as an aside - should I be using worker or runtime to define the thread numbers? In the example it had runtime. But seems that may not be right. I tester with both without success.

runtime:
  thread_number: 2

I also tested with:

worker:
  thread_number: 2

the first one will set the thread number for handle of udp traffic. the second one will set thread number for cert generation

mspublic commented 1 month ago

Looks like it does successfully send it back within .16 seconds. G3proxy displays an error within .5 seconds after that.

mspublic commented 1 month ago

Looks like if I set cache_request_timeout and query_wait_timeout it significantly reduces the frequency of the error. Though I do tend to see more "Connection reset by peer" errors when those are set.

query_wait_timeout: 4s
cache_request_timeout: 4s
Aug 15 12:46:03.079 INFO c_rd_bytes=0, c_wr_bytes=0, client_addr=127.0.0.1:60461, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:60485, next_peer_addr=23.198.7.12:443, pid=38637, r_rd_bytes=0, r_wr_bytes=0, ready_time=6.840ms, reason=ClientTcpReadFailed, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T17:46:03.072489Z, task_id=3e1595ff5b2e11ef9a65c3d4441663e6, task_type=TcpConnect, tcp_connect_spend=6.619ms, tcp_connect_tries=1, total_time=6.978ms, upstream=res-1.cdn.office.net:443, wait_time=92.375µs, tcp read from client: Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" }
zh-jq commented 1 month ago

I will add more trace level logs today to see if there are other non-timeout problems

zh-jq commented 1 month ago

Looks like if I set cache_request_timeout and query_wait_timeout it significantly reduces the frequency of the error. Though I do tend to see more "Connection reset by peer" errors when those are set.

query_wait_timeout: 4s
cache_request_timeout: 4s
Aug 15 12:46:03.079 INFO c_rd_bytes=0, c_wr_bytes=0, client_addr=127.0.0.1:60461, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:60485, next_peer_addr=23.198.7.12:443, pid=38637, r_rd_bytes=0, r_wr_bytes=0, ready_time=6.840ms, reason=ClientTcpReadFailed, server_addr=127.0.0.1:4128, server_name=mitm, server_type=HttpProxy, stage=Relaying, start_at=2024-08-15T17:46:03.072489Z, task_id=3e1595ff5b2e11ef9a65c3d4441663e6, task_type=TcpConnect, tcp_connect_spend=6.619ms, tcp_connect_tries=1, total_time=6.978ms, upstream=res-1.cdn.office.net:443, wait_time=92.375µs, tcp read from client: Os { code: 54, kind: ConnectionReset, message: "Connection reset by peer" }

you can search the same task_id in the log to see in which protocol state the client closed the connection by sending a tcp reset

mspublic commented 1 month ago

Thanks for adding additional debug info. It seems that g3fcgen is able to get the cert but g3proxy is having issues getting it from there. Could it be related to the cache? Here is the latest:

g3fcgen:

Aug 19 10:08:06.573 DEBG Worker#0 got certificate for host usc-word-edit.officeapps.live.com <g3fcgen::backend(mod.rs:166)>
Aug 19 10:08:06.573 DEBG usc-word-edit.officeapps.live.com - duration: 99526625ns, rsp size: 3854 <g3fcgen(lib.rs:139)>

Here is g3proxy related task_id

Aug 19 10:08:06.438 INFO auditor_name=default, daemon_name=, depth=1, log_type=Inspect, pid=36930, protocol=tls_modern, source=stream inspection, task_id=d701257f5e3c11ef89e79e4f7c53d357, ()

Aug 19 10:08:10.481 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=36930, task_id=d701257f5e3c11ef89e79e4f7c53d357, upstream=usc-word-edit.officeapps.live.com:443, no fake cert generated: failed to get fake upstream certificate

Aug 19 10:08:10.500 INFO c_rd_bytes=681, c_wr_bytes=0, client_addr=127.0.0.1:51108, daemon_name=, escaper=default, log_type=Task, next_bound_addr=192.168.50.94:51110, next_peer_addr=52.108.8.12:443, pid=36930, r_rd_bytes=6928, r_wr_bytes=711, ready_time=15.552ms, reason=InterceptionError, server_addr=127.0.0.1:4128, server_name=http-proxy, server_type=HttpProxy, stage=Relaying, start_at=2024-08-19T15:08:06.422443Z, task_id=d701257f5e3c11ef89e79e4f7c53d357, task_type=TcpConnect, tcp_connect_spend=9.545ms, tcp_connect_tries=1, total_time=4.059s, upstream=usc-word-edit.officeapps.live.com:443, wait_time=113.500µs, tls_modern interception error: tls: no fake cert generated: failed to get fake upstream certificate
zh-jq commented 1 month ago

Do you see any cert generation related logs between Aug 19 10:08:06.438 and Aug 19 10:08:10.481?

mspublic commented 1 month ago

I was able to reproduce on a different site. Looks like I missed a debug log

Aug 19 11:36:18.433 DEBG cert generation query timeout for b1t-chidc2.zemanta.com <g3_cert_agent::query(query.rs:146)>

Aug 19 11:36:18.433 INFO auditor_name=default, daemon_name=, depth=1, intercept_type=TlsHandshake, log_type=Intercept, pid=40017, task_id=26e5e97f5e4911efb8f89b435b7523e6, upstream=b1t-chidc2.zemanta.com:443, no fake cert generated: failed to get fake upstream certificate
Aug 19 11:36:14.762 DEBG Worker#1 got certificate for host b1t-chidc2.zemanta.com <g3fcgen::backend(mod.rs:166)>
Aug 19 11:36:14.763 DEBG b1t-chidc2.zemanta.com - duration: 334989000ns, rsp size: 2969 <g3fcgen(lib.rs:139)>
zh-jq-b commented 1 month ago

01072f6a117f050d2cfaf15a17b565adb1ffd2b0 this commit should fix the problem

mspublic commented 1 month ago

Looks good! Thanks!