Seagate / cortx-s3server

CORTX S3 compatible storage server for CORTX
https://github.com/Seagate/cortx
Apache License 2.0

s3authserver not working normally and gateway time-out #1759

Open h3czxp opened 2 years ago

h3czxp commented 2 years ago

1. When testing the CORTX cluster for about an hour with the warp tool, the s3authserver stops working normally.

2. When putting a large number of objects, warp reports the error "gateway time-out", and s3server logs the error "S3 Get request failed. HTTP status code = 404".

How to solve the above problems?

Thanks Xianpeng Zhou

welcome[bot] commented 2 years ago

Thanks for opening this issue. A contributor should be by to give feedback soon. In the meantime, please check out the contributing guidelines and explore other ways you can get involved.

cortx-admin commented 2 years ago

For the convenience of the Seagate development team, this issue has been mirrored in a private Seagate Jira Server: https://jts.seagate.com/browse/EOS-28345. Note that community members will not be able to access that Jira server but that is not a problem since all activity in that Jira mirror will be copied into this GitHub issue.

gregnsk commented 2 years ago

@h3czxp - thank you for reporting it. Could you share more details about the test scenario? How many objects were created, what are the parameters of your warp command? Are you running the latest CORTX version in Kubernetes?

Could you supply s3auth server logs (/var/log/seagate/auth/server) when the issue happens?

h3czxp commented 2 years ago

@gregnsk - There are three nodes in the cluster. Each node has 96 cores and a 25 Gb/s network. The warp command is:

```
./warp get --duration=5m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=1024KiB --objects=10000 --concurrent=100
```

We are using the newest RPM packages provided by your company to deploy our cluster, not Kubernetes.

The s3auth server and s3server logs are below.

s3server.node51.invalid-user.log.ERROR.20220126-073648.87646.txt
app.log
app-2022-01-26-00.log.gz
app-2022-01-26-01.log.gz
app-2022-01-26-02.log.gz
app-2022-01-26-03.log.gz
app-2022-01-26-04.log.gz
app-2022-01-26-05.log.gz
app-2022-01-26-06.log.gz
app-2022-01-26-07.log.gz
app-2022-01-26-09.log.gz
app-2022-01-26-10.log.gz
app-2022-01-26-20.log.gz

s3server_INFO_WARNING_log.zip

t7ko-seagate commented 2 years ago

1. When testing the CORTX cluster for about an hour with the warp tool, the s3authserver stops working normally.

What are the symptoms? How exactly is it failing?

t7ko-seagate commented 2 years ago

Initial log analysis.

The app.log files are clean: no failures or exceptions, all responses are 200 OK (auth successful).

The ERROR log reports two kinds of errors:

```
E0126 07:36:48.065618 87646 s3_get_object_action.cc:1082] [send_response_to_s3_client] [ReqID: d104772c-2103-4290-b337-4986aa55a8b6] S3 Get request failed. HTTP status code = 404
E0126 20:21:42.594319 87968 s3_motr_rw_common.cc:174] [s3_motr_op_common] [ReqID: 4483f40f-11fd-4886-8404-8cfb48df75ec] Error code = -110 op_code = 11
```

Error -110 is causing failures in some write operations and some delete-object operations.

Error 110 is "connection timed out". Since it comes from libmotr, I assume the cause is a connection loss between s3 and Motr (probably due to m0d going down?).
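As a quick sanity check of that mapping, here is a minimal standalone C sketch (not CORTX code; it relies only on the standard Linux errno table, where 110 is ETIMEDOUT):

```c
/* Minimal sketch: Motr logs negative errno values, so the
 * "Error code = -110" above is -ETIMEDOUT on Linux. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	int rc = -110; /* value from the s3server ERROR log */

	/* prints: rc=-110 (ETIMEDOUT): Connection timed out */
	printf("rc=%d (%s): %s\n", rc,
	       -rc == ETIMEDOUT ? "ETIMEDOUT" : "?", strerror(-rc));
	return 0;
}
```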

These errors are not present in the detailed INFO logs. The reason is log rotation: the ERROR log ends on Jan 26th, while the INFO logs were captured on Jan 27th. With these logs it is not possible to debug further.

Hi @h3czxp --

To help us troubleshoot, can you please:

t7ko-seagate commented 2 years ago

One more note on error 404.

We observed this error in the first warp invocation after a cluster restart.

As a work-around while we're looking into the issue: after a cluster restart, trigger a quick warp workload (small objects, a small number of them, a short duration), wait until it completes, and ignore the result. Then start your main workload.

cortx-admin commented 2 years ago

Ivan Tishchenko commented in Jira Server:

GitHub issue has been re-opened. h3c still needs help.

cortx-admin commented 2 years ago

Ivan Tishchenko commented in Jira Server:

Initial log analysis.

The last app.log is clean. I have not yet checked the other app.log archives.

The ERROR log reports two kinds of errors:

UPD -- posted the detailed analysis in GitHub. Comments from JIRA do not seem to be mirrored back to GitHub.

h3czxp commented 2 years ago

@t7ko-seagate Thanks for your reply!

1. There are core files for m0d in /var/crash, but the files are too large to upload. The stack trace is below.

```
(gdb) bt
#0  0x00007ff9cfc58387 in raise () from /lib64/libc.so.6
#1  0x00007ff9cfc59a78 in abort () from /lib64/libc.so.6
#2  0x00007ff9d1c2c42d in m0_arch_panic (c=c@entry=0x7ff9d2037d20 <__pctx.8265>, ap=ap@entry=0x7ff702ffc298) at lib/user_space/uassert.c:131
#3  0x00007ff9d1c1b7c4 in m0_panic (ctx=ctx@entry=0x7ff9d2037d20 <__pctx.8265>) at lib/assert.c:52
#4  0x00007ff9d1b51090 in mem_alloc (zonemask=2, size=<optimized out>, tx=0x7ff4884dcdb0, btree=0x400000139f90) at be/btree.c:127
#5  btree_save (tree=tree@entry=0x400000139f90, tx=tx@entry=0x7ff4884dcdb0, op=op@entry=0x7ff4884dd360, key=key@entry=0x7ff4884dd8e0, val=val@entry=0x0, anchor=anchor@entry=0x7ff4884dd578, optype=optype@entry=BTREE_SAVE_OVERWRITE, zonemask=zonemask@entry=2) at be/btree.c:1453
#6  0x00007ff9d1b529d5 in m0_be_btree_save_inplace (tree=0x400000139f90, tx=0x7ff4884dcdb0, op=0x7ff4884dd360, key=key@entry=0x7ff4884dd8e0, anchor=0x7ff4884dd578, overwrite=<optimized out>, zonemask=2) at be/btree.c:2148
#7  0x00007ff9d1b8107a in ctg_op_exec (ctg_op=ctg_op@entry=0x7ff4884dd350, next_phase=48) at cas/ctg_store.c:1021
#8  0x00007ff9d1b81536 in ctg_exec (ctg_op=ctg_op@entry=0x7ff4884dd350, ctg=ctg@entry=0x400000139f70, key=key@entry=0x7ff702ffca20, next_phase=next_phase@entry=48) at cas/ctg_store.c:1231
#9  0x00007ff9d1b81c56 in m0_ctg_insert (ctg_op=ctg_op@entry=0x7ff4884dd350, ctg=ctg@entry=0x400000139f70, key=key@entry=0x7ff702ffca20, val=val@entry=0x7ff702ffca70, next_phase=next_phase@entry=48) at cas/ctg_store.c:1257
#10 0x00007ff9d1b7afd9 in cas_exec (next=48, rec_pos=0, ctg=0x400000139f70, ct=CT_BTREE, opc=CO_PUT, fom=0x7ff4884dcce0) at cas/service.c:2184
#11 cas_fom_tick (fom0=0x7ff4884dcce0) at cas/service.c:1473
#12 0x00007ff9d1bed0bb in fom_exec (fom=0x7ff4884dcce0) at fop/fom.c:791
#13 loc_handler_thread (th=0x1c83970) at fop/fom.c:931
#14 0x00007ff9d1c21f8e in m0_thread_trampoline (arg=arg@entry=0x1c83978) at lib/thread.c:117
#15 0x00007ff9d1c2d18d in uthread_trampoline (arg=0x1c83978) at lib/user_space/uthread.c:98
#16 0x00007ff9d139aea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007ff9cfd2096d in clone () from /lib64/libc.so.6
```

The mem_alloc() from frame #4 (be/btree.c:127) is:

```c
static inline void *mem_alloc(const struct m0_be_btree *btree,
                              struct m0_be_tx *tx, m0_bcount_t size,
                              uint64_t zonemask)
{
	void *p;

	M0_BE_OP_SYNC(op,
		      m0_be_alloc_aligned(tree_allocator(btree),
					  tx, &op, &p, size,
					  BTREE_ALLOC_SHIFT,
					  zonemask));
	M0_ASSERT(p != NULL); /* the assertion that panics in frame #4 */
	return p;
}
```

2. There are warning logs "Unable to connect to Auth server" and "Socket error: Connection reset by peer, errno: 104, set errtype: Reading Error" around the time the auth server fails, in the WARNING file uploaded before.

madhavemuri commented 2 years ago

@h3czxp : Can you update the following parameter in /etc/sysconfig/motr to a higher value (in multiples of 4K), equal to the size of the metadata disk used?

```
# Backend segment size (in bytes) for IO service. Default is 4 GiB.
#MOTR_M0D_IOS_BESEG_SIZE=4294967296
```

Make sure that mkfs is re-run on all the nodes after this setting is applied.
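For illustration, a minimal standalone sketch of that "multiple of 4K" arithmetic (the 200 GiB input is hypothetical, matching the disk size mentioned later in this thread; this is not part of any CORTX tool):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t meta_disk = 200ULL * 1024 * 1024 * 1024; /* hypothetical 200 GiB metadata disk */
	uint64_t beseg = meta_disk & ~(uint64_t)4095;     /* round down to a 4 KiB multiple */

	printf("MOTR_M0D_IOS_BESEG_SIZE=%llu\n", (unsigned long long)beseg);
	return 0;
}
```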

cc: @huanghua78

h3czxp commented 2 years ago

@t7ko-seagate I reproduced the 2nd failure (gateway time-out) using:

```
./warp get --duration=2m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=64KiB --objects=1000000 --concurrent=100
```

The s3server logs are too large to upload.

app_log_node51.zip app_log_node52.zip app_log_node53.zip

h3czxp commented 2 years ago

@madhavemuri : Thank you very much for your advice. The parameter MOTR_M0D_IOS_BESEG_SIZE=4294967296 in /etc/sysconfig/motr is disabled by default. We enabled the parameter and redeployed our cluster. We ran the test case for more than three hours and the s3 auth service did not fail. We are now testing further.

h3czxp commented 2 years ago

@madhavemuri : When we update the parameter MOTR_M0D_IOS_BESEG_SIZE to a value (200G) equal to the size of the metadata disk, errors occur while deploying the cluster.

```
[root@node51 kyk]# hctl bootstrap --mkfs CDF_2+1.yaml
2022-01-28 00:44:49: Generating cluster configuration... OK
2022-01-28 00:44:55: Starting Consul server on this node......... OK
2022-01-28 00:45:02: Importing configuration into the KV store... OK
2022-01-28 00:45:03: Starting Consul on other nodes...Consul ready on all nodes
2022-01-28 00:45:04: Updating Consul configuraton from the KV store... OK
2022-01-28 00:45:10: Waiting for the RC Leader to get elected...../opt/seagate/cortx/hare/bin/../libexec/hare-bootstrap: line 447: ((: == 1 : syntax error: operand expected (error token is "== 1 ") OK
2022-01-28 00:45:13: Starting Motr (phase1, mkfs)... OK
2022-01-28 00:45:21: Starting Motr (phase1, m0d)... OK
2022-01-28 00:45:23: Starting Motr (phase2, mkfs)...Job for motr-mkfs@0x7200000000000001:0x165.service failed because the control process exited with error code.
See "systemctl status motr-mkfs@0x7200000000000001:0x165.service" and "journalctl -xe" for details.
Error at node52 with command PATH=/opt/seagate/cortx/hare/bin/../bin:/opt/seagate/cortx/hare/bin/../libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/bin:/root/bin /opt/seagate/cortx/hare/libexec/bootstrap-node --mkfs-only --phase phase2 --xprt libfab
Job for motr-mkfs@0x7200000000000001:0x22e.service failed because the control process exited with error code.
See "systemctl status motr-mkfs@0x7200000000000001:0x22e.service" and "journalctl -xe" for details.
Error at node53 with command PATH=/opt/seagate/cortx/hare/bin/../bin:/opt/seagate/cortx/hare/bin/../libexec:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/bin:/root/bin /opt/seagate/cortx/hare/libexec/bootstrap-node --mkfs-only --phase phase2 --xprt libfab
Job for motr-mkfs@0x7200000000000001:0x9c.service failed because the control process exited with error code.
See "systemctl status motr-mkfs@0x7200000000000001:0x9c.service" and "journalctl -xe" for details.
```

```
Reading symbols from /usr/sbin/m0mkfs...Reading symbols from /usr/lib/debug/usr/sbin/m0mkfs.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d2/b0d6d232d66c778f97273965ae044c718e7d2b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/m0mkfs -e libfab:inet:tcp:172.16.27.52@3011 -A linuxstob:addb-stobs -'.
Program terminated with signal 6, Aborted.
#0  0x00007f9fa4ebd387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64 isa-l-2.30.0-1.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libfabric-1.11.2-1.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00007f9fa4ebd387 in raise () from /lib64/libc.so.6
#1  0x00007f9fa4ebea78 in abort () from /lib64/libc.so.6
#2  0x00007f9fa6e9142d in m0_arch_panic (c=c@entry=0x7f9fa72a2760 <__pctx.10349>, ap=ap@entry=0x7f9c197f9648) at lib/user_space/uassert.c:131
#3  0x00007f9fa6e807c4 in m0_panic (ctx=ctx@entry=0x7f9fa72a2760 <__pctx.10349>) at lib/assert.c:52
#4  0x00007f9fa6dc22fb in be_io_cb (link=0x4fe8c50) at be/io.c:570
#5  0x00007f9fa6e81d47 in clink_signal (clink=clink@entry=0x4fe8c50) at lib/chan.c:135
#6  0x00007f9fa6e81d9a in chan_signal_nr (chan=chan@entry=0x4fe8b48, nr=0) at lib/chan.c:154
#7  0x00007f9fa6e81e1d in m0_chan_broadcast (chan=chan@entry=0x4fe8b48) at lib/chan.c:174
#8  0x00007f9fa6e81e39 in m0_chan_broadcast_lock (chan=chan@entry=0x4fe8b48) at lib/chan.c:181
#9  0x00007f9fa6f567e3 in ioq_complete (res2=<optimized out>, res=<optimized out>, qev=<optimized out>, ioq=0x3f74a10) at stob/ioq.c:587
#10 stob_ioq_thread (ioq=0x3f74a10) at stob/ioq.c:669
#11 0x00007f9fa6e86f8e in m0_thread_trampoline (arg=arg@entry=0x3f74d10) at lib/thread.c:117
#12 0x00007f9fa6e9218d in uthread_trampoline (arg=0x3f74d10) at lib/user_space/uthread.c:98
#13 0x00007f9fa65ffea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f9fa4f8596d in clone () from /lib64/libc.so.6
```


h3czxp commented 2 years ago

@madhavemuri We ran the test case again:

```
./warp get --duration=2m --warp-client=172.16.27.51:6000,172.16.27.52:6000,172.16.27.53:6000 --host=172.16.27.51:80,172.16.27.52:80,172.16.27.53:80 --access-key=AKIAbYLPub8LSBWbZr8mtSDgZg --secret-key=z9QLkRjvg6uKyOjiaJvF8HJutUKvHyZQVtLuibTS --obj.size=64KiB --objects=1000000 --concurrent=100
```

Unluckily, the gateway time-out error occurs again.

```
Reading symbols from /usr/bin/m0d...Reading symbols from /usr/lib/debug/usr/bin/m0d.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/d2/b0d6d232d66c778f97273965ae044c718e7d2b
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/m0d -e libfab:inet:tcp:172.16.27.52@3009 -A linuxstob:addb-stobs -f <0'.
Program terminated with signal 6, Aborted.
#0  0x00007f9c0e922387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-317.el7.x86_64 isa-l-2.30.0-1.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libfabric-1.11.2-1.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libyaml-0.1.4-11.el7_0.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64 systemd-libs-219-78.el7_9.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00007f9c0e922387 in raise () from /lib64/libc.so.6
#1  0x00007f9c0e923a78 in abort () from /lib64/libc.so.6
#2  0x00007f9c108f642d in m0_arch_panic (c=c@entry=0x7f9c10e21f00 <__pctx.7416>, ap=ap@entry=0x7f9bd97a14a8) at lib/user_space/uassert.c:131
#3  0x00007f9c108e57c4 in m0_panic (ctx=ctx@entry=0x7f9c10e21f00 <__pctx.7416>) at lib/assert.c:52
#4  0x00007f9c1098c05e in m0_sm_ast_post (grp=<optimized out>, ast=ast@entry=0x7f9c10ed1938 <fdmi_global_src_dock+1784>) at sm/sm.c:138
#5  0x00007f9c108afd2f in m0_fdmi__src_dock_fom_wakeup (sd_fom=sd_fom@entry=0x7f9c10ed12d0 <fdmi_global_src_dock+144>) at fdmi/source_dock_fom.c:344
#6  0x00007f9c108acfd8 in m0_fdmi__enqueue_locked (src_rec=src_rec@entry=0x7f96ccb685f0) at fdmi/source_dock.c:247
#7  0x00007f9c108ad03f in m0_fdmi__enqueue (src_rec=src_rec@entry=0x7f96ccb685f0) at fdmi/source_dock.c:257
#8  0x00007f9c108ad6d7 in m0_fdmi__record_post (src_rec=0x7f96ccb685f0) at fdmi/source_dock.c:280
#9  0x00007f9c108a9a82 in m0_fol_fdmi_post_record (fom=fom@entry=0x7f96ccb681a0) at fdmi/fol_fdmi_src.c:688
#10 0x00007f9c108b7d0a in m0_fom_fdmi_record_post (fom=fom@entry=0x7f96ccb681a0) at fop/fom.c:1745
#11 0x00007f9c108b8566 in m0_fom_tx_logged_wait (fom=0x7f96ccb681a0) at fop/fom_generic.c:472
#12 0x00007f9c108b8bb2 in m0_fom_tick_generic (fom=fom@entry=0x7f96ccb681a0) at fop/fom_generic.c:860
#13 0x00007f9c10845193 in cas_fom_tick (fom0=0x7f96ccb681a0) at cas/service.c:1218
#14 0x00007f9c108b70bb in fom_exec (fom=0x7f96ccb681a0) at fop/fom.c:791
#15 loc_handler_thread (th=0x1793030) at fop/fom.c:931
#16 0x00007f9c108ebf8e in m0_thread_trampoline (arg=arg@entry=0x1793038) at lib/thread.c:117
#17 0x00007f9c108f718d in uthread_trampoline (arg=0x1793038) at lib/user_space/uthread.c:98
#18 0x00007f9c10064ea5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f9c0e9ea96d in clone () from /lib64/libc.so.6
```

h3czxp commented 2 years ago

@t7ko-seagate We are executing our test case as you suggested, but the gateway time-out error occurs when we increase the number of objects.

stale[bot] commented 2 years ago

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nileshgovande @bkirunge7 @knrajnambiar76 @t7ko-seagate for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.

madhavemuri commented 2 years ago


@h3czxp : We too have observed a similar issue occurring intermittently, which seems to be a regression. We are actively working on it and will post updates about it here.

huanghua78 commented 2 years ago

This will be fixed in https://jts.seagate.com/browse/EOS-27717. (That Jira is not accessible to external users.)

h3czxp commented 2 years ago

@madhavemuri Hi, when the gateway time-out occurs, the log (INFO) output contains the following error messages. What is the reason for this error?

```
I0208 21:06:20.112838 19696 s3server.cc:134] [on_client_request_error] [ReqID: -] S3 Client disconnected: Reading Error
I0208 21:06:20.112844 19696 request_object.h:293] [client_has_disconnected] [ReqID: 84a6a5120b98] S3 Client disconnected.
```

The relevant flags in third_party/libevent/include/events/bufferevent.h are:

```c
#define BEV_EVENT_READING   0x01 /**< error encountered while reading */
#define BEV_EVENT_WRITING   0x02 /**< error encountered while writing */
#define BEV_EVENT_EOF       0x10 /**< eof file reached */
#define BEV_EVENT_ERROR     0x20 /**< unrecoverable error encountered */
#define BEV_EVENT_TIMEOUT   0x40 /**< user-specified timeout reached */
#define BEV_EVENT_CONNECTED 0x80 /**< connect operation finished. */
```
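For context, a minimal sketch of how a libevent bufferevent event callback distinguishes these flags. This is not the actual s3server handler; the function name on_event and the log strings are illustrative only:

```c
#include <event2/bufferevent.h>
#include <stdio.h>

/* Hypothetical event callback, registered via bufferevent_setcb().
 * "Reading Error" in the s3server log corresponds to
 * BEV_EVENT_ERROR | BEV_EVENT_READING: the socket failed while a
 * read was in progress. */
void on_event(struct bufferevent *bev, short what, void *ctx)
{
	(void)bev;
	(void)ctx;
	if (what & BEV_EVENT_ERROR)
		fprintf(stderr, "S3 Client disconnected: %s Error\n",
			(what & BEV_EVENT_READING) ? "Reading" : "Writing");
	else if (what & BEV_EVENT_EOF)
		fprintf(stderr, "peer closed the connection cleanly\n");
	else if (what & BEV_EVENT_TIMEOUT)
		fprintf(stderr, "user-specified timeout reached\n");
}
```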

t7ko-seagate commented 2 years ago

The exact meaning of this error is that cortx-s3server sees the API request socket being closed. At this level, this is the socket between the haproxy and s3server processes. In turn, this socket may be closed for the following reasons:

Most probably you ran into case nr. 2, as I think you mentioned that you see this error when you increase the load.
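As a rough illustration of the two ways the s3server side can observe that socket going away, here is a hypothetical helper (not s3server code): a clean close yields EOF, while an abortive close yields the "Connection reset by peer, errno: 104" seen in the WARNING log:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical check on the haproxy-facing socket: recv() returning 0
 * means the peer closed cleanly (EOF); a negative return with
 * errno == ECONNRESET (104 on Linux) is "Connection reset by peer". */
void check_peer_socket(int fd)
{
	char buf[4096];
	ssize_t n = recv(fd, buf, sizeof buf, 0);

	if (n == 0)
		fprintf(stderr, "peer closed the connection (EOF)\n");
	else if (n < 0)
		fprintf(stderr, "Socket error: %s, errno: %d\n",
			strerror(errno), errno);
}
```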

h3czxp commented 2 years ago

@t7ko-seagate How much load do you use in your test cases? What object size and how many objects do you use?

gregnsk commented 2 years ago

@h3czxp - we're planning to release a new development version in ~2 weeks. That release should address the Motr issue and replace S3server with RGW. @madhavemuri, @osowski, @huanghua78 - FYI. We'll let you know when this version is ready for installation in your dev environment.

h3czxp commented 2 years ago

@gregnsk - Wow, that sounds great! We are looking forward to the new version.

stale[bot] commented 2 years ago

This issue/pull request has been marked as needs attention as it has been left pending without new activity for 4 days. Tagging @nileshgovande @bkirunge7 @knrajnambiar76 @t7ko-seagate for appropriate assignment. Sorry for the delay & Thank you for contributing to CORTX. We will get back to you as soon as possible.