NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0
25 stars 12 forks source link

Threads hang on Kubernetes instance #1792

Open artntek opened 5 months ago

artntek commented 5 months ago

This is for the metacat deployment in k8s. (It may also happen in legacy deployments, but has hitherto gone undetected?) The example below is for replication; however, it is assumed that other operations that create threads would also be affected (e.g. changing system metadata).

NOTE: I could not reproduce this using the steps below, even when looping 1000 times. The actual bug was observed while testing replication in the dev cluster. We noticed the symptoms in the OBSERVED section below, and surmised the threads were not starting. It's unclear how the container first got into this state, though...

STEPS TO REPRODUCE

Basically, you need to start a bunch of threads...

  1. prerequisite: you need sudo access to the sandbox cn cn-sandbox-ucsb-1.test.dataone.org

  2. Set up a metacat instance with a test node cert, to be a MN.

  3. Enable settings for replication with the sandbox CN; for example (values.yaml override):

    
    ## DataONE Member Node (MN) Parameters
    dataone.certificate.fromHttpHeader.enabled: true
    dataone.autoRegisterMemberNode: 2024-01-25
    D1Client.CN_URL: https://cn-sandbox.test.dataone.org/cn
    dataone.nodeId: "urn:node:TestBROOKELT"
    dataone.subject: "CN=urn:node:TestBROOKELT,DC=dataone,DC=org"
    dataone.nodeName: Test BROOKE LT Metacat Node
    dataone.nodeDescription: Dev cluster Test BROOKE LT Metacat Node
    dataone.contactSubject: http://orcid.org/0000-0002-1472-913X
    dataone.nodeSynchronize: true
    dataone.nodeReplicate: true
    dataone.replicationpolicy.default.numreplicas: "1"
  4. Retrieve a valid sysmeta file from the CN like this one, and save it to a file named systemmeta.xml, in the same directory where you will execute the following step:

    <ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
    <serialVersion>31</serialVersion>
    <identifier>testReplicate.1706572592156</identifier>
    <formatId>eml://ecoinformatics.org/eml-2.0.1</formatId>
    <size>12960</size>
    <checksum algorithm="MD5">e9eba01da2e921f03c0239a3632e70ac</checksum>
    <submitter>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</submitter>
    <rightsHolder>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</rightsHolder>
    <accessPolicy>
    <allow>
    <subject>public</subject>
    <permission>read</permission>
    </allow>
    </accessPolicy>
    <replicationPolicy replicationAllowed="true" numberReplicas="1">
    <preferredMemberNode>urn:node:TestBROOKELT</preferredMemberNode>
    </replicationPolicy>
    <archived>false</archived>
    <dateUploaded>2024-01-29T23:56:35.102+00:00</dateUploaded>
    <dateSysMetadataModified>2024-01-29T23:56:35.102+00:00</dateSysMetadataModified>
    <originMemberNode>urn:node:mnSandboxUCSB1</originMemberNode>
    <authoritativeMemberNode>urn:node:mnSandboxUCSB1</authoritativeMemberNode>
    <replica>
    <replicaMemberNode>urn:node:mnSandboxUCSB1</replicaMemberNode>
    <replicationStatus>completed</replicationStatus>
    <replicaVerified>2024-01-29T23:57:11.228+00:00</replicaVerified>
    </replica>
    <replica>
    <replicaMemberNode>urn:node:cnSandbox</replicaMemberNode>
    <replicationStatus>completed</replicationStatus>
    <replicaVerified>2024-01-29T23:57:11.309+00:00</replicaVerified>
    </replica>
    </ns3:systemMetadata>
  5. run this curl command several times (eg in a loop)

    HOST_CONTEXT="https://metacat-dev.test.dataone.org/metacat"
    # replace with your MN host and context
    
    # change upper bound of seq as needed
    for i in `seq 1 100`; do 
    echo $i;
    sudo curl -E /etc/dataone/client/private/urn_node_cnSandboxUCSB1.pem \
              -F "sysmeta=@systemmeta.xml" -F "sourceNode=urn:node:mnSandboxUCSB1" \
             -X POST $HOST_CONTEXT/d1/mn/v2/replicate
    done

EXPECTED

MNResourceHandler logs a message like this:

metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]

It then starts this thread, which should make a call to MNodeService.replicate(), resulting in log output like this:

metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:
    Session.Subject      = CN=urn:node:cnSandboxUCSB1,DC=dataone,DC=org
    identifier           = testReplicate.1706727183679
    Source NodeReference =urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.dataone.MNodeService:replicate:923]

OBSERVED

MNResourceHandler logs a message like this:

metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]

However, it appears that this thread, which should make a call to MNodeService.replicate() is never started, because there is no log output like this:

metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:[...etc]

WORKAROUND

Restarting the pod seems to restore normality

taojing2002 commented 5 months ago

How many times of calling can cause the hanging? I am interested in trying this on a regular Metacat instance.

On Thu, Feb 1, 2024 at 10:44 AM Matthew B @.***> wrote:

This is for the metacat deployment in k8s. It may also happen in legacy deployments, but hitherto gone undetected? The example below is for replication; however, it is assumed that other operations that create threads would also be affected (e.g. changing system metadata). STEPS TO REPRODUCE

Basically, you need to start a bunch of threads...

  1. Set up a metacat instance with a test node cert, to be a MN.

  2. Enable settings for replication with the sandbox CN; for example (values.yaml override):

    DataONE Member Node (MN) Parameters

    dataone.certificate.fromHttpHeader.enabled: true dataone.autoRegisterMemberNode: 2024-01-25 D1Client.CN_URL: https://cn-sandbox.test.dataone.org/cn dataone.nodeId: "urn:node:TestBROOKELT" dataone.subject: "CN=urn:node:TestBROOKELT,DC=dataone,DC=org" dataone.nodeName: Test BROOKE LT Metacat Node dataone.nodeDescription: Dev cluster Test BROOKE LT Metacat Node dataone.contactSubject: http://orcid.org/0000-0002-1472-913X dataone.nodeSynchronize: true dataone.nodeReplicate: true dataone.replicationpolicy.default.numreplicas: "1"

  3. Retrieve a valid sysmeta file from the CN like this one, and save it to a file named systemmeta.xml, in the same directory where you will execute the following step:

31 testReplicate.1706572592156 eml://ecoinformatics.org/eml-2.0.1 12960 e9eba01da2e921f03c0239a3632e70ac CN=urn:node:TestBROOKELT,DC=dataone,DC=org CN=urn:node:TestBROOKELT,DC=dataone,DC=org public read urn:node:TestBROOKELT false 2024-01-29T23:56:35.102+00:00 2024-01-29T23:56:35.102+00:00 urn:node:mnSandboxUCSB1 urn:node:mnSandboxUCSB1 urn:node:mnSandboxUCSB1 completed 2024-01-29T23:57:11.228+00:00 urn:node:cnSandbox completed 2024-01-29T23:57:11.309+00:00 1. run this curl command several times (eg in a loop) curl -E /etc/dataone/client/private/urn_node_cnSandboxUCSB1.pem -F ***@***.***" -F "sourceNode=urn:node:mnSandboxUCSB1" -X POST https://metacat-dev.test.dataone.org/metacat/d1/mn/v2/replicate EXPECTED MNResourceHandler logs a message like this: metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112] It then starts this thread , which should make a call to MNodeService.replicate() , resulting in log output like this: metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters: Session.Subject = CN=urn:node:cnSandboxUCSB1,DC=dataone,DC=org identifier = testReplicate.1706727183679 Source NodeReference =urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.dataone.MNodeService:replicate:923] OBSERVED MNResourceHandler logs a message like this: metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112] However, it appears that this thread , which should make a call to MNodeService.replicate() is never started, because there is no log output like this: metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:[...etc] WORKAROUND Restarting the pod seems to restore normality — Reply to this email directly, view it on GitHub , or unsubscribe . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
artntek commented 5 months ago

I don't have a good answer. 1000 calls didn't trigger it for me. Maybe just keep going until it breaks? :-)