Very slow to load sequences in UI

parkerabercrombie commented 8 months ago

Checked for duplicates

No - I haven't checked

Is this a regression?

No - This is a new bug

Version

2.4.0

Describe the bug

Observed on our test venue. Opening a sequence in the UI (running command expansion and then clicking "open" on the sequence) takes a very long time to show the resulting sequence. The network call to fetch the sequence takes ~1 min to return.

This is the query that is slow. Running this query directly in the hasura console was also slow.

#graphql
    query GetExpansionSequenceSeqJson($seqId: String!, $simulationDatasetId: Int!) {
      getSequenceSeqJson(seqId: $seqId, simulationDatasetId: $simulationDatasetId) {
        errors {
          location {
            column
            line
          }
          message
          stack
        }
        seqJson
        status
      }
    }

Reproduction

We have only observed this on our dev cloud venue. We were not able to reproduce it locally. It maybe related to the amount of data on the venue.

Logs

No response

System Info

Chrome, hosted on AWS

Severity

Major

cartermak commented 8 months ago

Additional data point: the sequencing server is now returning a 504 to Hasura for expandAllActivities (the /command-expansion/expand-all-activity-instances route on the sequencing server). There's no confirmation that expansion has run successfully in the UI, although we're able to retrieve a sequence with one activity after 4.5 minutes.

parkerabercrombie commented 8 months ago

I see this in the hasura logs from this morning. Not sure if related to the 504 Carter reported or not.

 08:36:02   
{
  "detail": {
    "http_info": {
      "content_encoding": null,
      "http_version": "HTTP/1.1",
      "ip": "100.67.28.190",
      "method": "POST",
      "status": 200,
      "url": "/v1/graphql"
    },
    "operation": {
      "error": {
        "code": "unexpected",
        "error": "database query error",
        "internal": {
          "arguments": [
            "(Oid 114,Just (\"{\\\"x-hasura-role\\\":\\\"admin\\\"}\",Binary))",
            "(Oid 23,Just (\"\\NUL\\NULI&\",Binary))"
          ],
          "error": {
            "description": null,
            "exec_status": "FatalError",
            "hint": "You might need to increase max_locks_per_transaction.",
            "message": "out of shared memory",
            "status_code": "53200"
          },
          "prepared": true,
          "statement": "SELECT  coalesce(json_agg(\"root\" ), '[]' ) AS \"root\" FROM  (SELECT  row_to_json((SELECT  \"_e\"  FROM  (SELECT  \"_root.base\".\"type\" AS \"type\", \"_root.base\".\"duration\" AS \"duration\", \"_root.ar.root.profile_segments\".\"profile_segments\" AS \"profile_segments\", \"_root.base\".\"name\" AS \"name\"       ) AS \"_e\"      ) ) AS \"root\" FROM  (SELECT  *  FROM \"public\".\"profile\"  WHERE ((\"public\".\"profile\".\"dataset_id\") = (($2)::integer))     ) AS \"_root.base\" LEFT OUTER JOIN LATERAL (SELECT  coalesce(json_agg(\"profile_segments\" ), '[]' ) AS \"profile_segments\" FROM  (SELECT  row_to_json((SELECT  \"_e\"  FROM  (SELECT  \"_root.ar.root.profile_segments.base\".\"start_offset\" AS \"start_offset\", \"_root.ar.root.profile_segments.base\".\"dynamics\" AS \"dynamics\", \"_root.ar.root.profile_segments.base\".\"is_gap\" AS \"is_gap\"       ) AS \"_e\"      ) ) AS \"profile_segments\" FROM  (SELECT  *  FROM \"public\".\"profile_segment\"  WHERE (((\"_root.base\".\"id\") = (\"profile_id\")) AND ((\"_root.base\".\"dataset_id\") = (\"dataset_id\")))     ) AS \"_root.ar.root.profile_segments.base\"      ) AS \"_root.ar.root.profile_segments\"      ) AS \"_root.ar.root.profile_segments\" ON ('true')      ) AS \"_root\"      "
        },
        "path": "$"
      },
      "query": {
        "query": "query{\n  profile(where: {dataset_id: {_eq: 18726}}){\n    type\n    duration\n    profile_segments {\n      start_offset\n      dynamics\n      is_gap\n    }\n    name\n  }\n}\n"
      },
      "request_id": "d42042e2-9d1c-4af8-8c00-a0eb3906296f",
      "request_mode": "error",
      "response_size": 1666,
      "user_vars": {
        "x-hasura-role": "admin"
      }
    },
    "request_id": "d42042e2-9d1c-4af8-8c00-a0eb3906296f"
  },
  "level": "error",
  "timestamp": "2024-03-15T15:36:02.387+0000",
  "type": "http-log"
}

dandelany commented 8 months ago

Thanks for the report - @goetzrrGit and @Mythicaeda are looking into this to see if we can prevent the 504 error and/or improve performance.

dandelany commented 8 months ago

Here's an update on our progress today... @goetzrrGit and @Mythicaeda have been working on this for most of the day today and were able to replicate the slow query mentioned at the top of the ticket (GetExpansionSequenceSeqJson):

@Mythicaeda and @goetzrrGit found a fix that improves the performance of this query from roughly >15 seconds to 1 second
The fix is being wrapped up and we expect it to be PR’ed by monday and reviewed by Tuesday at the latest. Since it’s a critical fix, we've decided to hold on the Aerie 2.6.0 release to include it, our plan is to release 2.6.0 by Tuesday EOD.
@parkerabercrombie we can discuss whether you're able to take this 2.6.0 release in your venue or if we need to also release a 2.4.1 or 2.5.1 hotfix - slightly complicated by part 2 of this...:

The comment from @cartermak above (“sequencing server returning a 504 to Hasura for expandAllActivities … although we’re able to retrieve a sequence with one activity after 4.5 minutes.“) is a separate issue which is unrelated to the originally reported slow query

Per discussion with @goetzrrGit, we believe this will see a big improvement from the the new background transpilation process which was implemented in Aerie 2.5.0
However, there may be other contributing factors, potentially some suboptimal config in the Clipper environment, since it appears to be only using 2 CPUs and one worker, and this endpoint was way faster on Theresa’s local machine than on the server
Ultimately it’s difficult to separate these as potential root causes or debug them until the Clipper environment is updated to at least 2.5.0 and re-tested under @goetzrrGit 's observation - let us know if that is a possibility.
Additionally, the fact that a 504 (timeout) error was emitted can/should be fixed by changing our code to treat this as a deferred/lazy/background operation, and not holding up the response waiting for completion. We'll handle this in a separate issue, but this would not directly address the root performance issue, it would just improve the messaging & prevent timeout errors.

cartermak commented 8 months ago

Thank you for the work and the update @dandelany ! @parkerabercrombie , let me know whether/how I can support.

parkerabercrombie commented 8 months ago

Thanks @dandelany ! We can try changing the configuration on that server on Monday.

NASA-AMMOS / aerie