directus / directus

The Modern Data Stack 🐰 — Directus is an instant REST+GraphQL API and intuitive no-code data collaboration app for any SQL database.
https://directus.io
Other
27.1k stars 3.79k forks source link

`start` hangs with no error but other commands work #16493

Closed arpadgabor closed 1 year ago

arpadgabor commented 1 year ago

Describe the Bug

I've redeployed an instance of Directus (with Docker) and out of the blue the start and schema apply commands hang. Other commands such as bootstrap or count <collection> work fine.

Also:

To Reproduce

Running directus start or directus schema apply sometimes triggers this.

Not sure if relevant, but before upgrading I was experiencing #16315 as well, now the error no longer shows up but startup or schema apply hang.

Also, does not reproduce on local, but on local I did not experience #16315.

Errors Shown

Nope. But here's a screenshot with a reproduction with LOG_STYLE=raw

image

After the last line nothing happens.

What version of Directus are you using?

9.21.0

What version of Node.js are you using?

16.18.1

What database are you using?

Postgres, postgis/postgis:13-3.3-alpine

What browser are you using?

n/a

How are you deploying Directus?

Docker

arpadgabor commented 1 year ago

I managed to track down the source to be inside the schema package when trying to fetch the schema overview. Trying to track down where exactly it hangs.

arpadgabor commented 1 year ago

Apparently this query is the culprit.

https://github.com/directus/directus/blob/eb71b5e32093b6bbc4b9ae898034aaf2355c69d5/packages/schema/src/dialects/postgres.ts#L32-L46

In my case, a count on the information_schema.columns table yields 2686 results. Limiting to 500 runs slow, 1000 will hang. Not sure why it happened just now, but will let you know if I find any solution.

Edit: Ran SELECT c.table_name, c.column_name, c.column_default FROM information_schema.columns c to test, hangs. However removing c.column_default from the query is fast again.

Edit 2: The full query without column_default is fast as well, this column causes the slowness... Example of OK query

SELECT
  c.table_name,
  c.column_name,
  c.data_type,
  c.character_maximum_length as max_length,
  c.is_generated = 'ALWAYS' is_generated,
  CASE
    WHEN c.is_identity = 'YES' THEN true
    ELSE false
  END is_identity,
  CASE
    WHEN c.is_nullable = 'YES' THEN true
    ELSE false
  END is_nullable
FROM
  information_schema.columns c
  LEFT JOIN information_schema.tables t ON c.table_name = t.table_name
WHERE
  t.table_type = 'BASE TABLE'
  AND c.table_schema IN ('public')

Edit: After more digging, looks like the issue is not related strictly to the column_default field, but to specific columns. Found out the hang is happening mainly on some id: integer fields on two collections.

arpadgabor commented 1 year ago

So apparently there is a specific column where it hangs, and it does this consistently the 4th time we fetch the columns for the database overview... but I don't understand why specifically the 4th time and why not every single time. Weird.

Edit: ran vacuum full and the freeze moved to another column.

arpadgabor commented 1 year ago

Reverted postgres to v12, still no success.. this is getting really infuriating. Maybe this issue isn't related to Directus afterall, but if anyone has any ideas how to fix this, any new idea is welcome...

rijkvanzanten commented 1 year ago

Very odd indeed... I don't really have a good answer as to why that might be happening, but it does indeed sound like the database is choking up on something. I'll close this for now as we have no way to accurately reproduce this (yet), but more than happy to keep debugging with you on this thread

rijkvanzanten commented 1 year ago

I know this is a massive schema, which makes making a repro a little difficult, but is there any good way for me to get the same error locally to test some things out?

arpadgabor commented 1 year ago

I'll try to make a repro repository in a couple of hours. I am also suspecting the transaction reaches some large amount of locks. I noticed my whole database freezes while that query is running. As in, no other query is answering in my adminer instance during that query. After I close the process everything gets back to normal. (Tried adjusting max_locks_per_transaction to 1024 but no success)

arpadgabor commented 1 year ago

Of course it does not reproduce in a new repository... Apparently I am really having a bad day today. I'll keep this thread updated if I have some new updates regarding this problem but right now, after deleting all my local volumes that I created for testing, it started working again. Maybe there was some corrupted data somewhere. I'm kinda bummed because the same thing happens on my server, I'll try and do a pg_dump there and see if after recovery it works.

rijkvanzanten commented 1 year ago

Ay yeah that makes it super tricky to reproduce and fix then :C

For what it's worth, and I know it doesn't matter for right now, but at least I'm glad to report we're actively working on replacing the "read the whole schema up front and cache it" setup to a more "read the schema when it's needed and cache the smaller pieces" approach, which should resolve this as well. That's not a "done this afternoon, launch tonight" sorta thing though

arpadgabor commented 1 year ago

@rijkvanzanten I managed to reproduce the issue! And looks like consistently. Here's a repro: https://github.com/arpadgabor/directus-bugs. You can check the readme for reproduction steps.

arpadgabor commented 1 year ago

@rijkvanzanten Wondering, as you reopened, have you been able to reproduce this?

rijkvanzanten commented 1 year ago

@arpadgabor No, not yet! I've been absolutely swamped with a million and one things; but now that you've shared that reproduction repo, there should be more information available for me or someone else in the community to take another look! I reopened it mostly to make sure it wouldn't get forgotten 🙂

paescuj commented 1 year ago

Hey, this seems to be fixed in the meantime, at least your repro works now when testing against the latest version (9.24.0). In the last few releases we've published quite some fixes and improvements regarding schema & caching, so chances are high this really isn't any issue anymore. Therefore I'll close this for now, but happy to keep discussing in this thread if you think this is still an issue 👍