DatabaseCleaner / database_cleaner-active_record

Strategies for cleaning databases using ActiveRecord. Can be used to ensure a clean state for testing.
MIT License
61 stars 64 forks source link

ActiveRecord Truncation does not work when using multiple schemas #6

Open abhchand opened 6 years ago

abhchand commented 6 years ago

Hello - thanks for putting together database_cleaner! I've been using the gem for a while and I know maintenance takes effort, so appreciate those who contribute to this project.

I'm using the following setup -

Like many multi-tenant applications, my app utilizes postgres schemas which store parrallel, independent copies of my database tables.

So if I have two tenants (earth, mars) and two tables (users, companies) I would have the following tables in my DB -

schema_migrations
ar_internal_metadata
public.users
public.companies
earth.users
earth.companies
mars.users
mars.companies

When using DatabaseCleaner.strategy = :truncation, it looks for a list of tables in my DB to truncate (see here).

When I run that query against my local test database -

SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
  tablename !~ '_prt_' AND
  tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
  AND schemaname = ANY (current_schemas(false))
;
                 ?column?
-------------------------------------------
 public.users
 public.companies
(2 rows)

As you can see, it only returns the public schemas, because of the schemaname = criteria. This effectively doesn't wipe my non-public schema tables, causing run-over between tests.

I understand that there are options to use a transaction instead, or even the only: [..] parameter to list all my tables. But it seems like this should work as is without having to use one of those as a workaround (and if my DB is quite large, a whitelist can get messy and inefficient across thousands of tests).

Would it make sense to update the query logic to something like

SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
  tablename !~ '_prt_' AND
  tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
;

Thanks!

Related to: https://github.com/DatabaseCleaner/database_cleaner/issues/225

fcheung commented 6 years ago

I've come across a similar thing. For now doing

  def with_all_schemas
    old = ActiveRecord::Base.connection.schema_search_path
    begin
      ActiveRecord::Base.connection.schema_search_path = [
        your_list_of_schemas_here
      ].join(',')
      yield
    ensure
      ActiveRecord::Base.connection.schema_search_path = old
    end
  end

and then wrapping calls to Databasecleaner.clean with with_all_schemas seems to do the trick

deathwish commented 5 years ago

Honestly, this is pretty awful.

But the solution above works and can be improved by using

select schema_name from information_schema.schemata where schema_name not like 'information_schema' and schema_name not like 'pg_%'

to fetch all schemas in the active database for cleaning, which is the behavior I expected database_cleaner to exhibit.

alejandro-falkowski-gelato commented 4 years ago

For me as I am using sequel was to use search_path as described here https://github.com/jeremyevans/sequel/blob/master/doc/opening_databases.rdoc#postgres

emad-elsaid commented 4 years ago

My workaround was to just truncate the tables I need myself after each test example:

  config.after(:each) do
    DatabaseCleaner.clean
    ['schema.tablename'].each do |table|
      ActiveRecord::Base.connection.execute("TRUNCATE #{table}")
    end
  end
botandrose commented 4 years ago

Hi, so I'd love to help get this problem solved, but honestly I'm not familiar with with the concept of multiple schemas in PostgreSQL, so I'm going to need some help understanding. Can someone explain why an app might have more than one schema?

Also, about the current clause that is causing the issue: schemaname = ANY (current_schemas(false)): Looking in the git history, the commit that added this line has the following to say:

commit 77bca129d6031571b4da1edf63b9f5ae93fc6ee2
Author: Billy Watson <billy@korrelate.com>
Date:   Tue Jul 22 12:31:03 2014 -0400

    always return Postgres table names with schema to avoid awfulness

    - without the schema, the migrations table gets truncated since the default AR method returns ‘public.schema_migrations’
    - also a possibility is truncation of tables other than desired once since without the schema, Postgres has no way of knowing which table to truncate if two schemas have the same table

Can someone unpack this for me?

Finally, in the OP @abhchand has suggested the possible solution of replacing the line with schemaname NOT IN ('pg_catalog', 'information_schema'). Can I get some opinions on this suggestion? Would it address the problem exhibited here, while still addressing the concerns that Billy Watson was trying to solve in his commit?

Adamantish commented 4 years ago

Hiya @botandrose . The reason we're using multiple schemata is to do close integration with a proprietary 3rd party application. There are some good reasons not to do it this way (it's a microservices sin and I feel a little bit ill about it) but amongst the advantages we find are...

I think a general solution which would ensure truncate strategy works might be to count records in all tables everywhere at the start of the whole test suite and truncate all that had zero records.

erlinis commented 4 years ago

There are many reasons why some apps use several schemas, but for sure it is a thing.

What Billy Watson was trying to solve is totally fine, but there is one piece in his solution that is not working as as he is expecting (current_schemas(false), even when the database has multiple schemas current_schemas(false) is only returning {public} because it depends what is set in the search_path.

select current_schemas(false);

 current_schemas
-----------------
 {public}

While if you execute the query suggested by @deathwish you will indeed get all your schemas (excluding the Postgres' ones of course)

 SELECT schema_name FROM information_schema.schemata WHERE schema_name not like 'information_schema' AND schema_name not like 'pg_%';

 schema_name
-------------
 public
 archive

So, combining the query above with the current code, it will solve the issue while still addressing the concern that Billy Watson was trying to solve since it now lists the tables in all schemas and still appending the schema name.

SELECT schemaname || '.' || tablename as table_name
  FROM pg_tables
 WHERE tablename !~ '_prt_'  
   AND tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata' 
   AND schemaname = ANY (
        SELECT schema_name 
         FROM information_schema.schemata 
        WHERE schema_name not like 'information_schema' 
          AND schema_name not like 'pg_%'
       );

                table_name
-----------------------------------------
 public.users
 public.questions
 public.responses
 public.versions
 archive.logs
 archive.versions
erlinis commented 4 years ago

Or even simpler as @alejandro-falkowski-gelato suggested, setting up the search_path in the database configuration will in fact lists all the desired schemas when running current_schemas(false)

test:
  adapter: postgresql
  host: localhost
  database: db_test
  ...
  schema_search_path: 'schema1,public'

So it will worth at least mention in the docs that you should set the search_path if you expect DatabaseCleaner also truncate the tables in all schemas rather than only public

botandrose commented 4 years ago

Thank you, @Adamantish, for the description of your use-case... I think I have a good handle on this now! And thank you, @erlinis, for your superb detective work and proposed solution, which appears to me to be solid.

Is there any reason why one might not want to clean all tables outside of the public schema? Let's say we were to fold this change into a hypothetical v1.8.5 bugfix release of database_cleaner. Would this, in fact, be a breaking change for some users? In that case, would it be better to release it in the form of a new option to the truncation strategy, like all_schemas: true etc?

erlinis commented 4 years ago

Hi @botandrose, A new option to the truncation strategy sounds great, but instead of all_schemas: true/false, I think would be better to have something like included_schemas: [ 'public', 'schema1'] and if nothing is specified ['public'] is the default value.

This way current behavior won't be affected and the developer will have more control over what clean.

botandrose commented 4 years ago

@erlinis I like it! What do others in this thread think? Are you game for taking a stab at it in a PR?

deathwish commented 4 years ago

Like the original submitter's per-tenant schema use case, I'd definitely prefer to be able to clean all schemata rather than saying which ones should be cleaned. That said, I agree that being able to specify which schemata are cleaned is also important.

We are in dynamic-language land here, so why not both? Something like schemata: :all is not incompatible with schemata: ['public', 'schema1'], and allows the later addition of additional functionality like schemata: {except: 'schema2'}.

The only other note I have immediately is that this should interact correctly with the cache: true|false option as documented, which also implies that it should be possible to have a non-fixed list of schemata to be cleaned.

botandrose commented 4 years ago

@deathwish All great points! Like you, I think :except can be punted on into its own subsequent PR, and agree that this functionality should provide an :all option, for those who don't want to enumerate every schema. Also agree on respecting the :cache option. Thank you for pointing these things out!

jherdman commented 2 years ago

Is this issue still relevant with the latest version of this gem?

timscott commented 1 year ago

@jherdman - I am on the latest version (2.0.1) and can confirm it's still an issue. @botandrose, anything in the works?

timscott commented 1 year ago

This line seems to be the problem.

@only = all_tables.map { |table| table.split(".").last }

For me all_tables contains the following two entries: public.items and events.items. So database cleaner will execute DELETE FROM items twice.

Is there a reason to remove the schema? Seems like a simple bug fix, change this:

all_tables = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables
@only = all_tables.map { |table| table.split(".").last }

To this:

@only = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables
kanmeiban commented 6 months ago

@timscott A few lines below is

@except += connection.database_cleaner_view_cache + migration_storage_names

which should also be turned into FQN for your suggestion to work.