support Redshift - Githubissues

HarlanH commented 6 years ago

Redshift is based on Postgres 8.0.2, so a few things are different. Maybe this issue should be a list of known issues?

pg_authid does not exist

Failed to execute query "SELECT * FROM pg_authid WHERE rolname != 'pg_signal_backend';": relation "pg_authid" does not exist

It looks like it was introduced in 8.1:

This change also replaces pg_shadow and pg_group by new role-capable catalogs pg_authid and pg_auth_members.

zcmarine commented 6 years ago

Hey Harlan, that's a great request! My understanding is that the best way to verify whether something works with redshift is to test against Postgres 8.0. Do you have any sense of whether getting pgbedrock working with Postgres 8.0 is roughly equivalent to getting it working with Redshift, or are there other distinctions between those two systems?

In addition to swapping out pg_authid, we'd also need to skip the default privilege grants / removals as that wasn't introduced until Postgres 9.0.

In any case, I think the solution is relatively simple: upon connecting to the database, immediately ask what Postgres version we're working with. For default privileges we then just skip that block of code if we're in an older version. With regard to pg_authid, I'm curious whether we can just use pg_shadow / pg_group permanently or if we will need to similarly branch. Until we tinker with Postgres 8.0 it likely won't be clear.

As far as specific testing process, I think this should be pretty straightforward (at least to describe in words): add a Postgres 8.0 docker container and run our test suite against it. We'll need to mark tests to skip if we're testing the older Postgres version (e.g. default privileges), and we may have to even split out some existing tests to account for the expected differing behavior. As an example, we currently test a grant to myschema.* and make sure that a grant happens for all tables as well as a default privilege grant occurs. We'd need to skip that test for Postgres 8.0 and have a new test that asserts all the same things except for the default privilege grant.

Assuming that Postgres 8.0 is functionally equivalent to Redshift, if we can get all tests to pass then we should be in a good place, but until someone actually tests against a Redshift I guess it's still a bit of a guess.

We don't actually use Redshift at Squarespace so I'm not sure how quickly this would get prioritized on my end, but I'm also happy to sync up and offer guidance on how to try implementing this if it's higher priority on your end.

zcmarine commented 6 years ago

After poking around a bit more, it looks like Redshift does support default privileges as per AWS's documentation here, yet Postgres 8.0.2 definitely does not.

First, the Postgres release notes first mention default privileges in 9.0. Additionally, when I try running a Postgres 8.0.2 docker instance I can't get an ALTER DEFAULT PRIVILEGES statement to run. Moreover, within that container there doesn't appear to be a pg_catalog.pg_default_acl table, which is how default privileges are maintained (at least in Postgres 9+).

For documentation purposes, I looked into this by getting a local Postgres 8.0.2 running via:

docker run -p 5439:5439 --name local_redshift guildeducation/docker-amazon-redshift

I then tested default privileges with:

CREATE USER test_user;
CREATE SCHEMA test_schema;
ALTER DEFAULT PRIVILEGES IN SCHEMA test_schema GRANT SELECT ON TABLES TO test_user;

I then searched for the default ACL table via:

SELECT * 
FROM information_schema.tables 
WHERE table_schema = 'pg_catalog' 
ORDER BY table_name;

The takeaway from this is that I'm not really sure how to properly test redshift then: my thought was that we could spin up a Postgres 8.0.2 container and operate against that, but if that doesn't match Redshift (which based on the above it does not), then I don't know how to reliably test redshift without having a live redshift and tests that rely on network connectivity, which feels very awkward. Maybe you have some ideas though Harlan.

HarlanH commented 6 years ago

Hi Zach, OK, great. I think that managing permissions is an even bigger problem on analytical databases like Redshift than on operational databases. That's certainly true for us.

Yes, Redshift has backported a bunch of stuff from post 8.0.2, including the default privs stuff. I don't think that you'll be able to test Redshift without an actual Redshift instance. I wonder how other open source projects handle testing against non-free databases?

zcmarine commented 6 years ago

Hey Harlan, I looked around a bit and it seems like people just create another redshift instance and test on that, so that seems like the best path forward. I'm hoping because there isn't much data being added / removed that the cost element would be pretty small since I don't want to discourage people from testing.

I'm hoping that since Redshift appears to support default privileges that there isn't much machinery here that will have to change, i.e. perhaps it's just that the system catalogs are still organized the way Postgres used to be and it's just a few tables that are missing. In that vein, could you try running the below query and tell me what comes back? It's just listing out which of the pg_catalog tables that pgbedrock relies on are missing within Redshift:

SELECT
    desired_tables.table_name AS missing_tables
FROM (
    VALUES
        ('pg_auth_members'),
        ('pg_authid'),
        ('pg_class'),
        ('pg_default_acl'),
        ('pg_depend'),
        ('pg_namespace'),
        ('pg_roles')
    ) AS desired_tables (table_name)
    LEFT JOIN information_schema.tables info_tables
        ON desired_tables.table_name = info_tables.table_name
        AND info_tables.table_schema = 'pg_catalog'
WHERE
    info_tables.table_name IS NULL
;

HarlanH commented 6 years ago

Yeah -- a single-node dc2.large instance is 25 cents an hour, so if there was a spin-up/test/spin-down script for Redshift, it'd be no big deal...

That query doesn't run because Redshift doesn't have VALUES. This query:

SELECT table_name, 
    table_type,
    table_name in ('pg_auth_members', 'pg_authid', 'pg_class', 'pg_default_acl', 'pg_depend', 'pg_namespace', 'pg_roles') as present
from information_schema.tables info_tables
WHERE info_tables.table_schema = 'pg_catalog'
order by 3 desc

yields:

pg_default_acl  BASE TABLE  t
pg_class    BASE TABLE  t
pg_namespace    BASE TABLE  t
pg_depend   BASE TABLE  t
pg_conf BASE TABLE  f
pg_library  BASE TABLE  f
pg_shdepend BASE TABLE  f
pg_statistic    BASE TABLE  f
pg_type BASE TABLE  f
pg_attribute    BASE TABLE  f
pg_tablespace   BASE TABLE  f
pg_inherits BASE TABLE  f
pg_index    BASE TABLE  f
pg_operator BASE TABLE  f
pg_opclass  BASE TABLE  f
pg_am   BASE TABLE  f
pg_amop BASE TABLE  f
pg_amproc   BASE TABLE  f
pg_language BASE TABLE  f
pg_largeobject  BASE TABLE  f
pg_aggregate    BASE TABLE  f
pg_statistic_multicol   BASE TABLE  f
pg_trigger  BASE TABLE  f
pg_listener BASE TABLE  f
pg_cast BASE TABLE  f
pg_conversion   BASE TABLE  f
pg_udf  BASE TABLE  f
pg_bar_repos    BASE TABLE  f
pg_bar_state    BASE TABLE  f
pg_highwatermark    BASE TABLE  f
pg_resize   BASE TABLE  f
pg_attrdef  BASE TABLE  f
pg_constraint   BASE TABLE  f
pg_database BASE TABLE  f
pg_description  BASE TABLE  f
pg_group    BASE TABLE  f
pg_proc BASE TABLE  f
pg_rewrite  BASE TABLE  f
pg_bar_ddllog   BASE TABLE  f
pg_test BASE TABLE  f
pg_statistic_indicator  BASE TABLE  f
pg_user VIEW    f
pg_rules    VIEW    f
pg_views    VIEW    f
pg_tables   VIEW    f
pg_table_def    VIEW    f
pg_indexes  VIEW    f
pg_stats    VIEW    f
pg_stat_all_tables  VIEW    f
pg_stat_sys_tables  VIEW    f
pg_stat_user_tables VIEW    f
pg_statio_all_tables    VIEW    f
pg_statio_sys_tables    VIEW    f
pg_statio_user_tables   VIEW    f
pg_stat_all_indexes VIEW    f
pg_stat_sys_indexes VIEW    f
pg_stat_user_indexes    VIEW    f
pg_statio_all_indexes   VIEW    f
pg_statio_sys_indexes   VIEW    f
pg_statio_user_indexes  VIEW    f
pg_statio_all_sequences VIEW    f
pg_statio_sys_sequences VIEW    f
pg_statio_user_sequences    VIEW    f
pg_stat_activity    VIEW    f
pg_stat_database    VIEW    f
pg_locks    VIEW    f
pg_user_info    VIEW    f
pg_database_info    VIEW    f
pg_settings VIEW    f
pg_external_schema  BASE TABLE  f
stv_wlm_qmr_config  BASE TABLE  f
stl_rlf_scan    BASE TABLE  f
stl_internal_query_details  BASE TABLE  f
svl_awsclient_error VIEW    f
svl_s3requests  VIEW    f
svv_comm_pkt_latency_hist   VIEW    f
svv_comm_pkt_latency    VIEW    f
svv_io_latency_read VIEW    f
svv_io_latency_sync VIEW    f
svv_io_latency_write    VIEW    f
svv_sem_usage   VIEW    f
svv_sem_usage_leader    VIEW    f
svl_user_info   VIEW    f
padb_config_harvest BASE TABLE  f
stl_aggr    BASE TABLE  f
stl_alert_event_log BASE TABLE  f
stl_analyze BASE TABLE  f
stl_bcast   BASE TABLE  f
stl_column_stats    BASE TABLE  f
stl_commit_stats    BASE TABLE  f
stl_connection_log  BASE TABLE  f
stl_ddltext BASE TABLE  f
stl_delete  BASE TABLE  f
stl_disk_full_diag  BASE TABLE  f
stl_dist    BASE TABLE  f
stl_error   BASE TABLE  f
stl_explain BASE TABLE  f
stl_file_scan   BASE TABLE  f
stl_hash    BASE TABLE  f
stl_hashjoin    BASE TABLE  f
stl_insert  BASE TABLE  f
stl_leader_snapshot BASE TABLE  f
stl_limit   BASE TABLE  f
stl_load_commits    BASE TABLE  f
stl_load_errors BASE TABLE  f
stl_load_trace  BASE TABLE  f
stl_loaderror_detail    BASE TABLE  f
stl_merge   BASE TABLE  f
stl_mergejoin   BASE TABLE  f
stl_metadata_step   BASE TABLE  f
stl_nestloop    BASE TABLE  f
stl_parse   BASE TABLE  f
stl_plan_info   BASE TABLE  f
stl_project BASE TABLE  f
stl_query   BASE TABLE  f
stl_query_metrics   BASE TABLE  f
stl_querytext   BASE TABLE  f
stl_replacements    BASE TABLE  f
stl_restarted_sessions  BASE TABLE  f
stl_return  BASE TABLE  f
stl_rpc BASE TABLE  f
stl_s3client    BASE TABLE  f
stl_s3client_error  BASE TABLE  f
stl_save    BASE TABLE  f
stl_scan    BASE TABLE  f
stl_sessions    BASE TABLE  f
stl_sort    BASE TABLE  f
stl_sshclient_error BASE TABLE  f
stl_stream_segs BASE TABLE  f
stl_tr_conflict BASE TABLE  f
stl_udf_compile_error   BASE TABLE  f
stl_undone  BASE TABLE  f
stl_unique  BASE TABLE  f
stl_unload_log  BASE TABLE  f
stl_userlog BASE TABLE  f
stl_utilitytext BASE TABLE  f
stl_vacuum  BASE TABLE  f
stl_vacuum_detail   BASE TABLE  f
stl_warning BASE TABLE  f
stl_window  BASE TABLE  f
stl_wlm_error   BASE TABLE  f
stl_wlm_query   BASE TABLE  f
stl_wlm_rule_action BASE TABLE  f
stv_active_cursors  BASE TABLE  f
stv_block_reps  BASE TABLE  f
stv_blocklist   BASE TABLE  f
stv_cursor_configuration    BASE TABLE  f
stv_disk_extents    BASE TABLE  f
stv_exec_state  BASE TABLE  f
stv_inflight    BASE TABLE  f
stv_interleaved_counts  BASE TABLE  f
stv_load_state  BASE TABLE  f
stv_locks   BASE TABLE  f
stv_partitions  BASE TABLE  f
stv_pg_wal_length   BASE TABLE  f
stv_proc_stat   BASE TABLE  f
stv_query_metrics   BASE TABLE  f
stv_query_stats BASE TABLE  f
stv_recents BASE TABLE  f
stv_sessions    BASE TABLE  f
stv_slices  BASE TABLE  f
stv_startup_recovery_state  BASE TABLE  f
stv_tbl_perm    BASE TABLE  f
stv_tbl_trans   BASE TABLE  f
stv_underrepped_blocks  BASE TABLE  f
stv_wlm_classification_config   BASE TABLE  f
stv_wlm_query_queue_state   BASE TABLE  f
stv_wlm_query_state BASE TABLE  f
stv_wlm_query_task_state    BASE TABLE  f
stv_wlm_service_class_config    BASE TABLE  f
stv_wlm_service_class_state BASE TABLE  f
svl_compile VIEW    f
svl_qlog    VIEW    f
svl_query_metrics   VIEW    f
svl_query_metrics_summary   VIEW    f
svl_query_queue_info    VIEW    f
svl_query_report    VIEW    f
svl_query_summary   VIEW    f
svl_s3catalog   VIEW    f
svl_s3list  VIEW    f
svl_s3log   VIEW    f
svl_s3partition VIEW    f
svl_s3partition_summary VIEW    f
svl_s3query VIEW    f
svl_s3query_summary VIEW    f
svl_s3retries   VIEW    f
svl_statementtext   VIEW    f
svl_udf_log VIEW    f
svl_vacuum_percentage   VIEW    f
svv_columns VIEW    f
svv_diskusage   VIEW    f
svv_external_columns    VIEW    f
svv_external_databases  VIEW    f
svv_external_partitions VIEW    f
svv_external_schemas    VIEW    f
svv_external_tables VIEW    f
svv_interleaved_columns VIEW    f
svv_query_inflight  VIEW    f
svv_query_state VIEW    f
svv_restore_table_state VIEW    f
svv_table_info  VIEW    f
svv_tables  VIEW    f
svv_transactions    VIEW    f
svv_vacuum_progress VIEW    f
svv_vacuum_summary  VIEW    f
systable_globaldict BASE TABLE  f
systable_schema BASE TABLE  f
systable_topology   BASE TABLE  f

zcmarine commented 6 years ago

Interesting. So it looks like it's missing three tables that pgbedrock uses:

pg_roles - this doesn't really need to be used; we should just swap out the one reference to this for pg_authid instead
pg_auth_members - it looks like this could be replaced by pg_group (Postgres docs)
pg_authid - This looks like the most challenging relation to swap out, for a few reasons:
1. The attributes that are available via pg_user (Postgres docs) differ, so attributes.py would need to behave differently.
2. Group roles in much older versions of Postgres behaved differently: they didn't have attributes themselves and I think (?) were just there to manage privileges exclusively. There'd be a bit of a ramp-up making sure we understand how to use group roles properly in this other context.
3. We'd need to handle login and non-login roles separately. As an example, changing the login attribute for a role would likely mean we'd have to DROP USER foo; and then CREATE GROUP foo;. Similarly, if someone wanted to modify some attributes of a group role (e.g. give it a connection limit) I think that wouldn't work, so in the Redshift implementation we'd have to put additional schema validation in against specs to explain what is and isn't possible.
4. The most trivial issue: Currently we look up all roles in pg_authid, but now we'd need to decide how to work across two tables: looking up non-login roles in pg_group and looking up login roles in pg_user.

In addition, we'd need to figure out how to test against Redshift, i.e. potentially add custom markers listing tests that should only run on Redshift and others that should only run on Postgres, then add the ability to swap out the test URL so it can hit a real Redshift.

Again, I'm not sure how highly prioritized this will get at Squarespace as we don't really use Redshift, but it seems like this is doable if someone who uses Redshift would want to take the lead on trying to flesh this out with my help.

HarlanH commented 6 years ago

Huge thanks for doing this research, Zach! I think this gives a big head-start to whoever can devote some time to this. (Hoping my colleague @hanleyhansen might be interested...? :) )

clrcrl commented 6 years ago

So I'm hacking away at this with the goal to make my forked repo Redshift-compatible (but in the process, I'm losing Postgres compatibility).

I'm pretty new to python, so I'm using this as an exercise in teaching myself, since the subject matter (Redshift user permissions) is very familiar.

As such, I don't think any of the code that I produce will be pull-request-standard. If I get my forked repo working for Redshift, I'm happy to share it around, but I don't feel I'll be the right person to introduce multi-database compatibility. Nonetheless, what I'm learning along the way will be very valuable to the person that does take this issue on.

I haven't got it working yet, and I don't imagine I'll get another chance to look at it for at least a week, so just wanted to share my current progress in case someone is looking at this soon.

So far the noteworthy challenges are:

Group memberships

Group memberships are completely different in Redshift, and quite a challenge
- There's no equivalent table to pg_auth_members, instead pg_groups stores group memberships as an array, returning tuples of (group, members), e.g.:

groname	grosysid	grolist
webpowerusers	101
webdevusers	102
webappusers	100	102, 103

grolist is an array (according to information_schema.columns), which is weird, because arrays aren't supported in Redshift
It's very difficult to unnest arrays (which are normally stored as strings) in Redshift, without using some sort of pre-made sequence of numbers (see this tutorial)
To work around this, I considered doing a Regex match between pg_user.usesysid and pg_group.gro_list to join them together (yuk!), but I couldn't do Regex on the array type. And then because arrays are not a supported data type, it won't let me cast it to a string ¯_(ツ)_/¯
So it seems to me that for Redshift compatibility, the group memberships will have to be returned to pgbedrock as arrays, and then I'll have to use python to unnest the array, to get the tuples of (member, group).
Also note that rather than granting membership of a group to a role (user), in Redshift, you have to alter group "group_name" add user "user_name" - a bit strange but nothing too bad.

Users & Groups versus Roles

As you've already realised, Redshift treats users and groups as separate objects, compared to postgres which uses roles.
In working through this, I decided it would be easier to replace can_login: in the config with a required field, role_type: (user/group).
I then implemented some extra Cerberus validation so that if role_type == group, certain configurations cannot exist (e.g. for owns:, role_type must equal user, since groups cannot own tables/schemas in Redshift).

Passwords

Passwords are required for Redshift, and have constraints on length/character-types
This is compared to Postgres where they are optional

Superusers

The concept of a superuser in Redshift is slightly different to postgres - basically a superuser is just someone with the createuser privilege, so createuser should not be able to be defined in attributes

mikekaminsky commented 6 years ago

@clrcrl have you made any progress on this? I might be interested in collaborating.

JoeNaso commented 5 years ago

I'm pretty late to the party here, but have there been any developments with Redshift compatibility?

ghost commented 4 years ago

+1

Squarespace / pgbedrock

support Redshift #9

pg_authid does not exist

Group memberships

Users & Groups versus Roles

Passwords

Superusers