Open HarlanH opened 6 years ago
Hey Harlan, that's a great request! My understanding is that the best way to verify whether something works with redshift is to test against Postgres 8.0. Do you have any sense of whether getting pgbedrock working with Postgres 8.0 is roughly equivalent to getting it working with Redshift, or are there other distinctions between those two systems?
In addition to swapping out pg_authid
, we'd also need to skip the default privilege grants / removals as that wasn't introduced until Postgres 9.0.
In any case, I think the solution is relatively simple: upon connecting to the database, immediately ask what Postgres version we're working with. For default privileges we then just skip that block of code if we're in an older version. With regard to pg_authid
, I'm curious whether we can just use pg_shadow
/ pg_group
permanently or if we will need to similarly branch. Until we tinker with Postgres 8.0 it likely won't be clear.
As far as specific testing process, I think this should be pretty straightforward (at least to describe in words): add a Postgres 8.0 docker container and run our test suite against it. We'll need to mark tests to skip if we're testing the older Postgres version (e.g. default privileges), and we may have to even split out some existing tests to account for the expected differing behavior. As an example, we currently test a grant to myschema.*
and make sure that a grant happens for all tables as well as a default privilege grant occurs. We'd need to skip that test for Postgres 8.0 and have a new test that asserts all the same things except for the default privilege grant.
Assuming that Postgres 8.0 is functionally equivalent to Redshift, if we can get all tests to pass then we should be in a good place, but until someone actually tests against a Redshift I guess it's still a bit of a guess.
We don't actually use Redshift at Squarespace so I'm not sure how quickly this would get prioritized on my end, but I'm also happy to sync up and offer guidance on how to try implementing this if it's higher priority on your end.
After poking around a bit more, it looks like Redshift does support default privileges as per AWS's documentation here, yet Postgres 8.0.2 definitely does not.
First, the Postgres release notes first mention default privileges in 9.0. Additionally, when I try running a Postgres 8.0.2 docker instance I can't get an ALTER DEFAULT PRIVILEGES
statement to run. Moreover, within that container there doesn't appear to be a pg_catalog.pg_default_acl
table, which is how default privileges are maintained (at least in Postgres 9+).
For documentation purposes, I looked into this by getting a local Postgres 8.0.2 running via:
docker run -p 5439:5439 --name local_redshift guildeducation/docker-amazon-redshift
I then tested default privileges with:
CREATE USER test_user;
CREATE SCHEMA test_schema;
ALTER DEFAULT PRIVILEGES IN SCHEMA test_schema GRANT SELECT ON TABLES TO test_user;
I then searched for the default ACL table via:
SELECT *
FROM information_schema.tables
WHERE table_schema = 'pg_catalog'
ORDER BY table_name;
The takeaway from this is that I'm not really sure how to properly test redshift then: my thought was that we could spin up a Postgres 8.0.2 container and operate against that, but if that doesn't match Redshift (which based on the above it does not), then I don't know how to reliably test redshift without having a live redshift and tests that rely on network connectivity, which feels very awkward. Maybe you have some ideas though Harlan.
Hi Zach, OK, great. I think that managing permissions is an even bigger problem on analytical databases like Redshift than on operational databases. That's certainly true for us.
Yes, Redshift has backported a bunch of stuff from post 8.0.2, including the default privs stuff. I don't think that you'll be able to test Redshift without an actual Redshift instance. I wonder how other open source projects handle testing against non-free databases?
Hey Harlan, I looked around a bit and it seems like people just create another redshift instance and test on that, so that seems like the best path forward. I'm hoping because there isn't much data being added / removed that the cost element would be pretty small since I don't want to discourage people from testing.
I'm hoping that since Redshift appears to support default privileges that there isn't much machinery here that will have to change, i.e. perhaps it's just that the system catalogs are still organized the way Postgres used to be and it's just a few tables that are missing. In that vein, could you try running the below query and tell me what comes back? It's just listing out which of the pg_catalog tables that pgbedrock relies on are missing within Redshift:
SELECT
desired_tables.table_name AS missing_tables
FROM (
VALUES
('pg_auth_members'),
('pg_authid'),
('pg_class'),
('pg_default_acl'),
('pg_depend'),
('pg_namespace'),
('pg_roles')
) AS desired_tables (table_name)
LEFT JOIN information_schema.tables info_tables
ON desired_tables.table_name = info_tables.table_name
AND info_tables.table_schema = 'pg_catalog'
WHERE
info_tables.table_name IS NULL
;
Yeah -- a single-node dc2.large instance is 25 cents an hour, so if there was a spin-up/test/spin-down script for Redshift, it'd be no big deal...
That query doesn't run because Redshift doesn't have VALUES
. This query:
SELECT table_name,
table_type,
table_name in ('pg_auth_members', 'pg_authid', 'pg_class', 'pg_default_acl', 'pg_depend', 'pg_namespace', 'pg_roles') as present
from information_schema.tables info_tables
WHERE info_tables.table_schema = 'pg_catalog'
order by 3 desc
yields:
pg_default_acl BASE TABLE t
pg_class BASE TABLE t
pg_namespace BASE TABLE t
pg_depend BASE TABLE t
pg_conf BASE TABLE f
pg_library BASE TABLE f
pg_shdepend BASE TABLE f
pg_statistic BASE TABLE f
pg_type BASE TABLE f
pg_attribute BASE TABLE f
pg_tablespace BASE TABLE f
pg_inherits BASE TABLE f
pg_index BASE TABLE f
pg_operator BASE TABLE f
pg_opclass BASE TABLE f
pg_am BASE TABLE f
pg_amop BASE TABLE f
pg_amproc BASE TABLE f
pg_language BASE TABLE f
pg_largeobject BASE TABLE f
pg_aggregate BASE TABLE f
pg_statistic_multicol BASE TABLE f
pg_trigger BASE TABLE f
pg_listener BASE TABLE f
pg_cast BASE TABLE f
pg_conversion BASE TABLE f
pg_udf BASE TABLE f
pg_bar_repos BASE TABLE f
pg_bar_state BASE TABLE f
pg_highwatermark BASE TABLE f
pg_resize BASE TABLE f
pg_attrdef BASE TABLE f
pg_constraint BASE TABLE f
pg_database BASE TABLE f
pg_description BASE TABLE f
pg_group BASE TABLE f
pg_proc BASE TABLE f
pg_rewrite BASE TABLE f
pg_bar_ddllog BASE TABLE f
pg_test BASE TABLE f
pg_statistic_indicator BASE TABLE f
pg_user VIEW f
pg_rules VIEW f
pg_views VIEW f
pg_tables VIEW f
pg_table_def VIEW f
pg_indexes VIEW f
pg_stats VIEW f
pg_stat_all_tables VIEW f
pg_stat_sys_tables VIEW f
pg_stat_user_tables VIEW f
pg_statio_all_tables VIEW f
pg_statio_sys_tables VIEW f
pg_statio_user_tables VIEW f
pg_stat_all_indexes VIEW f
pg_stat_sys_indexes VIEW f
pg_stat_user_indexes VIEW f
pg_statio_all_indexes VIEW f
pg_statio_sys_indexes VIEW f
pg_statio_user_indexes VIEW f
pg_statio_all_sequences VIEW f
pg_statio_sys_sequences VIEW f
pg_statio_user_sequences VIEW f
pg_stat_activity VIEW f
pg_stat_database VIEW f
pg_locks VIEW f
pg_user_info VIEW f
pg_database_info VIEW f
pg_settings VIEW f
pg_external_schema BASE TABLE f
stv_wlm_qmr_config BASE TABLE f
stl_rlf_scan BASE TABLE f
stl_internal_query_details BASE TABLE f
svl_awsclient_error VIEW f
svl_s3requests VIEW f
svv_comm_pkt_latency_hist VIEW f
svv_comm_pkt_latency VIEW f
svv_io_latency_read VIEW f
svv_io_latency_sync VIEW f
svv_io_latency_write VIEW f
svv_sem_usage VIEW f
svv_sem_usage_leader VIEW f
svl_user_info VIEW f
padb_config_harvest BASE TABLE f
stl_aggr BASE TABLE f
stl_alert_event_log BASE TABLE f
stl_analyze BASE TABLE f
stl_bcast BASE TABLE f
stl_column_stats BASE TABLE f
stl_commit_stats BASE TABLE f
stl_connection_log BASE TABLE f
stl_ddltext BASE TABLE f
stl_delete BASE TABLE f
stl_disk_full_diag BASE TABLE f
stl_dist BASE TABLE f
stl_error BASE TABLE f
stl_explain BASE TABLE f
stl_file_scan BASE TABLE f
stl_hash BASE TABLE f
stl_hashjoin BASE TABLE f
stl_insert BASE TABLE f
stl_leader_snapshot BASE TABLE f
stl_limit BASE TABLE f
stl_load_commits BASE TABLE f
stl_load_errors BASE TABLE f
stl_load_trace BASE TABLE f
stl_loaderror_detail BASE TABLE f
stl_merge BASE TABLE f
stl_mergejoin BASE TABLE f
stl_metadata_step BASE TABLE f
stl_nestloop BASE TABLE f
stl_parse BASE TABLE f
stl_plan_info BASE TABLE f
stl_project BASE TABLE f
stl_query BASE TABLE f
stl_query_metrics BASE TABLE f
stl_querytext BASE TABLE f
stl_replacements BASE TABLE f
stl_restarted_sessions BASE TABLE f
stl_return BASE TABLE f
stl_rpc BASE TABLE f
stl_s3client BASE TABLE f
stl_s3client_error BASE TABLE f
stl_save BASE TABLE f
stl_scan BASE TABLE f
stl_sessions BASE TABLE f
stl_sort BASE TABLE f
stl_sshclient_error BASE TABLE f
stl_stream_segs BASE TABLE f
stl_tr_conflict BASE TABLE f
stl_udf_compile_error BASE TABLE f
stl_undone BASE TABLE f
stl_unique BASE TABLE f
stl_unload_log BASE TABLE f
stl_userlog BASE TABLE f
stl_utilitytext BASE TABLE f
stl_vacuum BASE TABLE f
stl_vacuum_detail BASE TABLE f
stl_warning BASE TABLE f
stl_window BASE TABLE f
stl_wlm_error BASE TABLE f
stl_wlm_query BASE TABLE f
stl_wlm_rule_action BASE TABLE f
stv_active_cursors BASE TABLE f
stv_block_reps BASE TABLE f
stv_blocklist BASE TABLE f
stv_cursor_configuration BASE TABLE f
stv_disk_extents BASE TABLE f
stv_exec_state BASE TABLE f
stv_inflight BASE TABLE f
stv_interleaved_counts BASE TABLE f
stv_load_state BASE TABLE f
stv_locks BASE TABLE f
stv_partitions BASE TABLE f
stv_pg_wal_length BASE TABLE f
stv_proc_stat BASE TABLE f
stv_query_metrics BASE TABLE f
stv_query_stats BASE TABLE f
stv_recents BASE TABLE f
stv_sessions BASE TABLE f
stv_slices BASE TABLE f
stv_startup_recovery_state BASE TABLE f
stv_tbl_perm BASE TABLE f
stv_tbl_trans BASE TABLE f
stv_underrepped_blocks BASE TABLE f
stv_wlm_classification_config BASE TABLE f
stv_wlm_query_queue_state BASE TABLE f
stv_wlm_query_state BASE TABLE f
stv_wlm_query_task_state BASE TABLE f
stv_wlm_service_class_config BASE TABLE f
stv_wlm_service_class_state BASE TABLE f
svl_compile VIEW f
svl_qlog VIEW f
svl_query_metrics VIEW f
svl_query_metrics_summary VIEW f
svl_query_queue_info VIEW f
svl_query_report VIEW f
svl_query_summary VIEW f
svl_s3catalog VIEW f
svl_s3list VIEW f
svl_s3log VIEW f
svl_s3partition VIEW f
svl_s3partition_summary VIEW f
svl_s3query VIEW f
svl_s3query_summary VIEW f
svl_s3retries VIEW f
svl_statementtext VIEW f
svl_udf_log VIEW f
svl_vacuum_percentage VIEW f
svv_columns VIEW f
svv_diskusage VIEW f
svv_external_columns VIEW f
svv_external_databases VIEW f
svv_external_partitions VIEW f
svv_external_schemas VIEW f
svv_external_tables VIEW f
svv_interleaved_columns VIEW f
svv_query_inflight VIEW f
svv_query_state VIEW f
svv_restore_table_state VIEW f
svv_table_info VIEW f
svv_tables VIEW f
svv_transactions VIEW f
svv_vacuum_progress VIEW f
svv_vacuum_summary VIEW f
systable_globaldict BASE TABLE f
systable_schema BASE TABLE f
systable_topology BASE TABLE f
Interesting. So it looks like it's missing three tables that pgbedrock uses:
pg_roles
- this doesn't really need to be used; we should just swap out the one reference to this for pg_authid
insteadpg_auth_members
- it looks like this could be replaced by pg_group
(Postgres docs)pg_authid
- This looks like the most challenging relation to swap out, for a few reasons:
pg_user
(Postgres docs) differ, so attributes.py
would need to behave differently. DROP USER foo;
and then CREATE GROUP foo;
. Similarly, if someone wanted to modify some attributes of a group role (e.g. give it a connection limit) I think that wouldn't work, so in the Redshift implementation we'd have to put additional schema validation in against specs to explain what is and isn't possible.pg_authid
, but now we'd need to decide how to work across two tables: looking up non-login roles in pg_group
and looking up login roles in pg_user
.In addition, we'd need to figure out how to test against Redshift, i.e. potentially add custom markers listing tests that should only run on Redshift and others that should only run on Postgres, then add the ability to swap out the test URL so it can hit a real Redshift.
Again, I'm not sure how highly prioritized this will get at Squarespace as we don't really use Redshift, but it seems like this is doable if someone who uses Redshift would want to take the lead on trying to flesh this out with my help.
Huge thanks for doing this research, Zach! I think this gives a big head-start to whoever can devote some time to this. (Hoping my colleague @hanleyhansen might be interested...? :) )
So I'm hacking away at this with the goal to make my forked repo Redshift-compatible (but in the process, I'm losing Postgres compatibility).
I'm pretty new to python, so I'm using this as an exercise in teaching myself, since the subject matter (Redshift user permissions) is very familiar.
As such, I don't think any of the code that I produce will be pull-request-standard. If I get my forked repo working for Redshift, I'm happy to share it around, but I don't feel I'll be the right person to introduce multi-database compatibility. Nonetheless, what I'm learning along the way will be very valuable to the person that does take this issue on.
I haven't got it working yet, and I don't imagine I'll get another chance to look at it for at least a week, so just wanted to share my current progress in case someone is looking at this soon.
So far the noteworthy challenges are:
pg_auth_members
, instead pg_groups
stores group memberships as an array, returning tuples of (group, members)
, e.g.:groname | grosysid | grolist |
---|---|---|
webpowerusers | 101 | |
webdevusers | 102 | |
webappusers | 100 | 102, 103 |
grolist
is an array (according to information_schema.columns
), which is weird, because arrays aren't supported in Redshiftpg_user.usesysid
and pg_group.gro_list
to join them together (yuk!), but I couldn't do Regex on the array type. And then because arrays are not a supported data type, it won't let me cast it to a string ¯_(ツ)_/¯pgbedrock
as arrays, and then I'll have to use python to unnest the array, to get the tuples of (member, group)
.grant
ing membership of a group to a role (user), in Redshift, you have to alter group "group_name" add user "user_name"
- a bit strange but nothing too bad.users
and groups
as separate objects, compared to postgres which uses roles
.can_login:
in the config with a required field, role_type: (user/group)
.role_type == group
, certain configurations cannot exist (e.g. for owns:
, role_type
must equal user
, since group
s cannot own tables/schemas in Redshift).createuser
privilege, so createuser
should not be able to be defined in attributes
@clrcrl have you made any progress on this? I might be interested in collaborating.
I'm pretty late to the party here, but have there been any developments with Redshift compatibility?
+1
Redshift is based on Postgres 8.0.2, so a few things are different. Maybe this issue should be a list of known issues?
pg_authid does not exist
It looks like it was introduced in 8.1: