LD4P / sinopia

LD4P Sinopia Project repo to hold docs, general issues, schemas, and related spec docs.
https://ld4p.github.io/sinopia/
19 stars 3 forks source link

Ensure ACL auto-update happens (was: Sign up for Cognito accounts) #224

Closed mjgiarlo closed 5 years ago

mjgiarlo commented 5 years ago

Please sign up for the three env-specific accounts below your name and check them off as you go.

When each person has checked off all their boxes, please ping @mjgiarlo with the username you've selected.

jcoyne commented 5 years ago

I'm jcoyne

peetucket commented 5 years ago

I'm petucket (yes, different than my github username, but the same as my sunet which had to be shorter)

justinlittman commented 5 years ago

jlittman

mjgiarlo commented 5 years ago

@aaron-collier is amcollie

mjgiarlo commented 5 years ago

@rsmith11 I looked at the root Trellis ACL resource in all three environments, and found the following:

rsmith11 commented 5 years ago

The image running in development has been running since 5/28, so I don't think CI is working on the sinopia_acl code. I can do a redeploy in AWS which should update the image, then manually run the migrate script to pickup the changes.

mjgiarlo commented 5 years ago

@rsmith11 Good :eye:. In fact, the sinopia_acl CircleCI config does not refresh the dev infrastructure. Cool if we set that up? Do we want to do that, @LD4P/sinopia?

rsmith11 commented 5 years ago

As for production, the sinopia-devs-client-tester account didn't exist. I created it, but I had to create a new password because the same password from dev/stage doesn't conform to the production password policy.

rsmith11 commented 5 years ago

All 3 environments have had the task updated to the latest, and the acl task manually run.

mjgiarlo commented 5 years ago

@rsmith11 Huh, that's weird. Still didn't pick up the new accounts. I'm going to try to run bin/migrate manually against dev to see if that helps troubleshoot.

mjgiarlo commented 5 years ago

@rsmith11 When I run the task (from my laptop) against the dev environment, it appears to succeed, but on subsequent requests (to view the root container ACLs), I get an HTTP 403 error. cc: @LD4P/sinopia

mjgiarlo commented 5 years ago

I'm going to check back on this next Monday after the scheduled task runs again.

mjgiarlo commented 5 years ago

@rsmith11 Here's where things stand on this issue.

The root ACLs in dev look good, and the weekly migrate task is running without error. The same is not true in stage or prod, both of which give me a 403 error. In CloudWatch logs for staging Trellis, the error is:

User: https://cognito-idp.us-west-2.amazonaws.com/us-west-2_ilMQW0M0R/711cc7fd-cb40-42c5-b768-e2af0acc27a6 cannot Control

That WebID does indeed correspond to my staging Cognito account. But the URI does not match what's in the staging Trellis instance, which is of the form: https://sinopia-staging.auth.us-west-2.amazoncognito.com/us-west-2_ilMQWM0R/711cc7fd-cb40-42c5-b768-e2af0acc27a6

Whereas dev trellis is using the correct form of the webid (with cognito-idp.us-west-2.amazonaws.com as the hostname, not sinopia-staging.auth.us-west-2.amazoncognito.com)

I think this means we're effectively locked out of stage and prod.

We can hand-fix the values in stage's PG instance underlying Trellis, I think. Not sure how best to handle prod, or if this is the right strategy for stage.

Additionally, the weekly ACL migration task is erroring in stage with WARN [2019-07-07 04:01:58,880] org.trellisldp.http.WebAcFilter: User: <http://www.trellisldp.org/ns/trellis#AnonymousAgent> cannot Write so we can't rely on it fixing us up. (And I'm not entirely sure why Trellis is seeing that request as an anonymous agent. Should be the cognito admin user, which appears to be working OK in dev.)

Ideally, @rsmith11, you and I'd be able to pair on this but our vacations overlap, so the soonest we could do this is 7/29. I suppose this can wait until we're both back in the office, though? It's not blocking or breaking any other work as far as I can tell. Alternatively, someone else on the team can pair w/ you on this while I'm out. I think we should have more coverage on these back-end issues. Cc: @jmartin-sul @aaron-collier @jermnelson

jmartin-sul commented 5 years ago

@mjgiarlo, @rsmith11, @jermnelson: i could probably help troubleshoot this, and would at least be happy to try, assuming we want to get to it before 7/29. i'm a little less confident in my ability to navigate deployed services on AWS than in my ability to troubleshoot the ACL stuff in general, but maybe @aaron-collier could give me a hand with the AWS part if we wanted to get to this sooner?

mjgiarlo commented 5 years ago

@jmartin-sul I'm happy to pair/clump/whatevs on this tomorrow to at least do knowledge transfer if helpful.

thx!

jmartin-sul commented 5 years ago

attempted to curl the ACL URL from my laptop and got 401 unauthorized regardless of whether i was hitting the URL as an anonymous user or as an admin user (by providing a JWT for an admin user, obtained per the approach described in the sinopia_acl readme). in the latter case, @mjgiarlo pointed out that we'd actually expect a 403 response. so that was slightly surprising.

i then SSHed into the bastion host and used psql to look at the acl table in RDS. per @mjgiarlo's earlier investigation, the WebID URLs in the table don't have the hostname that's actually being sent along by the JWT. so it appears there's still something to be fixed in the way we get the WebID for setting in the ACL.

so i think the next thing to do is to see whether something is wrong with our username to WebID translation, which seems to do the right thing in dev, but not in prod or stage.

jmartin-sul commented 5 years ago

i believe i've remediated the existing webID values in stage to have the correct hostname. i used the following sql:

UPDATE acl
SET object = replace(object, 'https://sinopia-staging.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com')
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';

@rsmith11, i don't think i have production write access, so i think you or someone else in ops will have to run the above from the prod bastion host against the prod RDS instance using e.g. psql? i just opened a psql console and pasted that update statement in and ran it.

jmartin-sul commented 5 years ago

i used the following select statements to test my logic before running the aforementioned update:

# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';

# what things would look like after update
SELECT subject,
  predicate,
  replace(object, 'https://sinopia-staging.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com') AS remediated_object
FROM acl
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';

i guess you could also just combine that into one statement with the unaltered object col from the first, and the remediated_object col from the second.

note that in this schema object is not a complex object, but a string, representing the object of a subject -> predicate -> object triple.

jmartin-sul commented 5 years ago

i've remediated the existing data on prod. that took a slight adjustment to the above sql (which looked for stage specific hostnames). it also took one extra step, to clean up some URLs that started with https://https:// (instead of just https://). i recall @mjgiarlo mentioning that issue, though i don't see mention of that here after a cursory look over the issue (possibly was only mentioned in slack).

the queries and updates i used were:

# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://https://%';

# what things would look like after update
SELECT subject,
  predicate,
  replace(object, 'https://https://', 'https://') AS remediated_object
FROM acl
WHERE object like 'https://https://%';

# get rid of extra scheme portion in URL
UPDATE acl
SET object = replace(object, 'https://https://', 'https://')
WHERE object like 'https://https://%';

# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';

# what things would look like after update
SELECT subject,
  predicate,
  replace(object, 'https://sinopia-production.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com') AS remediated_object
FROM acl
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';

# clean up incorrect hostname in webID URLs
UPDATE acl
SET object = replace(object, 'https://sinopia-production.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com')
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';
jmartin-sul commented 5 years ago

the COGNITO_DOMAIN that was being used to construct WebID URLs for ACLs was not actually the domain we wanted for that use. that domain was the host we specified that the user would be redirected to when we were using the amazon cognito provided login page. this is different from the domain that's used in the iss field of the JWT, which is not something we're allowed to specify. this discrepancy is why our ACL webID URLs weren't matching the webID URLs constructed from the iss and sub fields of the JWTs being minted for users.

it turns out that the entirety of the iss field can be made available to the container as an environment variable (the via endpoint field of the user pool variable in terraform). the terraform PR for providing that has been merged, and the terraform changes have been applied. there's also an outstanding sinopia_acl PR to make use of that (and to fallback to constructing the current form of that URL based on aws region and user pool id, which are already correctly provided to the container). see https://github.com/LD4P/sinopia_acl/pull/80

once that PR is merged, and the code is deployed and running without issue, we should be able to get rid of the cognito domain entirely (both the env var that we provide to the sinopia_acl container, and the underlying setting in AWS). this is because we now handle login entirely within the sinopia_editor application, using the AWS amplify SDK. users no longer get redirected to an AWS hosted login page.

so, left to do:

jmartin-sul commented 5 years ago

still not quite working as expected on stage.

here's what looks right:

if i get a cognito token for a stage user who doesn't have control privileges, i get a 403 forbidden when i try to curl the root ACL. this also seems right to me.

however, not all the ACL entries i'd expect are present on stage. in particular, these are missing: https://github.com/LD4P/sinopia_acl/blob/master/config/default.js#L15-L18

(maybe a red herring, but i found it odd that those are the last four in the list, and are contiguous. maybe only the first write took? maybe the user doing the update now from AWS doesn't actually have permission in the current stage trellis root ACL?)

all of those users exist in the stage cognito user pool, but none of their webIDs are listed in the ACL on stage at the moment. i am stumped at the moment as to why they aren't being added to the ACL. nothing in the code that builds or writes the ACL jumps out at me as the culprit.

next up, i can try running the client locally on my laptop against stage trellis, to see if there might be other things i can discover from that, esp w/ some debug logging.

some useful links for when i get back to this:

jmartin-sul commented 5 years ago

i suspect that this oddity:

as was the case on dev, stage and prod didn't get bad values set once the bad data was remediated to use the correct URLs. still unsure why this is the case, though i'd like to figure it out for my own understanding.

was a result of the specified admin user not actually being an admin user, per https://github.com/LD4P/sinopia_acl/issues/83.

jmartin-sul commented 5 years ago

since this issue has drifted pretty far from its original intent, i'm closing it in favor of something more specific in the applicable codebase's repo: https://github.com/LD4P/sinopia_acl/issues/83

feel free to re-open if that seems erroneous, and please add any detail i may have missed that should be captured on the new ticket.