Closed mjgiarlo closed 5 years ago
I'm jcoyne
I'm petucket
(yes, different than my github username, but the same as my sunet which had to be shorter)
jlittman
@aaron-collier is amcollie
@rsmith11 I looked at the root Trellis ACL resource in all three environments, and found the following:
sinopia_acl
on June 18th (and pushed with latest
tag to DockerHub on the same day) were granted control access. that makes me wonder if the bin/migrate
job scheduled for yesterday ran, or if it generated errors.bin/migrate
has been successfully run in this environment, since the root ACL resources allows the world full access to the entire repository. (The container structure is there, though, so perhaps the migrate task is running and erroring out on the ACL changes?)The image running in development has been running since 5/28, so I don't think CI is working on the sinopia_acl code. I can do a redeploy in AWS which should update the image, then manually run the migrate script to pickup the changes.
@rsmith11 Good :eye:. In fact, the sinopia_acl CircleCI config does not refresh the dev infrastructure. Cool if we set that up? Do we want to do that, @LD4P/sinopia?
As for production, the sinopia-devs-client-tester account didn't exist. I created it, but I had to create a new password because the same password from dev/stage doesn't conform to the production password policy.
All 3 environments have had the task updated to the latest, and the acl task manually run.
@rsmith11 Huh, that's weird. Still didn't pick up the new accounts. I'm going to try to run bin/migrate
manually against dev to see if that helps troubleshoot.
@rsmith11 When I run the task (from my laptop) against the dev environment, it appears to succeed, but on subsequent requests (to view the root container ACLs), I get an HTTP 403 error. cc: @LD4P/sinopia
I'm going to check back on this next Monday after the scheduled task runs again.
@rsmith11 Here's where things stand on this issue.
The root ACLs in dev look good, and the weekly migrate task is running without error. The same is not true in stage or prod, both of which give me a 403 error. In CloudWatch logs for staging Trellis, the error is:
User: https://cognito-idp.us-west-2.amazonaws.com/us-west-2_ilMQW0M0R/711cc7fd-cb40-42c5-b768-e2af0acc27a6 cannot Control
That WebID does indeed correspond to my staging Cognito account. But the URI does not match what's in the staging Trellis instance, which is of the form: https://sinopia-staging.auth.us-west-2.amazoncognito.com/us-west-2_ilMQWM0R/711cc7fd-cb40-42c5-b768-e2af0acc27a6
Whereas dev trellis is using the correct form of the webid (with cognito-idp.us-west-2.amazonaws.com
as the hostname, not sinopia-staging.auth.us-west-2.amazoncognito.com
)
I think this means we're effectively locked out of stage and prod.
We can hand-fix the values in stage's PG instance underlying Trellis, I think. Not sure how best to handle prod, or if this is the right strategy for stage.
Additionally, the weekly ACL migration task is erroring in stage with WARN [2019-07-07 04:01:58,880] org.trellisldp.http.WebAcFilter: User: <http://www.trellisldp.org/ns/trellis#AnonymousAgent> cannot Write
so we can't rely on it fixing us up. (And I'm not entirely sure why Trellis is seeing that request as an anonymous agent. Should be the cognito admin user, which appears to be working OK in dev.)
Ideally, @rsmith11, you and I'd be able to pair on this but our vacations overlap, so the soonest we could do this is 7/29. I suppose this can wait until we're both back in the office, though? It's not blocking or breaking any other work as far as I can tell. Alternatively, someone else on the team can pair w/ you on this while I'm out. I think we should have more coverage on these back-end issues. Cc: @jmartin-sul @aaron-collier @jermnelson
@mjgiarlo, @rsmith11, @jermnelson: i could probably help troubleshoot this, and would at least be happy to try, assuming we want to get to it before 7/29. i'm a little less confident in my ability to navigate deployed services on AWS than in my ability to troubleshoot the ACL stuff in general, but maybe @aaron-collier could give me a hand with the AWS part if we wanted to get to this sooner?
@jmartin-sul I'm happy to pair/clump/whatevs on this tomorrow to at least do knowledge transfer if helpful.
thx!
attempted to curl the ACL URL from my laptop and got 401 unauthorized regardless of whether i was hitting the URL as an anonymous user or as an admin user (by providing a JWT for an admin user, obtained per the approach described in the sinopia_acl readme). in the latter case, @mjgiarlo pointed out that we'd actually expect a 403 response. so that was slightly surprising.
i then SSHed into the bastion host and used psql
to look at the acl
table in RDS. per @mjgiarlo's earlier investigation, the WebID URLs in the table don't have the hostname that's actually being sent along by the JWT. so it appears there's still something to be fixed in the way we get the WebID for setting in the ACL.
so i think the next thing to do is to see whether something is wrong with our username to WebID translation, which seems to do the right thing in dev, but not in prod or stage.
i believe i've remediated the existing webID values in stage to have the correct hostname. i used the following sql:
UPDATE acl
SET object = replace(object, 'https://sinopia-staging.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com')
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';
@rsmith11, i don't think i have production write access, so i think you or someone else in ops will have to run the above from the prod bastion host against the prod RDS instance using e.g. psql? i just opened a psql console and pasted that update
statement in and ran it.
i used the following select
statements to test my logic before running the aforementioned update
:
# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';
# what things would look like after update
SELECT subject,
predicate,
replace(object, 'https://sinopia-staging.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com') AS remediated_object
FROM acl
WHERE object like 'https://sinopia-staging.auth.us-west-2.amazoncognito.com%';
i guess you could also just combine that into one statement with the unaltered object
col from the first, and the remediated_object
col from the second.
note that in this schema object
is not a complex object, but a string, representing the object of a subject -> predicate -> object triple.
i've remediated the existing data on prod. that took a slight adjustment to the above sql (which looked for stage specific hostnames). it also took one extra step, to clean up some URLs that started with https://https://
(instead of just https://
). i recall @mjgiarlo mentioning that issue, though i don't see mention of that here after a cursory look over the issue (possibly was only mentioned in slack).
the queries and updates i used were:
# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://https://%';
# what things would look like after update
SELECT subject,
predicate,
replace(object, 'https://https://', 'https://') AS remediated_object
FROM acl
WHERE object like 'https://https://%';
# get rid of extra scheme portion in URL
UPDATE acl
SET object = replace(object, 'https://https://', 'https://')
WHERE object like 'https://https://%';
# what things look like now
SELECT subject, predicate, object
FROM acl
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';
# what things would look like after update
SELECT subject,
predicate,
replace(object, 'https://sinopia-production.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com') AS remediated_object
FROM acl
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';
# clean up incorrect hostname in webID URLs
UPDATE acl
SET object = replace(object, 'https://sinopia-production.auth.us-west-2.amazoncognito.com', 'https://cognito-idp.us-west-2.amazonaws.com')
WHERE object like 'https://sinopia-production.auth.us-west-2.amazoncognito.com%';
the COGNITO_DOMAIN
that was being used to construct WebID URLs for ACLs was not actually the domain we wanted for that use. that domain was the host we specified that the user would be redirected to when we were using the amazon cognito provided login page. this is different from the domain that's used in the iss
field of the JWT, which is not something we're allowed to specify. this discrepancy is why our ACL webID URLs weren't matching the webID URLs constructed from the iss
and sub
fields of the JWTs being minted for users.
it turns out that the entirety of the iss
field can be made available to the container as an environment variable (the via endpoint
field of the user pool variable in terraform). the terraform PR for providing that has been merged, and the terraform changes have been applied. there's also an outstanding sinopia_acl PR to make use of that (and to fallback to constructing the current form of that URL based on aws region and user pool id, which are already correctly provided to the container). see https://github.com/LD4P/sinopia_acl/pull/80
once that PR is merged, and the code is deployed and running without issue, we should be able to get rid of the cognito domain entirely (both the env var that we provide to the sinopia_acl container, and the underlying setting in AWS). this is because we now handle login entirely within the sinopia_editor application, using the AWS amplify SDK. users no longer get redirected to an AWS hosted login page.
so, left to do:
COGNITO_DOMAIN
env var for the sinopia_acl container and the underlying terraform variable (cognito_domain
).still not quite working as expected on stage.
here's what looks right:
suntzu
), i can curl
the acl for the root resource and get it back successfully:
$ curl -i -H "Accept:application/ld+json" -H "Authorization: Bearer $(cat .cognitoToken)" https://trellis.stage.sinopia.io/?ext=acl
HTTP/2 200
...
if i get a cognito token for a stage user who doesn't have control
privileges, i get a 403
forbidden when i try to curl
the root ACL. this also seems right to me.
however, not all the ACL entries i'd expect are present on stage. in particular, these are missing: https://github.com/LD4P/sinopia_acl/blob/master/config/default.js#L15-L18
jcoyne
petucket
jlittman
amcollie
(maybe a red herring, but i found it odd that those are the last four in the list, and are contiguous. maybe only the first write took? maybe the user doing the update now from AWS doesn't actually have permission in the current stage trellis root ACL?)
all of those users exist in the stage cognito user pool, but none of their webIDs are listed in the ACL on stage at the moment. i am stumped at the moment as to why they aren't being added to the ACL. nothing in the code that builds or writes the ACL jumps out at me as the culprit.
next up, i can try running the client locally on my laptop against stage trellis, to see if there might be other things i can discover from that, esp w/ some debug logging.
some useful links for when i get back to this:
i suspect that this oddity:
as was the case on dev, stage and prod didn't get bad values set once the bad data was remediated to use the correct URLs. still unsure why this is the case, though i'd like to figure it out for my own understanding.
was a result of the specified admin user not actually being an admin user, per https://github.com/LD4P/sinopia_acl/issues/83.
since this issue has drifted pretty far from its original intent, i'm closing it in favor of something more specific in the applicable codebase's repo: https://github.com/LD4P/sinopia_acl/issues/83
feel free to re-open if that seems erroneous, and please add any detail i may have missed that should be captured on the new ticket.
Please sign up for the three env-specific accounts below your name and check them off as you go.
When each person has checked off all their boxes, please ping @mjgiarlo with the username you've selected.
[x] @peetucket
[x] @jcoyne
[x] @justinlittman
[x] @aaron-collier
[ ] @mjgiarlo