anchordotdev / cli

MIT License

Anchor lcl audit crashes #5

Closed jscarrott closed 8 months ago

jscarrott commented 8 months ago

Trying to init lcl.host, the audit step crashes. (image)

geemus commented 8 months ago

Thanks for the report, sorry to hear you are running into problems.

My guess is that we still have some edge cases around how we are scanning the local trust stores that we haven't caught yet, but I could use a little more context from you to narrow it down.

Could you share a bit more about your setup? In particular, I think these would be helpful:

Hopefully from there we can pin it down pretty quickly and get this fixed. Thanks!

jscarrott commented 8 months ago

Ubuntu 23.10, and yes, I have Firefox installed. I have probably messed around with local trust stores at some point as well, to get dev environments to have valid certs...

geemus commented 8 months ago

Ok, still a work in progress but I wanted to share where I'm at and see if it will help.

First, as you can see here we have some places that don't return the nicest/most helpful errors. I'm working on a patch for this, so that we should be able to return much more helpful messages in future versions.

That isn't very helpful to you presently though. From reviewing the code and error handling, it seems pretty likely to me that you are seeing a certutil bad database error. Behind the scenes, when we want to audit certs, we ask for a list using: certutil -L -d PATH. You can see what a bad database error looks like by passing a path that isn't a real database as the argument; for instance, I used test.

I think we are detecting a path on your machine that we think should be valid, but when we pass it to certutil, certutil disagrees. I can't say exactly why this would be the case, but if you've manipulated local stores in the past, something in your changes might be the culprit.
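To make the check described above concrete, here is a minimal sketch of running the same `certutil -L -d PATH` listing against several candidate directories and collecting the ones certutil rejects. The function names (`list_command`, `find_failing_dbs`) and the injectable `run` parameter are illustrative assumptions, not how the anchor CLI is actually written.

```python
import subprocess


def list_command(db_dir):
    """Build the `certutil -L -d PATH` invocation quoted in the thread."""
    return ["certutil", "-L", "-d", str(db_dir)]


def find_failing_dbs(db_dirs, run=subprocess.run):
    """Return (path, stderr) for every directory certutil refuses to list.

    `run` is injectable so the logic can be exercised without certutil
    installed; by default it is subprocess.run.
    """
    failures = []
    for db_dir in db_dirs:
        result = run(list_command(db_dir), capture_output=True, text=True)
        if result.returncode != 0:
            # Keep the path next to the message so the offending DB is obvious.
            failures.append((str(db_dir), result.stderr.strip()))
    return failures
```

Calling `find_failing_dbs([...])` with the directories you suspect (for example the expanded `~/.pki/nssdb`) should print nothing for healthy databases and name the path for any that certutil cannot open. It assumes certutil is on PATH.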

As an example, I'm running Ubuntu 20.04 with firefox, and these are the paths we find/use on my machine:

If I had to guess, it's likely that your ~/.pki/nssdb is where the problem lies. Here are some good next steps:

I think/hope that this can get you back on track, and I'll work on getting the improved error messages in soon also. Do let me know if that is unclear or you have further questions or feedback though. Thanks!

jscarrott commented 8 months ago

This ~/.pki/nssdb seems okay when I query it, and I still get the same failure if I rebuild it. Using fd to find cert8.db files, I have a fair few dotted around. I rebuilt some likely candidates with no luck; with some better error logs, hopefully I will find the offending DB.
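For anyone without fd installed, a rough equivalent of that search can be sketched in a few lines. This walks a directory tree and reports every directory holding an NSS certificate database file. Matching on `cert9.db` as well as `cert8.db` is an addition beyond what was searched for above (it is the file name NSS uses for its newer SQLite format), and `find_nss_db_dirs` is a hypothetical helper name.

```python
import os

# cert8.db is the legacy format mentioned in the thread; cert9.db is the
# SQLite-format equivalent and is included here as an assumption.
NSS_DB_FILENAMES = {"cert8.db", "cert9.db"}


def find_nss_db_dirs(root):
    """Return a sorted list of directories under root that contain an NSS cert DB."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if NSS_DB_FILENAMES.intersection(filenames):
            found.append(dirpath)
    return sorted(found)
```

Running `find_nss_db_dirs(os.path.expanduser("~"))` gives a list of candidate directories that can each be passed to `certutil -L -d`.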

geemus commented 8 months ago

@jscarrott Thanks for giving that a shot, I was hoping we could get you fixed ASAP instead of waiting for a release, but it definitely sounds like waiting will be easier. I hope to get that out pretty soon and will plan to update the ticket to let you know. From my testing it looks like it should tell us specifically which db it's trying to operate on that gives the error, so it should hopefully be MUCH easier for you to rebuild or get rid of the offending one.

geemus commented 8 months ago

Ok, I just released v0.0.18 with those error reporting improvements.

You should be able to update by running: brew update && brew upgrade anchordotdev/tap/anchor.

I believe that should give you a much more specific error which includes the key to the database causing the problems, at least if my guesses about what's happening are accurate. Hopefully rebuilding that will get everything working, so please let me know what you see. Thanks!

jscarrott commented 8 months ago

Ahh, that gives me much more information. In my Firefox database I had added the Let's Encrypt staging certs (a really good idea, I know), and this seems to cause the error. Removing these from the DB got past this issue. I did have to use strace to work out exactly which DB was causing the issue though, so maybe the error reporting needs a bit more info.

(image)

jscarrott commented 8 months ago

It now hangs while installing the anchor CA, with no error logging.

geemus commented 8 months ago

Oof, well at least it seems like we are making some progress. Sorry again to hear it's not getting you all the way through yet.

I can see a bit more room for improvement on our error messaging, so I'll work on that as you suggest.

It's not so obvious to me from reviewing the code where/why you would be getting a hang though (that seems unrelated to the error messaging stuff, unless I'm just missing something). Could you share your output on a hang so I can narrow down where exactly in the flow it might be?

jscarrott commented 8 months ago

I'm not at my machine but it's right after it says it needs to install two certificates. It asks for the db password or pin then just hangs.

geemus commented 8 months ago

@jscarrott No worries. That was my guess about what you meant, but I just wanted to confirm. Thanks.

geemus commented 8 months ago

In many cases, the point where you are freezing is where we would expect you might encounter a sudo prompt (if permissions for things require it). As a means of seeing if it gets us anywhere different and/or with better output, could you try this command, which should more directly jump to where you are pausing (you'll need to substitute in your username on anchor.dev):

anchor trust --org USERNAME --realm localhost --no-sudo

For me, I would normally see a sudo prompt where you hang, and with the no-sudo setting I instead get a tee error trying to update a file. I'm hoping this should help isolate whether it's failing to sudo prompt you or if something else is hanging, and then we can hopefully go from there.

geemus commented 8 months ago

Oh wait, actually I think I had overlooked what might be a crucial detail here. You said it's asking for the db password or pin; that must be from certutil. I don't think we had tested against cases where one or more of the certdbs had passwords, and I think from how things are set up, certutil might then wait for input on stdin, and stdin isn't wired up to do anything. If that is the case, it could explain the hang. Did you rebuild any DBs earlier when we were working on this, or just remove certs? If you rebuilt them, it's possible we should have used the --empty-password flag; leaving it off looks like it could cause issues like this.
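The suspected failure mode (a child process silently waiting for a password that never arrives) can be guarded against in a generic way. The sketch below is not the anchor CLI's code: it runs a command with stdin closed and, as a backstop in case the tool prompts on the terminal instead of stdin, applies a timeout so a hang turns into a reported error. `run_noninteractive` and its `timeout_seconds` parameter are hypothetical names.

```python
import subprocess


def run_noninteractive(cmd, timeout_seconds=10.0):
    """Run cmd with stdin closed and a timeout; return (ok, message)."""
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,  # nothing to read, so no waiting on stdin
            capture_output=True,
            text=True,
            timeout=timeout_seconds,   # backstop if the tool prompts elsewhere
        )
    except subprocess.TimeoutExpired:
        return False, (
            f"timed out after {timeout_seconds}s "
            f"(possibly waiting for a password): {' '.join(cmd)}"
        )
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout
```

With a wrapper like this, a password-protected database would surface as a timeout or a non-zero exit that names the command, rather than an unexplained hang. A missing executable still raises `FileNotFoundError`, which the caller would need to handle.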

jscarrott commented 8 months ago

Yep that fixed it (removing the password)

geemus commented 8 months ago

Awesome, thanks again for working through that.

I'm going to close this for now since it sounds like we've resolved your problems. Please let us know if you run into any other issues or have further feedback. Thanks!