Open hydra.nixos.org database read replica

RaitoBezarius commented 4 months ago

Is your feature request related to a problem? Please describe.

hydra.nixos.org is private because it contains PII and it's weird to expose a webapp database publicly, I suppose.

Also, it is first and foremost for Hydra's operations, e.g. CI, etc.

But Hydra's data is very valuable for a lot of automations of our ecosystem: Haskell updates, staging, and even more which does not exist because it can be hard to stabilize a hydra.nixos.org (not a Hydra) tooling.

Describe the solution you'd like

(Logically?) replicating the PostgreSQL database to replica.db.hydra.nixos.org or readonly.hydra.nixos.org for example while prioritizing the main instance.

Describe alternatives you've considered

continue scraping Hydra
continue hammering Hydra's API
having no relevant automation based on the CI

Additional context

I am available to conduct any nixpkgs work to make the work easier if necessary and present it in the form of a patch to apply to a stable release of NixOS 23.11 (or whatever current version is the stable one).

delroth commented 4 months ago

Given that the main issue isn't really how expensive DB queries are but how bad the Hydra API is, what would you think instead of serving this directly from the primary Hydra DB as a less privileged user with an allowlist of tables/columns? This would be significantly less setup, at the cost of less resource isolation.

Other alternatives not mentioned in your first post:

Maintaining a curated list of DB views / stored procedures based on automation use cases.
Improving the Hydra API to actually provide the information that automation users want.
Building a separate API frontend for automation users written in a programming language that's not widely considered to be unreadable and obsolete.

RaitoBezarius commented 4 months ago

Given that the main issue isn't really how expensive DB queries

Is it? Even with the read replica, I can fire queries that can hog some resources… I think that also have to do with the (a) whole database scheme (b) database maintenance (c) lack of use of various tools to handle "large scale data" over time (partitions etc.).

If you are still convinced that DB queries are not expensive (fine by me :p!), we can start with a less privileged user, what would be the process in your eyes to get access to it? And it makes sense for me to have an allowlist, I could find my SQL stuff to bestow permissions on the read-only instance.

Do you want me to add your suggestions of alternatives back up?

delroth commented 4 months ago

I'm not saying you can't design a SQL query to DoS the instance, I'm saying that most queries that would be useful probably won't be too expensive.

Personally I'm still of the opinion that the better approach would be to improve Hydra's API. It's unlikely we'd want to provide unauth'd/anonymous access to a Postgres instance, even a readonly replica - I'd be worried about possibly security implications of this. Which means we'd still want to have some signup process, which is more cumbersome than a properly working open API. However I don't think anyone has really done the work of figuring out what would be needed to make a Hydra API more useful. I suspect "not that much", tbf.

RaitoBezarius commented 4 months ago

Ack @ first paragraph.

I do agree with the target of improving the Hydra API, I am not sure I agree with the security issues: what are the concerns concretely? Having a read replica owned does not seem to be that big of a deal as long as everyone understands this data could be manipulated in case of an attack. Alternatively, we could offer regular data exports, IIRC, Repology just open the PostgreSQL, crt.sh does the same.

Finally, it's possible that someone could find a way to make the Hydra API so much better that this whole proposal is short circuited, I do not have the mental fortitude to peek at the Perl code and go through all those hoops to what I would fear to be a disappointing result, but also, I do not have the mental fortitude to rewrite the Hydra API backend in a more comfortable language and go through all those hoops too.

That's why speaking SQL seems to be, in the end, the easiest solution with the most value, which also does not preclude improving the API.

If that's necessary, the DB access could be made available to community members who could be trusted only for the time being.

delroth commented 4 months ago

I am not sure I agree with the security issues: what are the concerns concretely? Having a read replica owned does not seem to be that big of a deal as long as everyone understands this data could be manipulated in case of an attack.

Maybe my security model here is miscalibrated. I'm worried that PgSQL isn't particularly designed for the case of a malicious actor having direct query access. And the risks I'm worried about are access to the user data (assuming we actually have a full replica, not just partial) as well as code execution from a machine we're responsible for - with potential escalation/pivoting risks from there. Your examples suggest that maybe this isn't the case - or at least that some people though this isn't the case :)

Is there a reason for an improvement to the Hydra API to lead to a disappointing result? It should be exactly as powerful as what can be done through SQL queries, since it's just a wrapper above the latter. But yes, I share your lack of enthusiasm for writing Perl, though I do think it would be worth the cost if it meant automation use cases become easier.

Note: I'm only spending time discussing this because it's unlikely to happen within a very short timeframe anyway - I'm not fully against going the "auth'd readonly PgSQL replica" route even if I feel like it's the inferior solution, since I feel like we all have a better idea of the amount of work that would be involved to implement it.

RaitoBezarius commented 4 months ago

I am not sure I agree with the security issues: what are the concerns concretely? Having a read replica owned does not seem to be that big of a deal as long as everyone understands this data could be manipulated in case of an attack.

Maybe my security model here is miscalibrated. I'm worried that PgSQL isn't particularly designed for the case of a malicious actor having direct query access. And the risks I'm worried about are access to the user data (assuming we actually have a full replica, not just partial) as well as code execution from a machine we're responsible for - with potential escalation/pivoting risks from there. Your examples suggest that maybe this isn't the case - or at least that some people though this isn't the case :)

It is, though, kinda? So we don't really use that, but PostgreSQL has features like row-level security (RLS) to enable "semantic users" to live directly at the PostgreSQL side. I might be biased and maybe trust overly PostgreSQL here in this instance, but compared to the other security risks we have in the project, PostgreSQL seems rock solid and probably not the low-hanging fruit for an attacker.

Now, there are two risks:

user data compromise: remember that most if not all PII is available on hydra.nixos.org, you can just scrape it. If that's a real concern for us, I would suggest that we back up all the PII somewhere, and we clear it from the main instance or something (modulo user accounts) and we rely on maintainer-list.nix source of truth to fill this need.
code execution on the machine: of course, that's a concern, we are publicly exposing an interpreter, I believe though that we ought and can harden significantly the PostgreSQL instance to make it hard to use it as something to perform further actions, and we have active people around watching for what's going on.

I strongly recommend not to take a full replica, and use logical replication to drop explicit columns and tables if needed. Hence, certain data will not exist on the replica.

If there's a concern for running this sort of instance on infrastructure duties, I can offer 2 VMs where one will receive the replication from the main instance privately with credentials and another VM which will replicate again against the secondary trusted instance and be the untrusted instance.

Is there a reason for an improvement to the Hydra API to lead to a disappointing result? It should be exactly as powerful as what can be done through SQL queries, since it's just a wrapper above the latter. But yes, I share your lack of enthusiasm for writing Perl, though I do think it would be worth the cost if it meant automation use cases become easier.

Basically, the way I see it is that:

(a) people does not know what the Hydra database contains and can do so they don't notice what they could do because they usually don't access to some exploratory database to fire up some queries and realize : oh I could just do X (b) people have some threshold to contribute to the Hydra API to make it do something useful and while we could have people thinking in the stead of everyone else to build useful Hydra APIs, if I was asked to do it, it would cost me a lot of personal efforts and energy to do this, so I assume there won't be a lot of people stepping up to offer "I can design Hydra APIs for your data needs as a community service" — maybe, there's not too many Hydra APIs we need to design though? The Haskell people are probably the first to ask, and anyone can take their code and look at what is needed to make their code work sanely with some sort of Hydra API. (c) you will have some lag always between "I need X" and "I have X" which you don't have with the SQL replica where you can go from "I need X" to "I have X" without asking anyone and just running your queries over the replica

For me, the value here is really about letting people just explore/do/build before sitting down and designing and having people explaining to us what is their need, etc. We can ask people to come and do it, but if this ends up being yet again another moment where people fill issues over Hydra and there's not anyone comfortable to go over them, I fear this might be dangerous to people's goodwill.

So taking your own words, I personally trust the infra team for infrastructure matters, but do you think we should also give a second chance to having a Hydra development team (which I'm not sure even exist at the moment, does it?)?

If so, the Hydra API usecase makes sense and we can proceed with that instead, but if not, I would still recommend the replica solution.

Note: I'm only spending time discussing this because it's unlikely to happen within a very short timeframe anyway - I'm not fully against going the "auth'd readonly PgSQL replica" route even if I feel like it's the inferior solution, since I feel like we all have a better idea of the amount of work that would be involved to implement it.

We can have auth'd version, and we could re-expose unauth'd version if we re-export while removing all the PII and agree on that in the future.

And I believe this will make it easier to build the Hydra API and decrease the need for the direct usage of the replica.

How do you see the timeframe for this BTW?

NixOS / infra

Open hydra.nixos.org database read replica #348