flora-pm / flora-server

A package index for the Haskell ecosystem
https://flora.pm/about

[FLORA-7] Tarball support #452

Closed RaoulHC closed 8 months ago

RaoulHC commented 9 months ago

Proposed changes

This supports hosting source code by constructing a directory tree and storing files and directories by hash in the database.

This both reduces the space taken by tarballs (by about 10-20% with the packages I've tested) and removes the need for metadata revisions, because small updates become cheap.
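
A minimal sketch of the idea, assuming a SHA-256 content hash (via the crypton package) and a callback standing in for the real database insert; the names here are illustrative, not the PR's actual schema:

```haskell
import Crypto.Hash (Digest, SHA256, hash)
import Data.ByteString (ByteString)

-- Blobs are keyed by the SHA-256 of their contents, so identical files
-- shared between releases (or between revisions of one release) are
-- stored exactly once.
contentHash :: ByteString -> Digest SHA256
contentHash = hash

-- Storing is idempotent: with the hash as primary key, re-inserting the
-- same contents is a no-op (think ON CONFLICT (hash) DO NOTHING), which
-- is what makes small revisions cheap.
storeBlob
  :: (Digest SHA256 -> ByteString -> IO ())  -- ^ hypothetical DB insert
  -> ByteString
  -> IO (Digest SHA256)
storeBlob insert contents = do
  let h = contentHash contents
  insert h contents
  pure h
```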

I'm keeping this as a draft to begin with while I finish off some bits, but discussion about the general approach and structure would be much appreciated.

Couple of things left to do:

There is some stuff that might be better placed in future PRs; I can raise issues if we decide that.

tchoutri commented 9 months ago

@RaoulHC excellent usage of the dynamic effects

RaoulHC commented 9 months ago

I think this is good for a more in-depth review now :) I'll create some issues for the future work.

I had a bit more of a look into the libpq "too many connections" error, and I think it's due to the fact that two pools are created here, one for odd-jobs and one for the flora server. Even if each pool is configured below the maximum (96 in environment.sh vs PostgreSQL's default of 100 connections), the two pools together can easily go above it once one of them is maxed out, as it will be if you try to import the tarballs of lots of releases. If that sounds right I'll raise a separate issue for dealing with it; I'm not sure what the solution would be.

tchoutri commented 9 months ago

Yes, that matches my own conclusions, which is why we should move from 96 to 50 and keep the 100 max connections on the Postgres side.
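
For the record, the arithmetic: two pools capped at 96 each can demand up to 96 + 96 = 192 connections, well past PostgreSQL's default max_connections of 100, whereas 50 + 50 = 100 stays at the limit. A sketch with resource-pool, where the sizes and connection string are placeholders rather than Flora's actual configuration:

```haskell
import Data.Pool (Pool, defaultPoolConfig, newPool)
import Database.PostgreSQL.Simple (Connection, close, connectPostgreSQL)

-- Placeholder caps: the invariant is that the pools together must not
-- exceed PostgreSQL's max_connections (100 by default).
serverPoolSize, jobsPoolSize :: Int
serverPoolSize = 50 -- flora-server pool
jobsPoolSize   = 50 -- odd-jobs pool; 50 + 50 = 100

mkPool :: Int -> IO (Pool Connection)
mkPool cap =
  newPool $
    defaultPoolConfig
      (connectPostgreSQL "postgresql://...") -- acquire
      close                                  -- release
      60                                     -- idle lifetime, in seconds
      cap                                    -- hard cap for this pool
```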

RaoulHC commented 9 months ago

Yes, that matches my own conclusions, which is why we should move from 96 to 50 and keep the 100 max connections on the Postgres side.

There is something slightly unsatisfying about that, but perhaps there isn't a good way of sharing the connections between the server and the jobs pool, and especially not if we end up having multiple distributed server front ends or something along those lines.

tchoutri commented 9 months ago

I'll kindly ask my DBA friends for information & heuristics about pool allocation. I do have pgBouncer set up, however, so that will be taken into account in prod.

MangoIV commented 9 months ago

I like this ❤️

tchoutri commented 9 months ago

@RaoulHC If I may get a little anxious in anticipation of the worst: I'm not overly confident about the de-structuring of the tarball without keeping the original on hand. Shouldn't we keep them on disk? This would also speed up the process of sending the tarball because we can delegate efficiency to nginx or directly use sendfile if we must.

tchoutri commented 9 months ago

Also, did you encounter such errors locally? <stdout>: hPutBuf: resource vanished (Broken pipe)

RaoulHC commented 9 months ago

@RaoulHC If I may get a little anxious in anticipation of the worst: I'm not overly confident about the de-structuring of the tarball without keeping the original on hand. Shouldn't we keep them on disk? This would also speed up the process of sending the tarball because we can delegate efficiency to nginx or directly use sendfile if we must.

I think if we're also keeping it on disk then we lose the benefits of de-structuring it, though we could store the hash of the final compressed tarball to have some sort of check that it never changes, and make sure that's something that gets checked in CI. The tarball this method creates is slightly different to the original one, as it ensures there aren't duplicate files (which do exist in Hackage tarballs!), gives more consistency to the directory order, and throws away unnecessary file metadata.
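
A sketch of that kind of normalisation with the tar package (assuming the pre-0.6 API; the exact rules are illustrative, not this PR's implementation): deduplicate entries by path, give them a stable order, and zero out ownership and timestamps:

```haskell
import qualified Codec.Archive.Tar as Tar
import qualified Codec.Archive.Tar.Entry as Tar
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map.Strict as Map

normaliseTarball :: LBS.ByteString -> LBS.ByteString
normaliseTarball raw =
  Tar.write
    . Map.elems     -- Map iteration yields a stable, sorted path order
    . Map.fromList  -- later duplicates overwrite earlier ones
    . map (\e -> (Tar.entryPath e, scrub e))
    $ Tar.foldEntries (:) [] (error . show) (Tar.read raw)
  where
    -- Drop metadata that varies between otherwise identical uploads.
    scrub e = e { Tar.entryTime = 0
                , Tar.entryOwnership = Tar.Ownership "" "" 0 0
                }
```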

With regard to sending the file, setting up some sort of caching would definitely make sense for the tarball endpoints. I've briefly used Squid in the past, but it's something I'm certainly not an expert in.

Also, did you encounter such errors locally? <stdout>: hPutBuf: resource vanished (Broken pipe)

Hmmm, no, I've not encountered that. Have you run out of space on whatever disk/partition you're using (physical space or inode count)? I'd have to get more information otherwise, because "resource vanished" isn't particularly informative.

tchoutri commented 9 months ago

I think if we're also keeping it on disk then we lose the benefits of de-structuring it, though we could store the hash of the final compressed tarball to have some sort of check that it never changes, and make sure that's something that gets checked in CI. The tarball this method creates is slightly different to the original one, as it ensures there aren't duplicate files (which do exist in Hackage tarballs!), gives more consistency to the directory order, and throws away unnecessary file metadata.

So, this slightly bugs me, because it alters the original archive as uploaded by the author. It means that when we inevitably mess something up in the code, we will have to either fetch the archives back from Hackage (easy) or kindly ask the authors to re-upload their code, and that is indeed problematic. Moreover, the fact that the archives differ means that the hashes differ, and if someone creates a detached signature for their archive (or if another tool gains support for that), then there is a mismatch.

I fully understand the argument for space efficiency, however. The angle I'm trying to explore here is "what do we do in case of disaster, what are our recovery options?".

I think if we're also keeping it on disk then we lose the benefits of de-structuring it

Don't sell yourself short: it's very useful to be able to browse the archive from within Flora! I use it myself on Hackage to smuggle documentation, like https://hackage.haskell.org/package/pg-entity/src/docs/book/index.html

tchoutri commented 9 months ago

Hmmm, no, I've not encountered that. Have you run out of space on whatever disk/partition you're using (physical space or inode count)? I'd have to get more information otherwise, because "resource vanished" isn't particularly informative.

Ah bollocks. No, I don't believe I'm running out of space. Indeed, "resource vanished" is not useful at all.

tchoutri commented 9 months ago

With regard to sending the file, setting up some sort of caching would definitely make sense for the tarball endpoints. I've briefly used Squid in the past, but it's something I'm certainly not an expert in.

I use Varnish in front of flora-server, it is very very useful!

tchoutri commented 9 months ago

@RaoulHC I'm looking at the CI logs; are you sure you're using the correct module from our version of @hackage/base16?

RaoulHC commented 9 months ago

So, this slightly bugs me, because it alters the original archive as uploaded by the author. It means that when we inevitably mess something up in the code, we will have to either fetch the archives back from Hackage (easy) or kindly ask the authors to re-upload their code, and that is indeed problematic. Moreover, the fact that the archives differ means that the hashes differ, and if someone creates a detached signature for their archive (or if another tool gains support for that), then there is a mismatch.

I fully understand the argument for space efficiency, however. The angle I'm trying to explore here is "what do we do in case of disaster, what are our recovery options?".

I think it's worth noting that Hackage also does this; there has never been any guarantee that the tarball provided by Hackage is exactly the same as the one the author uploaded, see here, which normalises paths, changes hard links to soft links, and throws away anything else. Looking at that also made me realise we should support symbolic links if Hackage does, but perhaps that can just be another PR.

As for disaster recovery, yes, maybe it's a good idea to store tarballs somewhere, though in that case should they be stored separately from the blob store? Perhaps somewhere we only write new tarballs to, and only read from if we need DR.

RaoulHC commented 9 months ago

Doh, I missed committing my change to flora.cabal. I've used base16-bytestring because it's already a transitive dependency, and it's in Horizon, which base16 is not.

RaoulHC commented 9 months ago

Looking a bit more at the resource vanished error, I think it's something to do with the logging, given it's a file descriptor that's causing the issue. Did it only happen in one FetchTarball job, or multiple?

Given that it does seem like only one logger is created in Server.hs and then passed around, maybe it's one of the putStrLns scattered about the codebase that's causing some concurrency issue?
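
For what it's worth, that failure mode is easy to reproduce outside Flora: GHC raises exactly this exception when a handle's underlying pipe loses its reader, which fits the logging/stdout theory. A standalone repro (the file name is hypothetical):

```haskell
-- Run as: runghc Pipe.hs | head -n 1
-- `head` closes the pipe after one line, and a subsequent write fails with:
--   <stdout>: hPutBuf: resource vanished (Broken pipe)
import Control.Monad (forM_)

main :: IO ()
main = forM_ [1 :: Int .. 100000] print
```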

tchoutri commented 9 months ago

Looking a bit more at the resource vanished error, I think it's something to do with the logging, given it's a file descriptor that's causing the issue. Did it only happen in one FetchTarball job, or multiple?

A bunch of them. I spared you the whole log, but it means that a certain proportion of releases end up with no archive hash.

tchoutri commented 9 months ago

I think it's worth noting that Hackage also does this; there has never been any guarantee that the tarball provided by Hackage is exactly the same as the one the author uploaded, see here, which normalises paths, changes hard links to soft links, and throws away anything else. Looking at that also made me realise we should support symbolic links if Hackage does, but perhaps that can just be another PR.

Oh interesting, thank you! Indeed it seems interesting to normalise archives a bit. However in "meta-index mode" we would need to disable our own optimisations in order to serve a strictly equivalent archive so that the hash we get from the cabal archive's JSON file matches.

As for disaster recovery, yes maybe it's a good idea to store tarballs somewhere, though in which case should it be stored somewhere separately from the blob store? Perhaps somewhere you only write new tarballs to, and only read if you need DR.

I'm thinking of the following workflow:

  1. A tarball gets uploaded
  2. The original gets stored on disk, then sent to cold storage.
  3. The processed version gets stored within Flora
  4. The processed version gets cached by varnish/nginx/squid/whatever
  5. a) We need to refine / fix a bug in the processing of the tarballs, which is good because we have the originals at hand, so we can re-import them.
     b) We need to recover from an incident: A backup of the tarballs can be re-imported from the cold storage.

And indeed I'd like to set up a DR process where I'd nuke parts of the system and see how much time it takes to get back up and running.
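
A sketch of how steps 1–3 might compose; every name below is a hypothetical stand-in, and only the ordering is the point:

```haskell
import qualified Data.ByteString.Lazy as LBS

handleUpload
  :: (LBS.ByteString -> IO ())           -- ship the original to cold storage (step 2)
  -> (LBS.ByteString -> IO ())           -- store the processed version in Flora (step 3)
  -> (LBS.ByteString -> LBS.ByteString)  -- the normalisation pass
  -> LBS.ByteString                      -- the uploaded tarball (step 1)
  -> IO ()
handleUpload toColdStorage storeInFlora normalise original = do
  toColdStorage original            -- the pristine original is never altered
  storeInFlora (normalise original)
  -- Step 4 (varnish/nginx caching) lives at the HTTP layer, and
  -- step 5 (recovery) re-imports the originals from cold storage.
```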

RaoulHC commented 9 months ago

Oh interesting, thank you! Indeed it seems interesting to normalise archives a bit. However in "meta-index mode" we would need to disable our own optimisations in order to serve a strictly equivalent archive so that the hash we get from the cabal archive's JSON file matches.

Ah, do you envision the meta-index mode also providing tarballs? Yeah, in that case we would probably want to serve the original tarballs. Having said that, if anyone did want to point their cabal instance at Flora, I believe cabal would store it all under a different repo folder, so having a different hash wouldn't really matter. Though there might be other reasons to mirror the exact tarballs.

I'm thinking of the following workflow:

1. A tarball gets uploaded
2. The original gets stored on disk, then sent to cold storage.
3. The processed version gets stored within Flora
4. The processed version gets cached by varnish/nginx/squid/whatever
5. a) We need to refine / fix a bug in the processing of the tarballs, which is good because we have the originals at hand, so we can re-import them.
   b) We need to recover from an incident: A backup of the tarballs can be re-imported from the cold storage.

And indeed I'd like to set up a DR process where I'd nuke parts of the system and see how much time it takes to get back up and running.

That all sounds fairly reasonable, though I think it's somewhat orthogonal. Perhaps it's something to have before making tarball storing the default behaviour, but it could be another PR? I'll raise an issue if you think so. If we use labelled effects we could use the blob store API for storage but allow users to configure it to be stored somewhere else. That could be their own filesystem storage (you could set up a cold-storage NFS or similar), and when we add S3 support, I see Amazon has Glacier storage for this sort of use case.
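
A minimal sketch of what such a swappable backend can look like with effectful's dynamic effects; the effect and its interpreter here are hypothetical, not Flora's actual blob store API:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}

import Crypto.Hash (Digest, SHA256, hash)
import Data.ByteString (ByteString)
import Effectful
import Effectful.Dispatch.Dynamic (interpret, send)

-- A dynamic effect: call sites program against this interface, and the
-- backend is chosen only where the effect is interpreted.
data BlobStore :: Effect where
  PutBlob :: ByteString -> BlobStore m (Digest SHA256)

type instance DispatchOf BlobStore = Dynamic

putBlob :: BlobStore :> es => ByteString -> Eff es (Digest SHA256)
putBlob = send . PutBlob

-- One interpreter targets the local filesystem; another could ship the
-- bytes to S3/Glacier instead, without touching any call site.
runBlobStoreFS :: IOE :> es => Eff (BlobStore : es) a -> Eff es a
runBlobStoreFS = interpret $ \_ -> \case
  PutBlob bytes -> do
    let digest = hash bytes
    -- hypothetical: liftIO $ BS.writeFile ("blobs/" <> show digest) bytes
    pure digest
```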

tchoutri commented 9 months ago

Ah, do you envision the meta-index mode also providing tarballs?

I think right now my answer would be no. These artefacts are better served by each index's CDN (Fastly for Hackage, GitHub's, etc).

Having said that, if anyone did want to point their cabal instance at Flora, I believe cabal would store it all under a different repo folder, so having a different hash wouldn't really matter.

Should a Flora instance be used as a package repo, then yes, the tarballs would be stored under their own directory. But using flora.pm itself is not viable, because it aggregates packages with the same name and cabal does not provide a way to distinguish those packages by namespace within a single repo; the overlay model is preferred.


That all sounds fairly reasonable, though I think it's somewhat orthogonal.

Yes, the only addition I will ask of this PR is to store the original tarball, and then I think we're good to merge!

RaoulHC commented 9 months ago

Ok, I've updated it to import the unaltered archives directly into the blob store now, and to store the full hash of each archive as a column in the releases table. I also deleted the getRelease function, which I tried to use before realising it was quite broken; it's not used anywhere and references columns that the releases table doesn't have.

I'll write up an issue now about fleshing out the DR process for this.

RaoulHC commented 9 months ago

Just a follow-up thought, but I realised this is storing the tarballs uncompressed, the thinking being that we'll leave compression to whatever storage is being used. All the content is recoverable, so I don't think it's an issue, but if different compression options are used then the hash of the compressed tarball might differ.
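
The caveat is easy to see concretely: gzip output is not canonical, so the same bytes compressed with different settings produce different streams, and therefore different digests, even though decompression recovers identical content. A sketch using zlib and crypton, nothing Flora-specific:

```haskell
import qualified Codec.Compression.GZip as GZip
import Crypto.Hash (Digest, SHA256, hashlazy)
import qualified Data.ByteString.Lazy as LBS

main :: IO ()
main = do
  let content = LBS.pack (replicate 10000 42) -- stand-in for tar bytes
      fast  = GZip.compressWith
                GZip.defaultCompressParams { GZip.compressLevel = GZip.bestSpeed }
                content
      small = GZip.compressWith
                GZip.defaultCompressParams { GZip.compressLevel = GZip.bestCompression }
                content
  print (hashlazy fast  :: Digest SHA256) -- two different digests...
  print (hashlazy small :: Digest SHA256)
  print (GZip.decompress fast == GZip.decompress small) -- ...same content: True
```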

tchoutri commented 8 months ago

Just a follow-up thought, but I realised this is storing the tarballs uncompressed, the thinking being that we'll leave compression to whatever storage is being used. All the content is recoverable, so I don't think it's an issue, but if different compression options are used then the hash of the compressed tarball might differ.

Okay, let's split this in another ticket. This PR is already rich enough with shiny new stuff. :)