JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.4k stars 5.45k forks source link

Computing the CRC hash of the sysimage costs a non-negligible startup time #50166

Open KristofferC opened 1 year ago

KristofferC commented 1 year ago

The red box below is the time to compute the CRC hash of the sysimage as part of julia_init.

image

@JeffBezanson noticed that this CRC is only used to update the build IDs of modules. It should therefore be fine to do this CRC hashing at build time and store it somewhere. Another alternative is to do the computation on demant.

stevengj commented 1 year ago

In principle, we could save the CRC of the sysimg in the sysimg file itself, using the CRC-compensation trick: https://stackoverflow.com/questions/8608219/save-a-crc-value-in-a-file-without-altering-the-actual-crc-checksum

(Would be nice to have that CRC-reversal function in the stdlib.)

Update: submitted a PR #50269 to add this functionality to the CRC32c stdlib.

stevengj commented 1 year ago

this CRC is only used to update the build IDs of modules.

If that's the case, why are we using a checksum at all? Why not just generate a unique ID (e.g. a UUID) for each sysimage build and store it in the sysimage?

vchuravy commented 1 year ago

I thought we are using it as a verification mechanism to make sure that cache files are only loaded on matching sysimages

KristofferC commented 1 year ago

Wouldn't that work the same with a UUID?

PallHaraldsson commented 1 year ago

Please load CRC on demand (and/or use whatever very fast method e.g. in a thread).

I noticed at julia_init "initialize many things, in no particular order", so for the general issue the slowish startup, could we start with the slowest part, LOAD_Sysimg (is that jl_load_sysimg_so?), in a separate thread (or if better a new thread for the other things)? I suppose LLVM was at some point needed when native code wasn't stored, so loaded first but no longer needed at all always, or at least to start with. [And for LOAD_Sysimg itself, not sure what's slowest there, another thread for longest sub-part there (recursively)?]

vchuravy commented 1 year ago

You can't do much before the sysimg is loaded so that's a moot point. We must calculate the CRC before we load it since we will mmap the file and start modifying the mapped pages.

Loading the Julia compiler lazily is an interesting idea

PallHaraldsson commented 1 year ago

I'm confused, yes you can't do much before loading the sysimage (I suggested stating with that), but you can mmap it without a CRC; and start modifying (you mean the sysimage itself?) something, when do you really need to?! "this CRC is only used to update the build IDs of modules". I'm not sure what that means exactly, if it's for modules not relating to the sysimage then fine, and even if, why not defer doing that? And if/since this is for cache files, then those are not updated at startup?!

vchuravy commented 1 year ago

Currently we:

  1. dlopen sysimg
  2. Calculate CRC of unmodified data
  3. Perform data-relocation/deserialization

The CRC is used as a safety mechanism against corruption of the system-image. Cache files are dependent on the sysimage and any data-corruption would be hell to debug. Yes we could store the CRC somewhere, but that would defeat that purpose.

PallHaraldsson commented 1 year ago

I realize that, I'm thinking a) you can run julia for scripts (an important use-case, also for benchmarks...) and never need packages/modules, so verifying CRCs in cache files is not needed, thus the CRC not needed. I.e. I think we can rely in the sysimage not corrupted, as with any other binariy (executable) files, such as julia itself. b) When/if the CRC (or its possible replacement) is actually needed for anything, then an UUID seems better, to not scan the sysimg file again (for the unmodified file, rather than changed mmaped version), and if it really needs to be a 32-bit value (i.e. what you expect in cache files), then some just use some random one or hash of UUID could be ok, that is would be stored in the sysimg?

stevengj commented 1 year ago

The CRC is used as a safety mechanism against corruption of the system-image.

If that's the case, shouldn't you also immediately abort if the CRC has changed, rather than only using it to decide whether to invalidate module cache files?

Currently, if the sysimage was corrupted so that the CRC changed, what happens? Doesn't it just think that the .ji files are stale? Aren't we talking about this code in loading.jl?

If you're only using it to determine cache staleness and build IDs, then just a UUID would work as well (or better — it's more unique than a 32-bit CRC). If you are using it to check for corruption, then it seems like you need to store the CRC in the sysimage file, ala #50269, and compare to it during init.

timholy commented 1 year ago

The bug this fixed has to do with PackageCompiler (https://github.com/JuliaLang/julia/issues/48354#issuecomment-1448934384) for which the "expanded" sysimg presumably has the same UUID as the default sysimg. However, this may be more conservative than needed.

PallHaraldsson commented 1 year ago

You meant CRC, since that's used, not yet UUID, but I see it added in this unmerged code of yours in case it's helpful:

https://github.com/JuliaLang/julia/pull/48836/files

Currently, if the sysimage was corrupted so that the CRC changed, what happens?

That shouldn't happen with modern file-systems (that have their own CRC, or the disks themselves)?! I don't think people are thinking of an intentional change/corruption to the sysimg, then CRC isn't enough, so I repeat just drop this CRC (for an UUID, or if paranoid cryptographic checksum), or at least not worry afbout checking it for corruption, as opposed to the verification?

I think realistically we can only expect truncated files, if untaring with disk full or copying/downloading is such situation, so simply storing a file length in the sysimg and checking it might be enough for checking the sysimg itself?

https://superuser.com/questions/1566851/can-original-file-get-corrupted-during-copying

Any chance? Yes, but it's more likely that you get struck by lightning while being bitten by a shark... Theoretically it could happen, but the chance is so low it is realistically impossible as the original file is only read and not written to. [..]

timholy commented 1 year ago

My point was that the UUID doesn't work, but the CRC does. As summarized in the discussion of that pull request, it wasn't merged because it didn't fix the problem.

stevengj commented 1 year ago

@timholy, I fail to see why a UUID wouldn't work, if it were re-generated uniquely for every sysimage build and embedded in the image. Or if you are happy with a 32-bit ID, then a random UInt32.

In any case this seems orthogonal to whether you need to compute the checksum or ID at load time, versus just storing it in the sysimage. It seems we are currently only using the checksum for labeling package builds, and aren't using it to check for corruption.

timholy commented 1 year ago

I'm not saying it can't work, I'm explaining why we went with the obviously-safe CRC and the nature of the failure mode with the naive UUID. There are enough places the UUID gets used that when it comes to PackageCompiler, it will presumably take a bit of effort to disentangle the potential confusion around "which UUID: the 'bare' sysimage that was loaded or the 'expanded' sysimage that's being created?"

KristofferC commented 1 year ago

To me it feels pretty obvious. PackageCompiler creates a new sysimage so it should have a new UUID?

KristofferC commented 1 year ago

From what I understand the CRC is already stored in the sysimage header: https://github.com/JuliaLang/julia/blob/0e8af1c1620cbf5304c8a7cabbc5475ec48a78ec/src/staticdata.c#L2772-L2774

So instead of recomputing it at startup, we could just grab it from there. I don't think it is really reasonable to check a single file for random corruption on every Julia startup.