Hirevo / alexandrie

An alternative crate registry, implemented in Rust.
https://hirevo.github.io/alexandrie/
Apache License 2.0

Cargo checksum verification fails #151

Open hpsjakob opened 1 year ago

hpsjakob commented 1 year ago

Hi,

I have a problem with a private Alexandrie registry and Cargo: I have three projects: A, B and C.

B depends on A. C depends on A and B. A and B are deployed to Alexandrie.

When I try to build project C, I get the following error:

# cargo tree
error: failed to verify the checksum of `B v1.1.0 (registry `my-registry`)`

If I replace the dependency on project B with a direct git dependency, it works without any problems. So I guess it has something to do with having a dependency in the registry that itself has a dependency in the private registry.

Could you give me a hint on how to debug this problem?

Hirevo commented 1 year ago

Hi,

I've tried to replicate the situation you described using a fresh build of Alexandrie (built with the frontend and sqlite features), and in my case, C was built successfully without errors and cargo tree worked as well.

But, while taking a closer look at how the crate checksums are computed within the crate publication routine, I think this is possibly due to the crate being larger than the currently hardcoded size limit of 10 MB.
In that case, the crate would have been silently truncated before being stored, which could lead to its checksum differing from the one Cargo computed before publication.
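
To illustrate why such truncation would surface as a checksum mismatch rather than a more descriptive error: Cargo records the SHA-256 of the .crate archive as the cksum in the index, and hashing a truncated copy of the same bytes yields a completely different digest. The following is a minimal sketch using the sha2 crate and made-up bytes, not Alexandrie's actual publication code:

// Illustration only: truncating an archive changes its SHA-256 digest,
// so Cargo's comparison against the cksum recorded in the index fails.
// Assumed dependency: sha2 = "0.10"
use sha2::{Digest, Sha256};

fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let full = vec![0xAB_u8; 16 * 1024]; // stand-in for a published .crate archive
    let truncated = &full[..10 * 1024];  // the same bytes, silently cut off by a size limit

    println!("full:      {}", hex(&Sha256::digest(&full)));
    println!("truncated: {}", hex(&Sha256::digest(truncated)));
}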

I think you can verify whether this is what actually happened by either:

This is not the first time there have been issues with this quite-low hardcoded size limit, which I picked arbitrarily a long time ago.
I should have gotten around to making it adjustable quite a while ago, but I was unaware of the "silent truncation without any errors" behaviour.
I promise to fix both of these things very soon (by the end of next week at the latest, but it should be sooner).

Hirevo commented 1 year ago

The maximum crate size is now configurable using a new max_crate_size configuration option (under the [general] section), and its default value has been increased to 50 MB (compared to the previously hardcoded 10 MB).

Feel free to try adjusting this value and see if it indeed solves your issue.

hpsjakob commented 1 year ago

Hi,

thanks for your response and for your quick fix. Unfortunately, I believe this is not my issue: I just checked my crate file size and it's 2.9 KB.

One really strange thing I noticed is that on one PC I don't run into this issue, while on two others I do. My dev setup is in a Docker container, so everything should be exactly the same.

Do you have any other ideas for what I could try in order to track down this bug?

Regards, Jakob

Hirevo commented 1 year ago

It is quite worrying to have an error like this possibly buried in the code.
I'll be investigating more, and I'll come back to share whatever lead I find.
Thanks for identifying that this problem is unrelated to my initial theory.

Hirevo commented 1 year ago

I've looked into this some more, but I could not identify where this issue is coming from.

By doing some research, it seems that Cargo can sometimes emit this error when working with crate versions containing a + symbol in them (rust-lang/cargo#7180).
It seems from your original post that the version of crate B is v1.1.0 and therefore shouldn't be affected by this, but I wanted to mention it in case that v1.1.0 was just a placeholder version and this turned out to be what's causing it.

If this is not the reason, one thing that could help in figuring out what happened would be to compare:

- the checksum recorded for crate B in the registry's crate index,
- the checksum of the .crate archive downloaded directly from the registry's download endpoint (with curl, for example),
- the checksum recorded for crate B in your project's Cargo.lock.

With these three values, I suspect it could be easier to determine what's going on, especially if only one of these three is different.

hpsjakob commented 1 year ago

Hi,

thanks again for looking into it. It took me a while to find time to test this, but here are my results: my version of B is 0.4.0, so it should not be affected by that issue.

I've tested on two PCs: my laptop and the server. On both I use the same Docker container, so there should be no difference. On both systems I ran the curl command you suggested, and in both cases I got the same checksum.

When I ran cargo tree, it succeeded on my laptop, while it failed with the checksum error on the server.

The checksum I collected from the crate index was also equal to the ones I got from curl.

Also, in the Cargo.lock file, the URLs were the same as the ones I found before.

So all checksums matched.

From the curl command I concluded that Alexandrie does not use the token to restrict read access. So I tried removing the token in the CI job, and then cargo tree worked without problems.

So I guess Cargo is sending the token in the request when fetching the crate, and maybe this causes Alexandrie to send something else. Maybe a 404 page...

Does this help you find the bug? Should I test something else?

Hirevo commented 1 year ago

Thank you for coming back and continuing to help get to the bottom of this issue.

I have done more testing and tried running cargo tree with various tokens (an invalid token, a missing token, a present-but-empty token), and none of them triggered any issue with Cargo, which makes sense since the crate download endpoint is indeed unauthenticated.

But then, I managed to recreate the checksum verification error by forcing the registry to return an error for that endpoint.
It seems this is the error Cargo always shows if the crate download endpoint responds with an error of any kind.

The crate download endpoint can indeed error if the crate can't be found in the registry's database:

https://github.com/Hirevo/alexandrie/blob/3e89db19f9643857c788993bb3212094622b441a/crates/alexandrie/src/api/crates/download.rs#L51-L53

But this Cargo error would also appear if any of the ? operators we use in that endpoint encountered an error.
Looking at the code, this could therefore be either a database-related error, an error when accessing the crate from the crate store (if a crate archive is not found), or an I/O error when reading its contents (since we do a read_to_end there).
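
To make those failure modes concrete, here is a rough, self-contained sketch of that error-propagation pattern; the types, names, and error variants below are stand-ins invented for illustration and are not the actual handler linked above:

// Sketch only (NOT Alexandrie's real handler): each `?` below is a distinct
// failure point, yet Cargo reports every one of them as a checksum failure,
// since all it sees is an unsuccessful crate download.
use std::io::{self, Read};

struct Db;         // stand-in for the registry's database handle
struct CrateStore; // stand-in for the crate archive store

#[allow(dead_code)]
#[derive(Debug)]
enum DownloadError {
    Database(String), // e.g. SQLite reporting "database is locked"
    NotFound,         // crate/version missing from the database
    Io(io::Error),    // failure while reading the stored archive
}

impl From<io::Error> for DownloadError {
    fn from(err: io::Error) -> Self {
        DownloadError::Io(err)
    }
}

impl Db {
    fn find_crate(&self, _name: &str, _version: &str) -> Result<u64, DownloadError> {
        Ok(42) // a busy database would return Err(DownloadError::Database("database is locked".into()))
    }
}

impl CrateStore {
    fn open_crate(&self, _id: u64) -> Result<impl Read, DownloadError> {
        Ok(io::empty()) // a missing archive would return Err(DownloadError::NotFound)
    }
}

fn download(db: &Db, store: &CrateStore, name: &str, version: &str) -> Result<Vec<u8>, DownloadError> {
    let id = db.find_crate(name, version)?;  // 1. the database lookup can fail
    let mut archive = store.open_crate(id)?; // 2. the archive can be missing from the store
    let mut bytes = Vec::new();
    archive.read_to_end(&mut bytes)?;        // 3. reading its contents can hit an I/O error
    Ok(bytes)
}

fn main() {
    println!("{:?}", download(&Db, &CrateStore, "B", "0.4.0").map(|bytes| bytes.len()));
}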

You can try to find out which of these cases happened by looking at the logs from that day; the response errors are typically visible in there.
Note that they still show up with a 200 status code, because that is how Cargo normally expects errors to be reported.

They should look something like this within the logs:

[screenshot of example log lines]

hpsjakob commented 1 year ago

Thank you for investigating further.

I've checked the logs as you described. These lines look suspicious to me:

web_1  | Jun 09 07:04:32.234 INFO <-- GET /api/v1/crates/A/1.1.0/download
web_1  | Jun 09 07:04:32.256 INFO <-- GET /api/v1/crates/B/0.4.0/download
web_1  | Jun 09 07:04:32.257 INFO --> GET /api/v1/crates/B/0.4.0/download 200 0ms, SQL error: database is locked
web_1  | Jun 09 07:04:32.261 INFO --> GET /api/v1/crates/A/1.1.0/download 200 26ms, OK

I'm using the SQLite variant.

I guess this could also explain why the problem only happens on my CI server: it is in the same cloud datacenter, so the connection between the CI server and the Alexandrie server is a lot faster.

NotGovernor commented 1 year ago

I want to add another instance that might be related and reproducible.

My project builds fine with cargo build. It also builds and runs fine inside Docker (see the Dockerfile below). But I get the error discussed in this thread only when I build that same Dockerfile from a docker-compose file (see the compose file below as well). It fails on surrealdb's checksum.

/backend/Cargo.toml

[dependencies]
actix-web = "4"
serde = {version = "^1", features = ["derive"]}
serde_json = "1"
env_logger = "0.8"
# log = "^0.4"
surrealdb = "1.0.0-beta.9"
thiserror = "1"
dotenv = "0.15.0"
chrono = { version = "0.4.26", features = ["serde"] }

/backend/Dockerfile.development

FROM clux/muslrust:1.70.0 as builder
WORKDIR /app
ARG CARGO_BUILD_TARGET=x86_64-unknown-linux-musl

ARG RUST_RELEASE_MODE="debug"
COPY . .

RUN --mount=type=cache,target=/app/target \
    cargo build --target ${CARGO_BUILD_TARGET} \
    && cp ./target/$CARGO_BUILD_TARGET/$RUST_RELEASE_MODE/salesbackend /app/salesbackend;

FROM alpine:3 as runner
COPY --from=builder /app/salesbackend /app/salesbackend
EXPOSE 6996
CMD ["/app/salesbackend"]

/docker-compose.yml

version: "3.8"

services:
  ZeroSalesBackend:
    image: [redacted]
    restart: unless-stopped
    build:
      context: ./backend/
      dockerfile: Dockerfile.development

Error:

...
#9 2.406   Downloaded slab v0.4.8
#9 2.407   Downloaded signal-hook-registry v1.4.1
#9 2.408   Downloaded surrealdb v1.0.0-beta.9+20230402
#9 2.427 error: failed to download replaced source registry `crates-io`
#9 2.427
#9 2.427 Caused by:
#9 2.427   failed to verify the checksum of `surrealdb v1.0.0-beta.9+20230402`
------
executor failed running [/bin/sh -c cargo build --target ${CARGO_BUILD_TARGET}  && cp ./target/$CARGO_BUILD_TARGET/$RUST_RELEASE_MODE/salesbackend /app/salesbackend;]: exit code: 101
ERROR: Service 'SalesBackend' failed to build : Build failed

I hope this helps.

gzz2000 commented 1 year ago

I think I found the reason. I have exactly the same SQL error: database is locked leading to the Cargo error failed to verify the checksum. I also noticed that it only happens when two or more downloads are started nearly simultaneously, which is the case when the current project depends on two or more crates hosted by the same Alexandrie instance.

Cargo downloads all missing dependencies concurrently when invoking cargo publish or cargo build. If adding the dependencies one by one (running cargo build in between) does not trigger this, but adding two of them at once before a cargo build does, we can be fairly certain that concurrency is the issue.

Maybe it is a failure related to database concurrency, since each download updates the download count in the database, and here the counters for both crates are updated at nearly the same time.

I have checked that the checksums are correct in the index (same as running openssl sha256 on the uploaded .crate).
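
As a minimal, self-contained reproduction of the suspected failure mode (using the rusqlite crate and a made-up table purely for illustration; this is not how Alexandrie itself talks to its database), two connections writing to the same SQLite file at the same time are enough to produce "database is locked":

// Demo only: concurrent writers on a single SQLite file hit SQLITE_BUSY,
// which SQLite reports as "database is locked".
// Assumed dependency: rusqlite = { version = "0.31", features = ["bundled"] }
use rusqlite::Connection;
use std::time::Duration;

fn main() -> rusqlite::Result<()> {
    let path = "locking-demo.db";
    let writer1 = Connection::open(path)?;
    let writer2 = Connection::open(path)?;
    writer2.busy_timeout(Duration::ZERO)?; // fail immediately instead of retrying

    // Made-up schema standing in for a per-crate download counter.
    writer1.execute_batch(
        "CREATE TABLE IF NOT EXISTS crates (name TEXT PRIMARY KEY, downloads INTEGER NOT NULL);
         INSERT OR IGNORE INTO crates VALUES ('A', 0), ('B', 0);",
    )?;

    // First download request: opens a write transaction and holds the lock,
    // as an in-flight request handler would.
    writer1.execute_batch("BEGIN IMMEDIATE;")?;
    writer1.execute("UPDATE crates SET downloads = downloads + 1 WHERE name = 'A'", ())?;

    // Second, concurrent download request: its write is refused while the
    // first transaction is still open.
    let second = writer2.execute("UPDATE crates SET downloads = downloads + 1 WHERE name = 'B'", ());
    println!("second writer: {second:?}"); // Err(..) mentioning "database is locked"

    writer1.execute_batch("COMMIT;")?;
    Ok(())
}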

gzz2000 commented 1 year ago

Setting max_conn = 1 solves the problem for me.

Should this be made the default when using SQLite as the backend?

hpsjakob commented 1 year ago

I've configured max_conn = 1 but still see the problem.

[database]
url = "appdata/sqlite/alexandrie.db"
max_conn = 1

The error from the log file is:

Oct 19 11:01:42.629 INFO --> GET /api/v1/crates/XXX/1.0.0/download 200 0ms, SQL error: database is locked

I'm using the latest version as of today.

hpsjakob commented 1 year ago

I have now solved the issue for myself by switching to the PostgreSQL database backend.

bitbrain-za commented 11 months ago

Just experienced the same issue when building in a container. I also got things working by following the advice above.