backstage / backstage

Backstage is an open framework for building developer portals
https://backstage.io/
Apache License 2.0

🐛 Bug Report: zlib: unexpected end of file #18002

Closed awanlin closed 12 months ago

awanlin commented 1 year ago

📜 Description

We've started to see "zlib: unexpected end of file" errors in our logs. At first it was just in a single place, but now we are seeing it in another. Both come from places where we use the UrlReader: one with Bitbucket and the other with GitHub. We can't reproduce it locally, but it happens in all of our deployed environments.

The error stack:

zlib: unexpected end of file, ZlibError: zlib: unexpected end of file
    at Unzip.write (/app/node_modules/minizlib/index.js:154:22)
    at Unzip.flush (/app/node_modules/minizlib/index.js:105:10)
    at Unzip.end (/app/node_modules/minizlib/index.js:111:10)
    at Unpack.end (/app/node_modules/tar/lib/parse.js:502:21)
    at endFn (node:internal/streams/pipeline:425:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)

Others have reported this on Discord as well: https://discord.com/channels/687207715902193673/1108389526440771624
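
For reference, the failing code paths all boil down to a readTree call. Here is a minimal sketch of the kind of usage that triggers it (not our actual plugin code; the URL and wiring are placeholders), assuming the UrlReaders factory from @backstage/backend-common:

```ts
import { UrlReaders, getVoidLogger } from '@backstage/backend-common';
import { ConfigReader } from '@backstage/config';

async function readRepo() {
  // Placeholder wiring; in the real plugins the reader comes from the backend environment.
  const reader = UrlReaders.default({
    logger: getVoidLogger(),
    config: new ConfigReader({}),
  });

  // Placeholder URL; in practice this points at GitHub or Bitbucket.
  const response = await reader.readTree(
    'https://github.com/some-org/some-repo/tree/main/',
  );

  // readTree downloads the repo as a gzipped tarball and unpacks it internally;
  // this is the step that fails with "zlib: unexpected end of file".
  for (const file of await response.files()) {
    console.log(file.path, (await file.content()).length);
  }
}

readRepo().catch(err => console.error(err));
```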

👍 Expected behavior

There should be no errors and the UrlReader should work as expected

👎 Actual Behavior with Screenshots

We are getting this error:

zlib: unexpected end of file, ZlibError: zlib: unexpected end of file
    at Unzip.write (/app/node_modules/minizlib/index.js:154:22)
    at Unzip.flush (/app/node_modules/minizlib/index.js:105:10)
    at Unzip.end (/app/node_modules/minizlib/index.js:111:10)
    at Unpack.end (/app/node_modules/tar/lib/parse.js:502:21)
    at endFn (node:internal/streams/pipeline:425:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)

👟 Reproduction steps

Honestly, this isn't one that we are able to reproduce. We've just seen it in our logs, and working back from there, they all come from using the UrlReader.

📃 Provide the context for the Bug.

This error is causing issues with a few internal plugins that require pulling in files from various SCMs using the UrlReader, which was working fine before.

🖥️ Your Environment

I can't get all the details, but this started for us somewhere between 1.14.0-next.2 and 1.14.0.

👀 Have you spent some time to check if this bug has been raised before?

🏢 Have you read the Code of Conduct?

Are you willing to submit PR?

No, but I'm happy to collaborate on a PR with someone else

benjdlambert commented 1 year ago

Hey @awanlin, hmm, looks like we're going to need to find some more info here. I doubt that it would be something special that we've done, but happy to be disproved there! I wonder if this is an auto-bump of a dependency, either in the main repo, which has then propagated to others, or it's been bumped by other forms of dependency management.

The reason I say that we might not have had control over this is because I don't think we explicitly use any zlib stuff directly, only through dependencies.

It would be good to get some more information as to what these URLs are, and see if we can create a test case that replicates it more often than not.

Just looking again at the error message, it looks like the offending libraries could be either tar or minizlib. Can you check the git history of the yarn.lock for these packages and see if any of them changed over the period of the releases you mentioned?
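
If it helps, something like this (a rough, untested sketch using only Node built-ins, not part of Backstage) would print the versions a given yarn.lock resolves for the suspect packages, so you can compare the lockfile across releases without reading it by hand:

```ts
import { readFileSync } from 'node:fs';

// Scan a yarn.lock (path given as the first CLI argument, defaulting to ./yarn.lock)
// and print the resolved versions of the packages implicated in the stack trace.
const lockfile = readFileSync(process.argv[2] ?? 'yarn.lock', 'utf8');

for (const name of ['tar', 'minizlib', 'minipass']) {
  // yarn.lock v1 entries look like:
  //   tar@^6.1.11, tar@^6.1.2:
  //     version "6.1.15"
  const pattern = new RegExp(`^"?${name}@[^\\n]*:\\n  version "([^"]+)"`, 'gm');
  const versions = new Set<string>();
  for (const match of lockfile.matchAll(pattern)) {
    versions.add(match[1]);
  }
  console.log(`${name}: ${[...versions].join(', ') || 'not found'}`);
}
```

Running it against the lockfile checked out at each release tag should show whether any of them moved.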

awanlin commented 1 year ago

Yeah @benjdlambert, that's where my thinking has gone on this issue, as the stack itself is deep inside other libraries and not any code that I recognize as coming from Backstage itself.

I'll do some digging into the yarn.lock as suggested and get back to you

awanlin commented 1 year ago

Just a bit of data so far:

Package versions by release:

Package     1.14.0-next.0   1.14.0-next.1   1.14.0-next.2   1.14.0
tar         6.1.13          6.1.14          6.1.15          6.1.15
minizlib    2.1.2           2.1.2           2.1.2           2.1.2

Release dates:

Release         Date
1.14.0-next.0   April 25, 2023
1.14.0-next.1   May 2, 2023
1.14.0-next.2   May 10, 2023
1.14.0          May 16, 2023

awanlin commented 1 year ago

So, no changes to minizlib over the period where we think the issue started, but tar does have changes.

Here is the comparison from 6.1.13 to 6.1.14 - notice there are changes around minipass: https://github.com/isaacs/node-tar/compare/v6.1.13...v6.1.14

And here is the comparison from 6.1.14 to 6.1.15 - nothing jumps out at me: https://github.com/isaacs/node-tar/compare/v6.1.14...v6.1.15

awanlin commented 1 year ago

I'm going to try something here, a bit of a YOLO move: pinning tar to 6.1.13, which also brings minipass down from 5.0.0 to 4.2.8. Will report back on the results.

awanlin commented 1 year ago

Sadly, this did not work; I got the same result: zlib: unexpected end of file

benjdlambert commented 1 year ago

Hmm :( that's unfortunate. The other thing I can think of is some change in backend-common, so maybe you could try pinning the version of backend-common to something before the 1.14.0-next.1 release? Hopefully the types are compatible 🤞

benjdlambert commented 1 year ago

Or rather, some change in backend-common, whether that be dependency bumps or the code itself, that's caused an issue.

skyem commented 1 year ago

The template repo that I'm having the issue with is a base Terraform repo with 9 files. It happens intermittently. Today I tried making a test scaffolder template with a duplicate test template repo. I found that when I limited it to 3 files (2 tfvars that are being changed, and an .md), it worked 100% of the time. Adding any files back caused it to fail again. It seems to be affected by load.

skyem commented 1 year ago

However, I was unable to reproduce it when running a local instance using the same templates and repos. I was hoping to be able to debug. It is much more performant locally, so if this is load related, that would fit.

ibrahimnasri24 commented 1 year ago

I am facing the same problem when running locally. I cannot find what causes it and can't always reproduce it. To solve it I just start the template over once or twice until the error is gone.

antoniomuso commented 1 year ago

I have the same error

gtema commented 1 year ago

"me too". The funny thing is that locally (in dev mode) it works fine (from time to time same template fails with rate of maybe 5%), in the container - fine, deploying container into the K8 and the failure rate is 100% (I am still trying to figure out what could be the reason for this difference, but looking to the initial issue message "We can't reproduce it locally but it happens in all of our deployed environments." it seems to fit)

benjdlambert commented 1 year ago

Can someone run the docker container locally and see if it still breaks? Is it the version of node inside the container, or some dependency? Or is it something to do with networking on the cluster side?

Any steps to reproduce so we can get this happening locally would be ideal for us to dig in a little more.

gtema commented 1 year ago

As I wrote, I tested this already (though this surely may not be very representative):

To mention: I also tried to download, inside the container, the same archive that the UrlReader is fetching, to exclude networking issues - no problems, and tar is absolutely happy with it. Additionally, inside the container I see that part of the content was unpacked (in /tmp there is a folder ***_step-fetch-base-*** with, in my case, 17 out of 27 files present). It is not dropped afterwards and stays until pod restart. I also tried mounting /tmp explicitly as a volume, with no change.
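
To take Backstage out of the picture entirely, here is roughly that experiment as a standalone script: stream the archive straight through node-tar the way the UrlReader does, instead of downloading it to disk first. This is only a sketch; it assumes node-fetch v2 and the tar package are available (both appear to be present in a backend image), and the codeload URL is a placeholder for whatever archive the UrlReader actually fetches:

```ts
import fetch from 'node-fetch';
import * as tar from 'tar';
import { pipeline } from 'node:stream/promises';
import { mkdtempSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

async function extractArchive(url: string) {
  const dest = mkdtempSync(join(tmpdir(), 'zlib-test-'));

  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch ${url}: ${res.status}`);
  }

  // Pipe the gzipped tarball directly into node-tar, mirroring how the
  // UrlReader unpacks it. A stream that ends prematurely should fail here
  // with "zlib: unexpected end of file".
  await pipeline(res.body, tar.x({ cwd: dest }));

  console.log(`Extracted to ${dest}`);
}

// Placeholder URL; substitute the archive the UrlReader fetches for your repo.
extractArchive(
  'https://codeload.github.com/some-org/some-repo/tar.gz/refs/heads/main',
).catch(err => console.error(err));
```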

gtema commented 1 year ago

OK, I actually got the failure with Docker (Podman) locally with a fresh create-app setup:

https://github.com/gtema/backstage-zip-test

Built the image according to https://backstage.io/docs/deployment/docker (I modified the Dockerfile there to use Node 18):

yarn install --frozen-lockfile
yarn tsc
yarn build:backend --config app-config.yaml
podman build . -t backstage

started container with:

podman run -it -p 7007:7007 backstage

In the browser (localhost:7007), I went to Create > Demo > run, and got the error during the single step of the template (https://github.com/gtema/backstage-template/blob/main/demo/template.yaml) which fetches content from a subfolder.

gtema commented 1 year ago

Update: I tried all possible Node 18 base images - all fail, but once I switched back to node:16-bullseye-slim the error was gone.

It's still very interesting that the version of Node (18) in the container and locally is the same, so apparently there is something with the Node 18 base image, and not Node 18 itself.

awanlin commented 1 year ago

This is great work @gtema, thanks for taking the time to look into this more. We have been using node:18-bullseye-slim since February, and the earliest this error started for us is late April. Wondering if there might be changes during that period that we should look at?

gtema commented 1 year ago

At the moment I have no further ideas on where to look.

gtema commented 1 year ago

But a very brief look at the image layers shows that 18-slim has debian/tar 1.34+dfsg-1, while 16-slim has debian/tar 1.30+dfsg-6. Maybe the error is really coming from there already.

skyem commented 1 year ago

Very interesting, I will look more into this today.

skyem commented 1 year ago

I just saw this issue occur in a new place - during a techdocs docs build within the app.

info: Step 1 of 3: Preparing docs for entity component:default/my-thing {"timestamp":"2023-06-09T14:55:31.780Z"}
debug: Reading files from https://github.com/<ORG>/<REPO>/tree/main/ {"timestamp":"2023-06-09T14:55:31.814Z"}
error: Failed to build the docs page: zlib: unexpected end of file {"timestamp":"2023-06-09T14:55:34.732Z"}

skyem commented 1 year ago

On a deeper dive, I see that repo has no mkdocs.yml or docs directory. So this may be unrelated.

gtema commented 1 year ago

It is actually quite logical that it is related, since TechDocs would also try to fetch the whole repo as an archive and extract the required part from it (as far as my experiments go).

skyem commented 1 year ago

I've validated that changing my deployment's Node from 18.12.0 to 16.20.0 (meaning changing the base image from node:lts-bullseye-slim to node:16-bullseye-slim) does fix the issue.

benjdlambert commented 1 year ago

OK, so I just did some digging around, and it looks like this issue has been brought to the Node team, but it's actually zlib that's causing it: the stream is malformed in the first place, and zlib is now a little more strict about how that's handled.

https://github.com/nodejs/node/issues/46359 https://github.com/madler/zlib/issues/773

I wonder if this is something that we need to bring up with GitHub instead, as it looks like their zlib compression is invalid?
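
To illustrate the class of error (this only shows what the message means, not what GitHub is actually sending), Node's zlib refuses to finish a gzip stream that ends prematurely, which you can see with a deliberately truncated buffer:

```ts
import * as zlib from 'node:zlib';

// Build a valid gzip buffer, then chop off the 8-byte CRC32/size trailer to
// simulate a stream that ends before zlib considers it complete.
const valid = zlib.gzipSync(Buffer.from('hello backstage'));
const truncated = valid.subarray(0, valid.length - 8);

try {
  zlib.gunzipSync(truncated);
} catch (err) {
  // Typically prints "unexpected end of file"; minizlib/tar surface the same
  // message prefixed with "zlib:" in the stack traces above.
  console.error((err as Error).message);
}
```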

benjdlambert commented 1 year ago

I would assume that the error still persists in Node v20 too, right? Can someone confirm?

jsundquist commented 1 year ago

We just had a user report the above error to our Backstage team. Looking back through our logs, it appears as though the error first showed up as far back as April 21. We are still investigating on our end to see what changes were deployed in or around this time.

frouwen commented 12 months ago

Had this same error for some, but not all, TechDocs entries. Tried @skyem's solution and it worked for a bit, but stopped working quickly for no apparent reason. Just updated from Backstage version 1.14.2 to 1.15.0 and the issue is fixed for me (using Node v18.16.0). Keep up the good work!

Edit: Cheered too soon. A doc that had this error now works again, but another one broke due to this issue.

apc-kamezaki commented 12 months ago

Hi, I got the same issue.

Backstage version is 1.15.0 (updated today). It occurred on Node.js 18.16.0/18.16.1, but works well on Node.js 18.13.0.

willhallonline commented 11 months ago

I had a similar issue building local TechDocs on Node.js v18.16.0 and v18.16.1; reverting back to v16 (v16.20.1) appears to have resolved it. I would prefer not to be on older versions of Node; however, I can't see an easy way around it currently.

gtema commented 11 months ago

I was also forced to go back to Node 16. The latest Backstage version with node:18-bullseye-slim is not working, while with node:16-bullseye-slim it works.

benjdlambert commented 11 months ago

This should be fixed in the latest -next release if you want to try this? @gtema @willhallonline

Boggimedes commented 10 months ago

For anyone else running across this issue: mine was caused by excessive logging during the build process. I had left debug logs that displayed the entire prop object. Removing the excess logging resolved the issue.

fredrondina commented 8 months ago

For anyone still getting this, bumping to Node 18.18.1 fixed it for me.

tuananh commented 6 months ago

We updated Backstage to the latest release version (as of this writing) and the error is gone; we didn't pursue it further.

Our instance was quite old (the last update was in April).

gritzkoo commented 4 months ago

I got the same issue in my Backstage too, but to solve the problem I created an empty repository with only a README.md file in it, moved all the files from the selected template repo to the new one, and this issue never showed up again. In my original template repo, I removed the original .git folder and started all over again with git init, then added all the files and pushed the code with --force, which got the Scaffolder plugin fetching the repository again.