eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Stacks with custom images hang on wsagent install. #2732

Closed doreplado closed 7 years ago

doreplado commented 8 years ago

When creating a custom stack using the compose format, if the dev-machine image isn't one of the provided codenvy images (e.g. codenvy/node), then the other layers of the stack hang on the wsagent install during the apt-get update / install (Debian in this case).

The hang occurs when it gets to "Processing triggers for libc-bin" in the update / install steps. After this, the install times out and the stack fails to come up.

Reproduction Steps:

  1. Create a stack with che-issue.txt.
  2. Try to bring the stack up.

Expected behavior:

The dev-machine, db and redis containers come up and successfully get the agent installed and you're able to start coding in the stack with terminal access to all 3 containers.

Observed behavior:

The stack start-up times out when running the ws-agent install on redis as it does the apt-get install. I believe it might be related to this issue

che-stack-issue

Che version: Nightly / 5.0.0-M5
OS and version: Ubuntu 16.04 (4.4.0-38-generic x86_64)
Docker version: Docker version 1.12.1, build 23cf638
Che install: Docker container via Instructions here.

Additional information:

Note: on line 44 of che-issue.txt, if you change the dev-machine image from node:4.6 to codenvy/node, the redis stack comes up fine. I've tried various other node images, both hand-rolled and community images from hub.docker.com. I've only had success with a codenvy image as the dev-image.

TylerJewell commented 8 years ago

Hi @doreplado - is it possible that you are running into this issue? https://github.com/eclipse/che/issues/2601. If your custom image is based upon debian, it looks like we are running into some problems.

doreplado commented 8 years ago

@TylerJewell It's possible. I'll try to reproduce it that way. The fact that I can change a single image name in the stack config and it works tells me it's not the identical issue, but perhaps it's related.

doreplado commented 8 years ago

@TylerJewell I'm fairly certain this isn't the same issue and seems to be specific to agent install and image being used. I wasn't able to reproduce the default project info referenced in #2601. Everything seems to work ok (besides the reported behavior) with the template I've attached to the ticket.

TylerJewell commented 8 years ago

@vkuznyetsov @tolusha @jamesdrummond - can you assign for the team to investigate please? If this is a bug, then let's try to figure that out so we can determine severity.

@doreplado - do you have a reasonable workaround to keep you moving forward temporarily? Our compose syntax was introduced in 5.0-M1, so we are working through some of the edge cases.

doreplado commented 8 years ago

@TylerJewell - yeah this isn't a show-stopper for us. I can use the codenvy/node image and it works fine. So we can POC with this. We didn't have an expectation of GA readiness of your Nightly builds :-D

tolusha commented 8 years ago

@TylerJewell If it is an OS issue, what solution are we supposed to have?

TylerJewell commented 8 years ago

@tolusha - I don't know if it is an OS issue or not. I do not understand what is causing the problems here. So I would like to make sure that we have someone in support or dev assigned to investigate + explain the root cause + propose any possible solutions.

doreplado commented 8 years ago

@tolusha @TylerJewell if it ends up solely being an external factor, I'm happy to open issues with other projects (debian / docker / etc).

I don't believe it's "solely" an OS issue since it works fine if I use a codenvy docker image for the dev-machine, but not with other community images.

And what is unique here is that with the provided config the container that hangs isn't the dev-image, but the redis container, and only if you don't use a codenvy image for the dev-machine.

If it is an OS / Docker host issue, I suspect it'll be related to some shared resource where process namespace isolation doesn't matter.

If it will help, I can get a system call dump on a working and a failing scenario with sysdig and attach.

ddementieva commented 8 years ago

@doreplado I've tried your custom stack and it just works for me (screenshot).

I tried it in Eclipse Che with fairly good Internet connection.

Can you reproduce this problem consistently? It seems it could be caused by connectivity problems, or by some repositories being unavailable when you attempted to start a workspace.

When an agent is added, it downloads dependencies such as curl, which has its own dependencies.

doreplado commented 8 years ago

@ddementieva thanks for taking a look.

It is consistently reproducible. And to clarify, it's not the download portion of the agent install it hangs on, but after the download when the packages are being processed.

However, I understand that if you're unable to reproduce, there's not a lot you can do. I'll see if perhaps it's the hypervisor solution I'm using (Ovirt.org) and try to run the same stack on a bare-metal system.
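For anyone trying to narrow down where a step like this stalls, one option is to run the suspect command under a timeout and record the last line it printed before being killed. The following is only a minimal sketch: `last_line_before_timeout` is a hypothetical helper, not part of Che or its agents, and the command you pass in is up to you.

```python
import subprocess


def last_line_before_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, last non-empty output line or None).

    Hypothetical diagnostic helper, not part of Che. stderr is merged into
    stdout so the last line reflects whatever the command printed last.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        output, timed_out = done.stdout, False
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; exc.stdout holds whatever was
        # captured so far (it may be None if nothing was printed).
        output, timed_out = exc.stdout, True

    if output is None:
        output = b""
    if isinstance(output, bytes):
        output = output.decode("utf-8", errors="replace")

    lines = [line for line in output.splitlines() if line.strip()]
    return timed_out, (lines[-1] if lines else None)
```

Calling it with the install command and a generous timeout would report whether the command was killed and which line it reached. Note that a child process which buffers its output may not have flushed its final lines when it is killed.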

TylerJewell commented 8 years ago

@doreplado - there may be another workaround that I can suggest, but it may require some additional effort on your part. The agents handle both installation and startup. The agents in their ZIP packaging have intelligence to download any libraries that are missing. The idea for doing this is so that we can have agents work with any base OS off the shelf.

However, it's a lot of unnecessary work to download software every time you want to start a workspace. So if you look at many of the base images that we reference in the stacks provided by Che (they are in github.com/eclipse/che-dockerfiles), we include all of the software that our agents require manually installed into the image. When the agents start off of our base images, they detect the needed software as already present, and instead, just start themselves.

You could repeat the exercise - and maybe this would solve the particular issue you have. Though I'd prefer that we learn about what your core underlying issue is, because if there is a way for us to adapt the agents to avoid any hangups, I'd like to do so.
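The "detect what is already present, otherwise install" behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the agents' actual code: `REQUIRED_TOOLS`, `missing_tools` and `plan_agent_start` are hypothetical names, and curl is the only dependency this thread actually names.

```python
import shutil

# Hypothetical dependency list; curl is the only tool named in this thread.
REQUIRED_TOOLS = ["curl"]


def missing_tools(required, lookup=shutil.which):
    """Return the tools from `required` that are not found on PATH."""
    return [tool for tool in required if lookup(tool) is None]


def plan_agent_start(required, lookup=shutil.which):
    """Decide whether an install step is needed before starting.

    Returns ("start", []) when every tool is already present, which is the
    case for a base image with the software baked in, and
    ("install", [...missing tools...]) otherwise.
    """
    missing = missing_tools(required, lookup)
    if missing:
        return ("install", missing)
    return ("start", [])
```

With an image that already ships the tools, `plan_agent_start(REQUIRED_TOOLS)` would skip straight to "start", so the package-manager step that hangs in this issue would never run.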

ghost commented 8 years ago

@doreplado I'll give it a shot on a Digital Ocean VM (CentOS). @ddementieva tried it locally on Ubuntu and it worked for her. So, yeah, it's definitely something with where the container runs.

doreplado commented 8 years ago

@TylerJewell Ty for the suggestion. @eivantsov Ty for checking on DO

After the pull of the latest nightly (M6) this is only occurring intermittently.

@TylerJewell et al, I did have a question about the wsagent. If it's better suited for another ticket, I can do that. Basically, I just wanted to put out the position that many people use docker to provide dev / prod parity from an environment perspective.

I definitely understand the advantages of leveraging the package system per distro-type to get dependencies. I'm sure it helps simplify compatibility greatly. However, you're also changing the runtime environment significantly to get the agent installed (and thus breaking that dev / prod parity).

These are just observations, not criticism. I have nothing but love for the Che project!

I do have a thought, however. What would you think of creating an image with a chrooted install of the agent, with all its dependencies statically built? Then, if you're using non-codenvy images, you can do a "volumes-from" in your stack to include what's needed for the agent.

You should only have to account for 64bit platforms since docker doesn't support 32bit kernels.

Just a suggestion for a solution to go along with my outlined disadvantage of using the package management systems.

I'll continue testing and report back what I find. All of your help and direction is much appreciated :)

TylerJewell commented 8 years ago

Well intermittently working is slightly better than not working at all. So happy to see the progress there!

On your suggestion, we have been thinking about volumes-from for quite a while. It has its advantages and disadvantages that we are still sorting through. But yes, it is one way to get the files into a workspace. Another way is that Che could act as a file server that lets you curl the files needed into each workspace. But that also has limitations. Rsync could also be used.

To be explicit though, volumes-from only resolves how to get files into the workspace. We still have to manage within the workspace a smart and efficient lifecycle of start and stop. And the agents would still have to have OS specific statements for how to run the code in each OS.
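As a rough illustration of what "OS specific statements" can look like, the sketch below picks a package-manager command from the `ID` field of `/etc/os-release`. It is an assumption-laden example, not how the Che agents are implemented: `parse_os_release`, `INSTALL_COMMANDS` and `install_command` are hypothetical names, and the table covers only a few common distributions.

```python
def parse_os_release(text):
    """Parse KEY=value lines in the style of /etc/os-release into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


# Hypothetical lookup table: distribution ID -> install command prefix.
INSTALL_COMMANDS = {
    "debian": ["apt-get", "install", "-y"],
    "ubuntu": ["apt-get", "install", "-y"],
    "centos": ["yum", "install", "-y"],
    "fedora": ["dnf", "install", "-y"],
    "alpine": ["apk", "add"],
}


def install_command(os_release_text, packages):
    """Build the install command for the distribution named in os-release."""
    os_id = parse_os_release(os_release_text).get("ID", "")
    base = INSTALL_COMMANDS.get(os_id)
    if base is None:
        raise ValueError("no install command known for OS id %r" % os_id)
    return base + list(packages)
```

Every new base OS needs another entry in a table like this, which is the maintenance cost that remains regardless of how the agent files get into the workspace.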

TylerJewell commented 7 years ago

@doreplado - any update please?

doreplado commented 7 years ago

I updated to the latest nightly and am attempting to reproduce again. I'm having difficulty understanding how to actually edit a stack. There seem to be dialog boxes, but I can't actually edit things like the image etc. I'm sure I'll figure it out shortly and report back.

TylerJewell commented 7 years ago

We are working on the docs now for 5.0. So we probably have some gaps in the explanation and in the nightly release that you are doing. So you have to have a bit of a sense of adventure right now until we get the docs fully updated. We hope to have that done by the end of next week, and we will also be packaging up the docs to be hosted within Che itself under the /docs URL.

TylerJewell commented 7 years ago

@doreplado - all of our docs have been updated and there are some nice notes on adding / editing / removing stacks now. Can you retest? Otherwise we will close and assume that we got it all squared away.

ghost commented 7 years ago

I am closing the issue. Feel free to reopen if the issue persists.