clearcontainers / tests

Clear Container Tests Repository

Enable logs to be stored for successful CI builds #944

Open amshinde opened 6 years ago

amshinde commented 6 years ago

Currently we can only retrieve the logs from Jenkins when a build has failed. We should be able to retrieve them for successful builds as well, to be able to inspect whether we are running with the correct environment.

amshinde commented 6 years ago

@chavafg Can you take a look at this?

jodh-intel commented 6 years ago

I'm guessing this should really have been raised on https://github.com/clearcontainers/jenkins.

/cc @grahamwhaley as this might have implications for the metrics system storage requirements.

grahamwhaley commented 6 years ago

We should probably discuss and define which logs we gather, and how much debug they contain. If we take all the system logs with all the CC debug enabled in the toml, for instance, then the logs come out pretty big (hundreds of KB, iirc), which we may not want to gather and store for every run. If we know in advance what info we want, then we could run some commands at startup, such as cc-runtime cc-env, docker info and @jodh-intel's magic system info collection script. We could even run all of those, gather the output into a file, and add that file to the stored 'results archive' in Jenkins, which would help reduce pollution in the console output screen/log.
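
For illustration, a minimal sketch of what such a startup step could look like, assuming a hypothetical output file name and that the commands mentioned above are installed and on PATH:

```bash
#!/bin/bash
# Sketch only: gather environment details into one file at job startup
# and archive that file, instead of dumping everything to the console
# log. The file name "environment-info.txt" is an assumption.
set -e

out="environment-info.txt"

{
    echo "=== cc-runtime cc-env ==="
    cc-runtime cc-env

    echo "=== docker info ==="
    docker info

    echo "=== cc-collect-data.sh ==="
    sudo cc-collect-data.sh
} > "$out" 2>&1

# Compress before archiving; Jenkins can then pick it up as an artifact.
gzip -9 "$out"
```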

@chavafg I think it was recently pointed out that the metrics CI logs were already pretty big; I should check that, as it is not intentional.

jodh-intel commented 6 years ago

For reference, that magic script is https://github.com/clearcontainers/runtime/blob/master/data/collect-data.sh.in.

@amshinde - can you give a concrete example of where retaining logs would have helped? I'm not disagreeing that it's a good idea, but it would be good to explore whether there are other ways to give you what you want.

How long do we think we'll need to store logs? "Forever" probably won't cut it, so would a month (4 releases) be sufficient, do you think?

But as @grahamwhaley is suggesting, I'm not sure we need to keep the logs at all, as long as we know the environment the tests ran in, so that a test run can be recreated.

The collect-data.sh script captures almost everything we need here. The package set is the only missing item (although the script does already capture the versions of any CC packages installed on the system).

For reference, the output of the collect script when gzip -9'd is ~6k (for a system without any CC errors in the journal).

If we decide to store full logs for all PRs, we'll need something in place to warn about the ENOSPC that is almost guaranteed to happen one day... :smile:
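
As an illustration of that warning, a minimal teardown-time check along these lines could flag the problem early; the partition path and threshold are placeholders:

```bash
# Sketch only: warn before the artifact partition actually hits ENOSPC.
# Both the path and the threshold are assumptions.
path="/var/lib/jenkins"
threshold=90

# Percentage of the filesystem in use, digits only.
used=$(df --output=pcent "$path" | tail -n1 | tr -dc '0-9')

if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: ${path} is ${used}% full; stored CI logs will soon hit ENOSPC" >&2
fi
```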

jodh-intel commented 6 years ago

Oh - we might also want to include procenv output (see https://github.com/clearcontainers/jenkins/issues/5) for things like system limits, etc.

grahamwhaley commented 6 years ago

Agree on logs and longevity. I'm going to presume Jenkins has some plugin or setting that can manage and expire the gathered results files, and we should indeed look at that (we do collect the .csv results files for the metrics at present, for instance, but do not expire them).

grahamwhaley commented 6 years ago

procenv was the magic I was thinking of :-)

jodh-intel commented 6 years ago

Ah - soz - so much magic about! ;)

chavafg commented 6 years ago

I think @amshinde's concern is knowing the agent version: at some point last week we had a wrong version testing the latest PRs. As for keeping the logs, I can add a rule to gather them in the Azure Jenkins configuration; that way the metrics Jenkins will not be affected. But the Azure Jenkins server may also run into storage issues in the future if we keep growing the logs we store on every run. As @jodh-intel and @grahamwhaley said, it would be better to gather just the information we require instead of keeping all the logs from the execution.

jodh-intel commented 6 years ago

@chavafg - we could just run cc-collect-data.sh in the teardown script, couldn't we? That way we get the info we want, and we also ensure that script is being run regularly. If we need the complete list of packages, it would be easy to add an extra --all-packages option or similar.
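
A minimal sketch of such a teardown step, with the caveat that --all-packages is only a proposal at this point and does not exist:

```bash
# Sketch only: run the collect script from the CI teardown and keep its
# gzipped output as a build artifact. The "--all-packages" option is
# hypothetical - it is only being proposed above.
sudo cc-collect-data.sh --all-packages > collect-data.log 2>&1 || true

# The gzipped output is small (~6k per run, per the numbers above).
gzip -9 collect-data.log
```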

chavafg commented 6 years ago

@jodh-intel yes, I think that would be best. Does cc-collect-data.sh collect the agent version? I have seen that it appears as unknown:

```toml
[Agent]
  Type = "hyperstart"
  Version = "<<unknown>>"
```

jodh-intel commented 6 years ago

@chavafg - good point! No, it doesn't.

I've had a think about this, and I can see two ways we could do it:

The gross hack

We could capture the agent version by adding something like a "--full" option to the cc-collect-data.sh script. That option would run the script as normal, but would then also loop-mount the container image and query the agent binary for its version.

But it's a hack ;)

The slightly-less gross option

Change the runtime so that it loop-mounts the currently configured container image read-only (with mount -oro,noatime,noload - thanks @grahamwhaley), then runs cc-agent --version and grabs the output.
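
A minimal sketch of how that could work from a shell, assuming the image path and the agent binary's location inside the image (both of which would really need to come from the runtime's configuration):

```bash
# Sketch only: the image path and the agent's location inside the image
# are assumptions.
img="/usr/share/clear-containers/clear-containers.img"
mnt=$(mktemp -d)

# Loop-mount the image read-only, as suggested above.
sudo mount -oro,noatime,noload "$img" "$mnt"

# Run the agent binary from the mounted image to capture its version.
"$mnt/usr/bin/cc-agent" --version

sudo umount "$mnt"
rmdir "$mnt"
```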

That seems like the best option, but wdyt @grahamwhaley, @sboeuf, @sameo?

grahamwhaley commented 6 years ago

Very recently I had also considered that we could loop-mount the .img file and run the agent on the host with --version to extract that info. Either we can do that in the collect script or have the runtime do it. Doing it in the runtime feels a little skanky, but I guess then we could in theory add the info into cc-env.

jodh-intel commented 6 years ago

I was having similar feelings about having that sort of code in the runtime too. That said, we do sort of have precedent if you look at cc-check.go, which calls modinfo(8).

I'm happy for us to have this purely in the collect script but, yes, if it doesn't go in the runtime, we need to remove the Agent.Version field that @chavafg highlighted, as currently it's static.

amshinde commented 6 years ago

@chavafg @jodh-intel @grahamwhaley Gathering the agent version was one of the requirements I had in mind, as we were running with a wrong agent last week. What I really wanted to look at were the CRIO logs: to check the lifecycle events in the log, and to see that the container storage driver passed in is actually the one being used by crio. For successful builds, one is typically interested in the logs just after the build, so I am ok with keeping these around for a week, or even just a couple of days.
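
For illustration, a minimal sketch of capturing those logs in the teardown, assuming CRI-O runs as a systemd unit named crio and uses the default config path:

```bash
# Sketch only: keep the CRI-O logs for the current boot so lifecycle
# events can be inspected later. Unit name and config path are assumptions.
sudo journalctl -u crio -b --no-pager > crio.log

# Also record the configured storage driver, if set in the config file.
grep -i storage_driver /etc/crio/crio.conf || true
```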

grahamwhaley commented 6 years ago

By the way, it looks like the Jenkins 'discard old builds' option may also give us the ability to specify how long to keep artifacts.
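
For reference, in a declarative Jenkinsfile the equivalent setting would look something like this; the retention periods are placeholders, roughly matching the numbers discussed above:

```groovy
// Sketch only: goes inside a pipeline { } block. Keep build records for
// about a month, but expire stored artifacts (logs) after a week.
options {
    buildDiscarder(logRotator(daysToKeepStr: '30',
                              artifactDaysToKeepStr: '7'))
}
```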