fix(Jenkinsfile): Fix Jenkins pipeline for CI

MrKevinWeiss commented 4 years ago

Contribution Description

It appears that the current stash and unstash steps have some issues with asynchronous behaviour. For example, an unstash can occur on a node in a working directory that is different from the one expected when running a test. It also appears that some directories are not being cleaned.

The following PR makes a number of changes to fix that:

Make the Jenkinsfile declarative, as this is better supported
Run all tests on a node before releasing it, to fix any shared workspace problems
Clean up function names and steps to make them more readable
Add timeouts: 1 hour for the overall process and 45 minutes per node
Handle errors so that a failed unstash only stops that node
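
The changes listed above could be sketched roughly as follows. This is a minimal illustration, not the Jenkinsfile from this PR: the stage names, the 'example-board' label, the 'sources' stash name and the make targets are all assumptions made up for the example.

```groovy
// Illustrative declarative pipeline; labels, stash name and make targets are hypothetical.
pipeline {
    agent none
    options {
        // Overall limit for the whole job
        timeout(time: 1, unit: 'HOURS')
    }
    stages {
        stage('setup') {
            agent any
            steps {
                checkout scm
                // Save the checked-out sources so test nodes can fetch them
                stash name: 'sources'
            }
        }
        stage('test on node') {
            agent { label 'example-board' }
            steps {
                // Per-node limit, applied as a step inside the allocated node
                timeout(time: 45, unit: 'MINUTES') {
                    // Start from a clean working directory on this node
                    deleteDir()
                    // A failure in here (e.g. unstash) marks only this stage as failed
                    catchError(buildResult: 'UNSTABLE', stageResult: 'FAILURE') {
                        unstash 'sources'
                        // Run every test before the node is released
                        sh 'make flash'
                        sh 'make test'
                    }
                }
            }
        }
    }
}
```

Keeping the unstash and all test steps inside one stage means the node and its workspace are held until the tests are done, which is the point of the second item in the list.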

Testing Procedure

Check the CI. I don't know how the params will affect everything, but it is better than what we have now, and we can always fix it later if an issue occurs.

Related Issues

Checks some boxes on #66

MrKevinWeiss commented 4 years ago

I think I need to add the params and get rid of the RIOT submodule change.

MrKevinWeiss commented 4 years ago

Also it seems like I am not getting the notifications... I don't know why... yet!

MrKevinWeiss commented 4 years ago

I am wondering if the overall timeout is a good idea, as it may cause some problems if there are many jobs in the queue: the nodes can be blocked for a long time while the timer on the master keeps running...

cgundogan commented 4 years ago

I am wondering if the overall timeout is a good idea

I guess having that global timeout throughout the job lifetime is good. It's very unlikely, but if we observe hangs in the setup or notification phase (or in future stages), then the global timeout seems to be our only rescue. Of course, we could also wrap each of those stages in a local timeout, but the global one is more convenient.

MrKevinWeiss commented 4 years ago

I guess having that global timeout throughout the job lifetime is good.

Good. The problem is the timing: if I start 10 jobs at once, the last one has to wait for the nodes to finish the earlier ones, meaning my timeout would need to be some function of the number of running jobs or something (it would take at least 3 hours to run through 10 jobs).

Anyways, currently it is set to 1 hour. I think that is fine if we don't have to wait for other jobs to finish with the node, but that currently is not the case. What would be a good balance?

MrKevinWeiss commented 4 years ago

Darn, it also seems like catching the errors prevents timeouts and aborts from taking effect. Maybe for the time being I increase everything to something that should work, and we can tune it later once I figure out how to distinguish error types (i.e. whether a timeout occurred or a stop message occurred).
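
One possible way to tell the error types apart, sketched here under assumptions rather than taken from this PR: in Jenkins, the timeout step and a manual abort typically surface as a FlowInterruptedException, so a try/catch can rethrow that case and only swallow the rest. The helper name runTolerant, the node label and the stash name are hypothetical.

```groovy
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException

// Hypothetical helper: tolerate ordinary failures, but let timeouts and aborts through.
def runTolerant(Closure body) {
    try {
        body()
    } catch (FlowInterruptedException e) {
        // Raised by the timeout step or a manual stop; rethrow so the build really ends
        throw e
    } catch (Exception e) {
        // Any other failure: record it and carry on (UNSTABLE is an assumed choice)
        echo "Step failed: ${e.message}"
        currentBuild.result = 'UNSTABLE'
    }
}

node('example-board') {
    timeout(time: 1, unit: 'HOURS') {
        runTolerant {
            unstash 'sources'
        }
    }
}
```

The order of the catch clauses matters: the interruption has to be matched before the generic Exception handler, otherwise it would be swallowed again.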

MrKevinWeiss commented 4 years ago

Oh man... the timeout actually seems not too nice..

MrKevinWeiss commented 4 years ago

Maybe it is ready. Still could use some work but there was at least one case where the timeouts and exiting worked out well. It would be nice to get this in by the end of the day.

cgundogan commented 4 years ago

Is there a test run that I can look at? Back at Jenkins I couldn't find any

MrKevinWeiss commented 4 years ago

Looks like there is still some work to do. Darn. Will take care of it tomorrow.

MrKevinWeiss commented 4 years ago

I tried to simplify the catchError command since I need to use try catch anyways. The problem is now I cannot see failures in the stages. The catchError allowed me to set buildResult and stageResult but it appears I don't have that control with the currentBuild global variable. I guess I am really struggling with the documentation on what I have access to.

Should I just call it quits and use a catchError with a try/catch that lets me rethrow the caught error outside the catchError context? Or can we accept that things look like they are passing when they are not (we still get correct test results)? Or should I continue to search for a way to try/catch and fail only that stage?

For some reason the robot-test fail case seems to function properly as the unstable setting is showing up.
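
For reference, a minimal sketch of the catchError variant discussed above, assuming a version of the Pipeline: Basic Steps plugin that supports the stageResult and catchInterruptions parameters. The node label, stage name, stash name and make target are hypothetical and not taken from this PR.

```groovy
node('example-board') {
    stage('flash and test') {
        // Mark only this stage as failed and the build as unstable, then continue.
        // catchInterruptions: false lets timeouts and manual aborts propagate
        // instead of being caught like ordinary errors.
        catchError(buildResult: 'UNSTABLE',
                   stageResult: 'FAILURE',
                   catchInterruptions: false) {
            unstash 'sources'
            sh 'make flash test'
        }
    }
}
```

With this form the stage result stays visible in the stage view, which is the control that setting currentBuild.result alone does not give.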

MrKevinWeiss commented 4 years ago

I confirmed the node timeout only starts ticking after the node is acquired. I set it to 1 hour and the whole process to 3 hours.
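
The nesting this implies might look like the following sketch; the node label and the make target are hypothetical. The outer timeout covers time spent waiting in the queue, while the inner one sits inside the node block and so only starts once the node has been acquired.

```groovy
// Whole process: includes any time spent waiting for a free node
timeout(time: 3, unit: 'HOURS') {
    node('example-board') {
        // Per-node limit: starts only after the node has been acquired
        timeout(time: 1, unit: 'HOURS') {
            sh 'make test'
        }
    }
}
```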

There are still some strange things happening when we try to stop a job while it is changing states, but it just requires an additional stop and then it seems fine. I think we can leave it for now, as we have not had many lockup problems yet.

MrKevinWeiss commented 4 years ago

Thanks for all the help!

RIOT-OS / RobotFW-tests