drone-plugins / drone-s3-cache

Caches build artifacts to S3 compatible storage backends
http://plugins.drone.io/drone-plugins/drone-s3-cache
Apache License 2.0
28 stars 30 forks source link

Add Timeout functionality #51

Open jbrockopp opened 5 years ago

jbrockopp commented 5 years ago

Currently, if we experience issues with our S3 cache service provider, it can cause the plugin to "hang". This causes the build to be stuck in a "running" state until the build reaches the timeout (usually 1 hour).

A single occurrence of the issue is not a problem, but when it happens across the entire enterprise, it causes the number of pending builds in the queue to grow massively because our agent capacity is consumed by all the "stuck" cache builds.

The logs produced in the step are:

time="2019-03-14T13:29:38Z" level=info msg="No path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No fallback_path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No flush_path specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="No filename specified. Creating default"
time="2019-03-14T13:29:38Z" level=info msg="Installing new ca certificate at /etc/ssl/certs/ca-certificates.crt"
time="2019-03-14T13:29:38Z" level=info msg="Restoring cache at /prod-drone-cache/drone/toss-test/master/archive.tar"
time="2019-03-14T13:29:38Z" level=info msg="Retrieving file in prod-drone-cache at drone/toss-test/master/archive.tar"
time="2019-03-14T13:29:38Z" level=info msg="Copying object from the server"

This leads me to believe it's getting hung on line #93 because we never see the log message from line #99.

https://github.com/drone-plugins/drone-s3-cache/blob/327bde5b538acb78a6a595e27bffb5a5ed76cfce/storage/s3/s3.go#L91-L101

To remediate this issue, we decided to create our own timeout feature that we embedded into the plugin and drone/drone-cache-lib.

Would this be a feature the Drone admins were willing to accept into the plugin? I've attached the changesets below so you can see how we went about resolving this issue.

Repo Changes:

tboerger commented 5 years ago

IMHO it's a really interesting feature, other opinions?

bradrydzewski commented 5 years ago

we might be able to simplify by using context and WithTimeout like this: http://ixday.github.io/post/golang-cancel-copy/

-numBytes, err := io.Copy(dst, object)
+err := Copy(ctx, dst, object)

I am not sure if the code in the referenced blog post is any good. But I like the approach of abstracting this to a separate function that resembles io.Copy but accepts and uses context.

jbrockopp commented 5 years ago

@bradrydzewski I agree that would be a much simpler implementation and I'm happy to try and submit something to that nature.

My only concern is, what if the plugin "hangs" on a different part of the function? We'd have to be very clear on the implementation that the timeout flag for the plugin was only a timeout for how long to wait for the copy function.

To be specific, what if the minio GetObject function hangs inside the Get function?

https://github.com/drone-plugins/drone-s3-cache/blob/327bde5b538acb78a6a595e27bffb5a5ed76cfce/storage/s3/s3.go#L71-L102

Sure, we could swap to using the GetObjectWithContext function but then we deal with having the context be applied to both the Copy function (your pseudo code provided above) and the GetObjectWithContext function.

This could have unintended consequences if both the GetObjectWithContext and Copy function don't exceed the timeout, but combined together they could exceed the timeout.

Do you have thoughts on that?

Baccata commented 5 years ago

I seem to be having a similar issue, when the AWS IAM role available to drone does not let him create a bucket : The logs indicate that

Logs (seems to be hanging there)

1 | time="2019-06-25T10:34:54Z" level=info msg="No path specified. Creating default"
-- | --
2 | time="2019-06-25T10:34:54Z" level=info msg="No fallback_path specified. Creating default"
3 | time="2019-06-25T10:34:54Z" level=info msg="No flush_path specified. Creating default"
4 | time="2019-06-25T10:34:54Z" level=info msg="No filename specified. Creating default"
5 | time="2019-06-25T10:34:54Z" level=info msg="Rebuilding cache at /omelois/drone-test/master/archive.tar"
6 | time="2019-06-25T10:34:54Z" level=info msg="Rebuilding cache at [pipo] to /omelois/drone-test/master/archive.tar"
7 | time="2019-06-25T10:34:54Z" level=info msg="Uploading to bucket omelois at drone-test/master/archive.tar"

Pipeline

kind: pipeline
name: default

steps:
- name: restore
  image: plugins/s3-cache
  settings:
    pull: true
    restore: true

- name: build
  image: bash
  commands:
    - mkdir pipo
    - echo "hello" > pipo/hello.txt

- name: rebuild
  image: plugins/s3-cache
  settings:
    pull: true
    rebuild: true
    mount:
      - pipo
    when:
      event: push
wadeholler commented 4 years ago

Our IAM role is enabled - other s3 actions are being performaned from the same kubernetes cluster that our drone install runs in - and I think we are experiancing the same behavior as @Baccata describes.


time="2020-05-21T18:46:44Z" level=info msg="No path specified. Creating default"
--
2 | time="2020-05-21T18:46:44Z" level=info msg="No fallback_path specified. Creating default"
3 | time="2020-05-21T18:46:44Z" level=info msg="No flush_path specified. Creating default"
4 | time="2020-05-21T18:46:44Z" level=info msg="No filename specified. Creating default"
5 | time="2020-05-21T18:46:44Z" level=info msg="Rebuilding cache at /obfuscated-drone-s3-mvn-cache/~obfuscated/obfuscated-java/master/archive.tar"
6 | time="2020-05-21T18:46:44Z" level=info msg="Rebuilding cache at [.m2/repository] to /obfuscated-drone-s3-mvn-cache/~obfuscated/obfuscated-java/master/archive.tar"
7 | time="2020-05-21T18:46:44Z" level=info msg="Uploading to bucket obfuscated-drone-s3-mvn-cache at ~obfuscated/obfuscated-java/master/archive.tar"
wadeholler commented 4 years ago

update - this was solved by entering the endpoint for our region of aws. still struggling with how to make it "backup" /root/.m2/repository and restore it before the mvn build though.