broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

"Copying directories not supported" on AWS #4982

Closed: vortexing closed this issue 4 years ago

vortexing commented 5 years ago

I ran a workflow using Cromwell v40 on the AWSBatch backend that had as outputs one outputfile.vcf for each of the shards in the workflow:

output {
    Array[File] outputs = task.outputvcf
}

I used the following workflow options:

{
    "workflow_failure_mode": "NoNewCalls",
    "default_runtime_attributes": {
        "maxRetries": 1
    },
    "final_workflow_outputs_dir": "s3://bucket/Cromwell/results",
    "use_relative_output_paths": "false",
    "final_workflow_log_dir": "s3://bucket/Cromwell/workflowLogs",
    "final_call_logs_dir": "s3://bucket/Cromwell/workflowLogs"
}

All calls of the workflow completed successfully but the workflow itself failed.

The error message I got was: "copying directories is not yet supported: s3://s3.amazonaws.com/bucket/Cromwell/results/workflowName/1ec38d0b-afc4-4cd5-90f1-f015395d6e36/call-task/shard-0/outputfile.vcf"

Oddly enough, the correct prefixes for the output files were created in the correct S3 bucket, but there is no object under them, and via the CLI they appear as directories. ???

For the logs, an empty prefix was created, and the log file was written successfully one level higher than the prefix it is supposed to be in. So instead of:

s3://bucket/Cromwell/workflowLogs/workflowName/<workflowid>.log

there is:

s3://bucket/Cromwell/workflowLogs/workflowName/ (empty prefix)
s3://bucket/Cromwell/workflowLogs/<workflowid>.log (successfully written file)
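To double-check what actually got written, the raw keys can be listed directly. S3 has no real directories; what the console and CLI render as a folder is just a shared key prefix or a zero-byte marker object. A hedged sketch, using the bucket and prefix from above:

aws s3api list-objects-v2 \
    --bucket bucket \
    --prefix Cromwell/workflowLogs/ \
    --query 'Contents[].{Key: Key, Size: Size}'
# A zero-byte object whose key ends in "/" is a directory marker, not a file.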

Thoughts?

robthompsonweb commented 5 years ago

Running into the same issue as well.

java -Dconfig.file=cromwell-aws.conf -jar cromwell-42.jar run file_copy_test.wdl --options cromwell-options.json

Tried with v40, v41, and v42; same issue each time.

Quick test script hacked together from other scripts: file_copy_test.wdl

workflow WGS_BAM_to_GVCF {
    String input_file = "s3://bucket/file"

    # Merge per-interval GVCFs
    call MergeGVCFs {
        input:
            input_file = input_file
    }

    # Outputs that will be retained when execution is complete
    output {
        File output_vcf = MergeGVCFs.output_vcf
    }
}

#### TASKS ####

# Merge GVCFs generated per-interval for the same sample
task MergeGVCFs {
    File input_file
    String output_file_name = "output.txt"

    Int machine_mem_gb = 2
    Int command_mem_gb = machine_mem_gb - 1

    command {
        echo ${input_file} > ${output_file_name}
    }

    runtime {
        docker: "ubuntu"
        memory: "${machine_mem_gb}G"
        cpu: 1
    }

    output {
        File output_vcf = "${output_file_name}"
    }
}

cromwell-options.json

{
    "final_workflow_outputs_dir": "s3://3-bucket",
    "use_relative_output_paths": true,
    "final_workflow_log_dir": "s3://s3-bucket/wf_logs"
}

error:

[2019-06-15 19:50:15,63] [error] WorkflowManagerActor Workflow c9dd69e1-121e-45bc-911f-92d6bb6a2074 failed (during FinalizingWorkflowState): cromwell.engine.io.IoAttempts$EnhancedCromwellIoException: [Attempted 1 time(s)] - IllegalArgumentException: copying directories is not yet supported: s3://s3bucket/WGS_BAM_to_GVCF/c9dd69e1-121e-45bc-911f-92d6bb6a2074/call-MergeGVCFs/output.txt
Caused by: java.lang.IllegalArgumentException: copying directories is not yet supported: s3://s3bucket/c9dd69e1-121e-45bc-911f-92d6bb6a2074/call-MergeGVCFs/output.txt
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)
    at org.lerch.s3fs.S3FileSystemProvider.copy(S3FileSystemProvider.java:420)
    at java.nio.file.Files.copy(Files.java:1274)
    at better.files.File.copyTo(File.scala:663)
    at cromwell.core.path.BetterFileMethods.copyTo(BetterFileMethods.scala:425)
    at cromwell.core.path.BetterFileMethods.copyTo$(BetterFileMethods.scala:424)
    at cromwell.filesystems.s3.S3Path.copyTo(S3PathBuilder.scala:160)
    at cromwell.engine.io.nio.NioFlow.$anonfun$copy$1(NioFlow.scala:84)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
    at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
    at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:351)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:372)
    at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:312)
    at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

wleepang commented 5 years ago

The root cause is in the s3fs S3FileSystemProvider library. Support for copying directories appears to have been implemented in this fork.
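Based on the stack trace above, the failing check is a Preconditions.checkArgument inside S3FileSystemProvider.copy. A minimal sketch of what that guard presumably looks like (not the verbatim library source; the class name here is made up):

import static com.google.common.base.Preconditions.checkArgument;

import java.io.IOException;
import java.nio.file.CopyOption;
import java.nio.file.Files;
import java.nio.file.Path;

public abstract class DirectoryRejectingProvider {
    // Sketch only: the copy bails out whenever the source path reports itself
    // as a directory. S3 has no real directories, so Files.isDirectory can
    // misfire for keys that merely look like prefixes, rejecting perfectly
    // valid files such as the output.txt above.
    public void copy(Path source, Path target, CopyOption... options) throws IOException {
        checkArgument(!Files.isDirectory(source),
                "copying directories is not yet supported: %s", source);
        // ... the single-object copy would follow here ...
    }
}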

vortexing commented 5 years ago

ha.... so Lee, we should just use Nextflow eh? :)

wleepang commented 5 years ago

That was the only version of the library I could quickly find that had support for directory copying. It's worth looking into what changes were necessary.

wleepang commented 5 years ago

A little more research suggests that a TransferManager (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/index.html?com/amazonaws/services/s3/transfer/TransferManager.html) might be what is needed to support copying both files and directories in this method: https://github.com/broadinstitute/cromwell/blob/19d86f07d5618760e420cc218c7ff7749711d6ba/filesystems/s3/src/main/java/org/lerch/s3fs/S3FileSystemProvider.java#L411
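A hedged sketch of what that could look like with the v1 SDK: single objects go through TransferManager.copy, and a "directory" (really a key prefix on S3) is handled by listing the keys under it and copying each one. The class and method names below are illustrative, not Cromwell code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

public class S3PrefixCopy {
    // Copy every object under srcPrefix to the same relative key under dstPrefix.
    public static void copyPrefix(String srcBucket, String srcPrefix,
                                  String dstBucket, String dstPrefix) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();
        try {
            ListObjectsV2Request req = new ListObjectsV2Request()
                    .withBucketName(srcBucket)
                    .withPrefix(srcPrefix);
            ListObjectsV2Result page;
            do {
                page = s3.listObjectsV2(req);
                for (S3ObjectSummary obj : page.getObjectSummaries()) {
                    String dstKey = dstPrefix + obj.getKey().substring(srcPrefix.length());
                    // TransferManager transparently uses multipart copy for large objects.
                    tm.copy(srcBucket, obj.getKey(), dstBucket, dstKey).waitForCompletion();
                }
                req.setContinuationToken(page.getNextContinuationToken());
            } while (page.isTruncated());
        } finally {
            tm.shutdownNow(false); // leave the shared S3 client running
        }
    }
}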

SergeySdv commented 5 years ago

@vortexing I'm currently working on this issue. Can you please provide your full WDL and JSON files? Also, what is the aim of your workflow? I need to make sure this is not an issue with path building in the Cromwell methods. Thanks.

vortexing commented 5 years ago

Did you try the test WDL and JSON provided by robthompsonweb? All of my workflows hit this same error, and they all have essentially the same basic structure.

TimurKustov commented 5 years ago

@robthompsonweb As I understand it, you expect output.txt to be a file (not a directory) located at s3://3-bucket/WGS_BAM_to_GVCF/workflow/call-MergeGVCFs/output.txt/ and the logs at s3://s3-bucket/wf_logs. Am I right?

robthompsonweb commented 5 years ago

@TimurKustov Correct, apart from the trailing slash on output.txt. Not sure if that is a typo, but it would break things on S3. It would copy output.txt to s3://3-bucket/WGS_BAM_to_GVCF/workflow/call-MergeGVCFs/output.txt, i.e. aws s3 cp localpath/WGS_BAM_to_GVCF/workflow/call-MergeGVCFs/output.txt s3://3-bucket/WGS_BAM_to_GVCF/workflow/call-MergeGVCFs/output.txt

I would expect it to copy every file from the outputs into the path specified by final_workflow_outputs_dir. With use_relative_output_paths set to true, that would be: s3://3-bucket/output.txt
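In other words, as I read the docs (hedged, and with <workflow-id> as a placeholder), the two settings should land the file at:

use_relative_output_paths = true:  s3://3-bucket/output.txt
use_relative_output_paths = false: s3://3-bucket/WGS_BAM_to_GVCF/<workflow-id>/call-MergeGVCFs/output.txt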

TimurKustov commented 5 years ago

We are ready to submit a PR with a fix for this issue, and we have performed manual testing on several backends (AWS, GCP, Local), but we ran into some trouble creating an integration test for it: in particular, we didn't find a way to pass workflow options to Cromwell running in server mode. Is this possible, and is an integration test required for this issue? @wleepang

geoffjentry commented 5 years ago

@TimurKustov You just need to specify it in the Centaur test description, with a pointer to where the options file lives.
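For reference, the test descriptions live under centaur/src/main/resources/standardTestCases in the repo. A rough sketch of what one might look like (the test name and file names here are made up for illustration):

name: s3_final_workflow_outputs
testFormat: workflowsuccess

files {
  workflow: s3_final_workflow_outputs/s3_final_workflow_outputs.wdl
  options: s3_final_workflow_outputs/s3_final_workflow_outputs.options
}

metadata {
  status: Succeeded
}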

svitkovsergey commented 5 years ago

@geoffjentry Could you also please help with another question about *.test files: is it possible to run multiple separate workflows (WDL or CWL) in one test? Not imported as sub-workflows, but as completely independent workflows? Thanks!

geoffjentry commented 5 years ago

@likeanowl Not really. What are you looking to do here?

svitkovsergey commented 5 years ago

@geoffjentry I'm trying to write an integration test for my fix for this issue. @TimurKustov and I came up with the idea of executing two workflows sequentially: the first produces the outputs, results, and call logs to be copied, and the second checks that those files exist and were placed correctly.

svitkovsergey commented 5 years ago

@wleepang I've added the TransferManager and some other improvements in PR https://github.com/broadinstitute/cromwell/pull/5110. Could you please review it?

svitkovsergey commented 5 years ago

@wleepang @geoffjentry Any updates about PR #5110 ? :smile:

wleepang commented 5 years ago

@likeanowl Took a look at the PR. Overall it looks good, but I had a couple of questions. Do the new integration tests you mention cover the points I brought up, i.e. mostly around default credentials use and default region config?
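(To be concrete about what I mean by the defaults: credentials and region should resolve through the SDK's standard provider chains rather than being hard-coded. A hedged v1-SDK sketch, not the PR's actual code:)

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.regions.DefaultAwsRegionProviderChain;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

class DefaultS3Client {
    // Resolve credentials and region from the standard chains
    // (env vars, ~/.aws config files, instance/container role)
    // instead of hard-coding them.
    static AmazonS3 build() {
        return AmazonS3ClientBuilder.standard()
                .withCredentials(DefaultAWSCredentialsProviderChain.getInstance())
                .withRegion(new DefaultAwsRegionProviderChain().getRegion())
                .build();
    }
}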

juhawilppu commented 5 years ago

I'm getting the same error when using

    output {
        Array[File] files = glob("*.txt")
    }

aednichols commented 4 years ago

PR #5110 has merged, closing