ENCODE-DCC / caper

Cromwell/WDL wrapper for Python
MIT License

metadata.json has changed significantly #102

Open biodavidjm opened 3 years ago

biodavidjm commented 3 years ago

I have noticed that the latest caper version 1.4.2 no longer writes the metadata.json file continuously, which I guess is a good thing: the server used to crash when the metadata.json file had to be rewritten too often. This is great, because a gigantic job that always crashed with the previous caper version now completes.

But I still need the metadata.json file. The way I found to generate the metadata.json file is by running this command:

caper metadata e8c2155f-ee2c-4eac-8aa9-a32cdbbd4de0 > metadata.json

But to my surprise, there are many changes in the structure of the file, for example:

  1. When the job is re-run and most of the previous (failed) runs are cached, many of the values printed in the metadata.json file no longer work, i.e., the "job / bucket id" is not updated. This means that if the metadata shows, for example:
gs://proteomics-pipeline/results/proteomics_msgfplus/87b2ae48-889a-4ece-a83e-2f0e77122392/call-msconvert_mzrefiner/shard-0/stdout

in reality, that stdout is not in that bucket folder, but under the previous job's directory:

gs://proteomics-pipeline/results/proteomics_msgfplus/e8c2155f-ee2c-4eac-8aa9-a32cdbbd4de0/call-msconvert_mzrefiner/shard-0/stdout
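
A quick way to check which of the two prefixes actually exists (gsutil assumed configured):

    $ gsutil ls gs://proteomics-pipeline/results/proteomics_msgfplus/87b2ae48-889a-4ece-a83e-2f0e77122392/call-msconvert_mzrefiner/shard-0/
    $ gsutil ls gs://proteomics-pipeline/results/proteomics_msgfplus/e8c2155f-ee2c-4eac-8aa9-a32cdbbd4de0/call-msconvert_mzrefiner/shard-0/
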
  2. The JSON key-value "commandLine" has disappeared and is not available anymore!! Is there any other way to find the command that was run?
  3. Could the metadata.json be written to the original bucket where all the output data is located, instead of to the local VM folder from which the caper metadata command is run?

Thanks a lot for this great tool!

leepc12 commented 3 years ago

What Cromwell version are you using? Please check cromwell= in your conf file ~/.caper/default.conf. If you don't have it there, then you are using the default Cromwell 52 (Caper v1.4.2).

I think those changes in metadata.json are due to the change of Cromwell versions: old Caper uses old Cromwell.

  1. I think it's a known bug of Cromwell. Upgrade Cromwell to the latest version and see if it's fixed. I have also observed that call-cached tasks sometimes have wrong file paths written in metadata.json (paths which should not even exist due to call-caching).

  2. I actually don't know about the key commandLine. You can look into the script file to get the actual command lines for a task (see the sketch after the snippet below).

  3. You can use gsutil cp:

    $ caper metadata WORKFLOW_ID > metadata.json
    $ WORKFLOW_ROOT=$(jq -r .workflowRoot metadata.json)
    $ gsutil cp metadata.json "${WORKFLOW_ROOT%/}/"
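
Re: 2, the script file should sit next to stdout in each call's directory on the GCS bucket (untested sketch, path borrowed from your example above):

    $ gsutil cat gs://proteomics-pipeline/results/proteomics_msgfplus/e8c2155f-ee2c-4eac-8aa9-a32cdbbd4de0/call-msconvert_mzrefiner/shard-0/script
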
biodavidjm commented 3 years ago

Hi @leepc12

  1. I am using the latest version available when caper is installed, i.e. cromwell-52.jar. What version should I use? This version of Cromwell was working perfectly fine with caper 1.4.1.
  2. It was available in this part of the metadata.json file:
        "pipeline.masic": [
            {
                "preemptible": false,
                "executionStatus": "Done",
                "stdout": "gs://whatever-pipeline/results/pipeline/85b88fd2-9669-4fd0-b605-7a2cd23591b6/call-masic/shard-0/stdout",
                "backendStatus": "Success",
                "compressedDockerSize": 254658935,
                "commandLine": "echo \"STEP 0: Ready to run MASIC\"\n\nmono /app/masic/MASIC_Console.exe \\\n/I:/cromwell_root/proteomics-pipeline/test/raw/global/MoTrPAC_Pilot_TMT_W_S1_01_12Oct17_Elm_AQ-17-09-02.raw \\\n/P:/cromwell_root/proteomics-pipeline/parameters/TMT10_LTQ-FT_10ppm_ReporterTol0.003Da_2014-08-06.xml \\\n/O:output_masic",
                "shardIndex": 0,

That commandLine key-value was very useful and it is now gone.
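
With the old metadata files I could pull it out with a one-liner like this (assuming the usual top-level "calls" layout of Cromwell metadata):

    $ jq -r '.calls["pipeline.masic"][].commandLine' metadata.json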

  3. Yes, that I know ;-) But it would be great if caper took care of putting the metadata.json file there, just as before 1.4.2: for example, write the metadata.json file to the working directory once the job is "done" or "failed", without the need to call the command and create the file locally.

Again, thank you very much for the great tool

leepc12 commented 3 years ago

  1. Try the latest one. Find the URLs for cromwell and womtool on Cromwell's GitHub releases page and define them in the conf file (cromwell=http://.../cromwell-VER.jar and womtool=http://.../womtool-VER.jar).
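
For example, these two lines in ~/.caper/default.conf (version 58 and the URLs are just illustrative; pick the latest release):

    cromwell=https://github.com/broadinstitute/cromwell/releases/download/58/cromwell-58.jar
    womtool=https://github.com/broadinstitute/cromwell/releases/download/58/womtool-58.jar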

  2. I actually don't know; Caper just wraps Cromwell's REST API to retrieve metadata.

  3. Maybe I can add some parameter like caper metadata WORKFLOW_ID --write-on-workflow-root so that it writes to a file on the bucket instead of printing out to STDOUT.

biodavidjm commented 3 years ago

> Maybe I can add some parameter like caper metadata WORKFLOW_ID --write-on-workflow-root so that it writes to a file on the bucket instead of printing out to STDOUT.

Adding that parameter would be extremely helpful! Thanks a lot!