Simon-Harris-IBM / ObjectNetChallenge-Workflows

Workflows for ObjectNet Challenge

Proper placement of output files and log files #18

Closed Simon-Harris-IBM closed 4 years ago

Simon-Harris-IBM commented 4 years ago

We need proper placement of all output files generated by a submission, which include the zipped log files, predictions.csv and results.json.

The files should be placed in such a way that:

  1. log files for a main submission should only be accessible by the queue admin and NOT submitter
  2. predictions.csv should only be accessible by the queue admin and NOT submitter
  3. results.json should be accessible by the queue admin and the submitter

Simon-Harris-IBM commented 4 years ago

Modified lines 110 and 130 of docker_agent_workflow.cwl so that parentid="#adminUploadSynId". This forces the stdout from running the docker image ("_log.txt") and the predictions.csv file to be placed into a LOCKED directory that only the queue admin can view, which is good. Also modified line 180 to parentid="#submitterUploadSynId" to upload results.json using the submitter id.

The fact that we have a fanout mechanism complicates matters because the submitterid of the docker image to the main fanout queue is different to the submitterid to the main execution queue.

Let's focus initially on getting the files correctly placed/authorisations for the backend execution queue.

Ideally, we want "_logs.zip", "_log.txt" & predictions.csv all in a single folder to which only the queue admin has access, and results.json in a folder which both the queue admin and original submitter have access to.

@thomasyu888 Is this possible ?

thomasyu888 commented 4 years ago

Ah, yes, I totally forgot about this. It is possible. We would need a tool to obtain the lock and non-lock folders of the submission from the fanout queue. I will work on this sometime tomorrow.

Simon-Harris-IBM commented 4 years ago

Thanks Tom..... Having just talked with MIT, they also want to hide the toil log files in the 'express-lane'. All they want output is the stdout/stderr from the docker container.

Simon-Harris-IBM commented 4 years ago

Just to summarise, this is how we'd like to have the files placed:

  1. toil logs ("_logs.zip") should be hidden from and inaccessible to participants
  2. predictions.csv should be hidden from and inaccessible to participants
  3. stdout/stderr from the docker containers ("_log.txt") should be accessible to participants
  4. results.out should be accessible to participants

This should be the same for both express and main queues. One potential issue is that 3. may create some security holes as participants will be able to submit an image to try and find/print the gold label file and then use this to cheat.

But we can discuss this internally. I think if we get the above layout working, we can fine tune it later.

Thanks....

thomasyu888 commented 4 years ago

Thanks for the summary. Here are my responses.

  1. I think toil logs can be disabled by setting the .env parameter SHARE_RESULTS_IMMEDIATELY=false in the orchestrator. Unfortunately, this removes the text that tells participants where their log files are.
  2. Upload this file into the fanout queue LOCK folder (not the LOCK folder generated by the main queues)
  3. You can upload this into the admin LOCK folder like point 2 above for the main queues, but give them to participants for the express lane submissions
  4. Upload this file into the fanout queue folder

I will need to write a tool to get the participants fanout folder and folder-LOCK.
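The routing in responses 2 to 4 above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this thread, not code from the workflow: the `Queue` enum, the `destination` function and the two folder labels are hypothetical names, and the toil logs are left out because response 1 handles them through the orchestrator setting rather than through an upload.

```python
from enum import Enum


class Queue(Enum):
    MAIN = "main"
    EXPRESS = "express"


# Hypothetical labels for the two folders of the fanout-queue submission.
LOCKED = "fanout_locked_folder"
SHARED = "fanout_shared_folder"


def destination(filename: str, queue: Queue) -> str:
    """Return the folder a file should be uploaded to, per responses 2-4."""
    if filename.endswith("predictions.csv"):
        # Never visible to participants.
        return LOCKED
    if filename.endswith("_log.txt"):
        # Docker stdout/stderr: shared only for express-lane submissions.
        return SHARED if queue is Queue.EXPRESS else LOCKED
    if filename.endswith("results.json"):
        # Scores are always visible to the submitter.
        return SHARED
    raise ValueError(f"no placement rule for {filename!r}")
```

For example, `destination("9702407_log.txt", Queue.MAIN)` gives the locked folder, while the same file from an express-lane run gives the shared one.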

Simon-Harris-IBM commented 4 years ago

I set SHARE_RESULTS_IMMEDIATELY=false in the .env files but all output files are now in the _LOCKED folder. There is nothing in the unlocked folder. My understanding is the following line should have uploaded the results.json file into the unlocked folder:

  # Upload the validated scores so we have a record
  upload_validation:
    run: https://raw.githubusercontent.com/Sage-Bionetworks/ChallengeWorkflowTemplates/v2.1/upload_to_synapse.cwl
    in:
      - id: infile
        source: "#validate_and_score/results"
      - id: parentid
        #source: "#adminUploadSynId"
        source: "#get_docker_submission/submitter_synid"
      - id: used_entity
        source: "#get_docker_submission/entity_id"
      - id: executed_entity
        source: "#workflowSynapseId"
      - id: synapse_config
        source: "#synapseConfig"
    out:
      - id: uploaded_fileid
      - id: uploaded_file_version
      - id: results

See https://www.synapse.org/#!Synapse:syn21750580 for an example.

@thomasyu888 any idea what might be wrong ?

thomasyu888 commented 4 years ago

@Simon-Harris-IBM Ah, after doing some investigation, I realized that by setting SHARE_RESULTS_IMMEDIATELY=false, the orgSagebionetworksSynapseWorkflowOrchestratorSubmissionFolder gets set to the _LOCK folder, which is why everything gets uploaded there. For now, please set this back to SHARE_RESULTS_IMMEDIATELY=true. Here are some solutions.

  1. Unfortunately, Bruce is out of office, so he won't get to fix this. My proposed fix can be found here: https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator/issues/21.

  2. We tell participants to just ignore the .zip folder

  3. This solution is a bit more cumbersome, but we can create separate folders completely separate from what the orchestrator creates. (I'm not a huge fan of this workflow, but if we are in a time crunch...)

a. Participant submits to main queue.
b. A CWL tool creates a folder inside of a submission_results folder we create in Synapse and grant access to only the team that submitted.

submission_results/
submission_results/submissionid/
submission_results/submissionid/results.json
submission_results/submissionid/submission_id.logs
submission_results/submissionid2/
...

c. We annotate the main queue submission with this folder id and pass this along to the other queues.
d. We set SHARE_RESULTS_IMMEDIATELY=false, then participants will know nothing about the file structure the orchestrator creates.
e. We would need to write a notification tool that emails participants with the location of their log file and set SUBMITTER_NOTIFICATION_MASK to not send the submission process started email. (We would need to do this for 1. as well)
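A rough Python sketch of the folder layout in solution 3, for discussion only. It just plans the paths and the readers of each per-submission folder and does not call Synapse. `ResultsFolder`, `plan_results_folder` and `planned_files` are hypothetical names, and including the queue admin among the readers is an assumption on top of "only the team that submitted".

```python
from dataclasses import dataclass, field


@dataclass
class ResultsFolder:
    path: str
    readers: set = field(default_factory=set)


def plan_results_folder(submission_id, team_id, admin_id,
                        root="submission_results"):
    """Plan one folder per main-queue submission, readable only by the
    submitting team (plus, as an assumption, the queue admin)."""
    return ResultsFolder(path=f"{root}/{submission_id}",
                         readers={team_id, admin_id})


def planned_files(folder, submission_id):
    """Files participants would find in their folder, per the tree above."""
    return [
        f"{folder.path}/results.json",
        f"{folder.path}/{submission_id}.logs",
    ]
```

In step c, the id of the folder planned here is what would be annotated on the main queue submission and passed to the other queues.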

Simon-Harris-IBM commented 4 years ago

Thanks @thomasyu888 Is there no way to simply copy the 2 files from the _LOCKED folder into the shared folder ? The workflow orchestrator makes some mention of this here: https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator#uploading-results

thomasyu888 commented 4 years ago

@Simon-Harris-IBM, I myself am unsure what that external process looks like. Maybe try setting DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID to your Synapse user id (yours is 3400238).

Maybe that would set both annotations for the lock folder and the non_lock folder. Please try to run a submission with

SHARE_RESULTS_IMMEDIATELY=false
DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID=3400238

Please let me know when the run is complete, I will check it out

Simon-Harris-IBM commented 4 years ago

Tested a run using my simonh id:

So there doesn't appear to be any difference from just setting SHARE_RESULTS_IMMEDIATELY=false

thomasyu888 commented 4 years ago

@Simon-Harris-IBM, that's what I suspected. I think this external process was actually done by crawling through the file structure to move files to the appropriate folder. I still think 1. is the best option, but Bruce is out until Wednesday.

How did you get the shared files synapse id?

Simon-Harris-IBM commented 4 years ago

I also think #1 will be the best solution. Let's wait until Bruce returns to see what he thinks.

I've annotated the dashboards with all the info from the results.json file. This will at least mean participants will not have to look for the results.json file to get their results, as they can see them all right there in the dashboard. So the only file we really have to worry about copying over is the _log.txt file. See: https://www.synapse.org/#!Synapse:syn21445379/wiki/601551

thomasyu888 commented 4 years ago

@Simon-Harris-IBM Can you explain in more detail how you obtained the shared file's Synapse id? The handling of this id is an important part of this.

My guess is that the submitterUploadSynId is being replaced with the locked folder synapse id. Please see here: https://www.synapse.org/#!Synapse:syn21445381/wiki/601587. Notice how your last two runs in the main queue have the same annotation for Log Folder and Admin Folder?
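The guess above can be written down as a small Python model. It follows the behaviour described in this thread (with SHARE_RESULTS_IMMEDIATELY=false the submission folder becomes the locked folder) and is not taken from the orchestrator source; the function name and the idea that the workflow receives exactly these two ids are assumptions.

```python
def workflow_upload_ids(share_results_immediately: bool,
                        shared_folder_id: str,
                        locked_folder_id: str) -> dict:
    """Model which folder ids the workflow would receive."""
    # Per the thread: with the flag off, the submission folder is the
    # locked folder, so the "submitter" id stops pointing at the shared one.
    submission_folder = (shared_folder_id if share_results_immediately
                         else locked_folder_id)
    return {
        "submitterUploadSynId": submission_folder,
        "adminUploadSynId": locked_folder_id,
    }
```

If this model is right, both ids are identical whenever the flag is false, which would match the Log Folder and Admin Folder columns showing the same value.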

Simon-Harris-IBM commented 4 years ago

@thomasyu888 I got the shared file synapse id just by looking at the files tab and scrolling to the bottom and checking the creation time :-)

Simon-Harris-IBM commented 4 years ago

Hi @thomasyu888 - is there any update from Bruce on this issue ? Thanks -- Simon

thomasyu888 commented 4 years ago

@Simon-Harris-IBM , I filed an issue and have an email that is scheduled to send out tomorrow in the AM.

Simon-Harris-IBM commented 4 years ago

@thomasyu888 don't mean to push, but is there any update? This is one of the items on MIT's hit list and I'd like to get it ticked off if at all possible. Thanks -- Simon

thomasyu888 commented 4 years ago

@Simon-Harris-IBM, Bruce saw my email and responded. Not sure of timeline of implementation, but I will let you know as soon as it's done.

Simon-Harris-IBM commented 4 years ago

@thomasyu888 After applying PR #37 and setting SHARE_RESULTS_IMMEDIATELY=false on ALL the submission queues (fanout & backend) this is now the placement of output & log files:

So, we just need to copy the _log.txt and results.json files from the LOCKED folder to the UNLOCKED user folder.
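As a sketch of that copy step, the selection of files could look like the Python below. It only picks names out of a listing of the LOCKED folder; the actual copy into the UNLOCKED folder is left out, and `files_to_share` and the example file names are hypothetical.

```python
# Name suffixes of the files participants are allowed to see.
SHAREABLE_SUFFIXES = ("_log.txt", "results.json")


def files_to_share(locked_files):
    """Return the names in the LOCKED folder that should be copied to the
    UNLOCKED user folder, keeping the original order."""
    return [name for name in locked_files
            if name.endswith(SHAREABLE_SUFFIXES)]


locked = ["9702407_logs.zip", "9702407_log.txt",
          "predictions.csv", "results.json"]
print(files_to_share(locked))
```

The toil archive ("_logs.zip") and predictions.csv are not selected, so they stay in the LOCKED folder.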

thomasyu888 commented 4 years ago

Change the inputs in the internal workflow for the upload synid to the tool that gets the folder id from the main queue.

thomasyu888 commented 4 years ago

@Simon-Harris-IBM,

Can you give me a submission id from a main submission queue and its matching internal queue so I can debug what is going on?

Simon-Harris-IBM commented 4 years ago

@thomasyu888 I'll get this to you tomorrow Thomas. Synapse is currently under maintenance and not showing the latest submissions in the queue. I was testing a solution just as it went down, so if we're lucky it might be working ok. Will keep you posted. Thanks Simon

Simon-Harris-IBM commented 4 years ago

Hi @thomasyu888 . Please see submission 9702407 at: https://www.synapse.org/#!Synapse:syn21445381/wiki/601587 The job failed uploading the results.json file. The error is "Cannot find a node with id: 9702407" Logs are at: https://www.synapse.org/#!Synapse:syn21818640 Thanks -- Simon

thomasyu888 commented 4 years ago

@Simon-Harris-IBM, please try submitting now. Notice I updated the column header in the dashboard. The log folders associated with submissions in the internal queues are now irrelevant. It's sort of a pain because the submitter folder and admin folder appear the same in the main dashboard you linked above, but they aren't.

Simon-Harris-IBM commented 4 years ago

@thomasyu888 After the latest round of updates this is now working perfectly....

Closing this issue.