Closed Simon-Harris-IBM closed 4 years ago
Modified lines 110 and 130 of docker_agent_workflow.cwl so that parentid="#adminUploadSynId". This forces the stdout from running the docker image ("
The fact that we have a fanout mechanism complicates matters because the submitterid of the docker image to the main fanout queue is different to the submitterid to the main execution queue.
Let's focus initially on getting the files correctly placed/authorisations for the backend execution queue.
Ideally, we want "
@thomasyu888 Is this possible ?
Ah, yes, I totally forgot about this. It is possible. We would need a tool to obtain the lock and non-lock folders of the submission from the fanout queue. I will work on this sometime tomorrow.
Thanks Tom..... Having just talked with MIT, they also want to hide the toil log files in the 'express-lane'. All they want output is the stdout/stderr from the docker container.
Just to summarise, this is how we'd like to have the files placed:
This should be the same for both express and main queues. One potential issue is that 3. may create some security holes as participants will be able to submit an image to try and find/print the gold label file and then use this to cheat.
But we can discuss this internally. I think if we get the above layout working, we can fine tune it later.
Thanks....
Thanks for the summary. Here are my responses.
SHARE_RESULTS_IMMEDIATELY=false. Unfortunately, this removes the text to tell participants where their log files are.
fanout queue LOCK folder (not the LOCK folder generated by the main queues)
fanout queue folder
I will need to write a tool to get the participants fanout folder and folder-LOCK.
I set SHARE_RESULTS_IMMEDIATELY=false in the .env files but all output files are now in the _LOCKED folder. There is nothing in the unlocked folder.
My understanding is the following line should have uploaded the results.json file into the unlocked folder:
# Upload the validated scores so we have a record
upload_validation:
  run: https://raw.githubusercontent.com/Sage-Bionetworks/ChallengeWorkflowTemplates/v2.1/upload_to_synapse.cwl
  in:
    - id: infile
      source: "#validate_and_score/results"
    - id: parentid
      #source: "#adminUploadSynId"
      source: "#get_docker_submission/submitter_synid"
    - id: used_entity
      source: "#get_docker_submission/entity_id"
    - id: executed_entity
      source: "#workflowSynapseId"
    - id: synapse_config
      source: "#synapseConfig"
  out:
    - id: uploaded_fileid
    - id: uploaded_file_version
    - id: results
See https://www.synapse.org/#!Synapse:syn21750580 for an example.
@thomasyu888 any idea what might be wrong ?
@Simon-Harris-IBM Ah, after doing some investigation, I realized that by setting SHARE_RESULTS_IMMEDIATELY=false, the orgSagebionetworksSynapseWorkflowOrchestratorSubmissionFolder gets set to the _LOCK folder, which is why everything gets uploaded into there. For now, please set this back to SHARE_RESULTS_IMMEDIATELY=true.
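The behaviour described here can be pictured with a toy model. This is only a sketch of what was observed in this thread, not the orchestrator's actual code: the function name `submission_folder` and the folder ids are made up for illustration, and the `true` branch is an assumption.

```python
def submission_folder(share_results_immediately: bool,
                      unlocked_folder_id: str,
                      locked_folder_id: str) -> str:
    """Toy model of which folder id the workflow receives as its upload target.

    Based only on the observation reported in this thread, not on the
    orchestrator's source code.
    """
    if share_results_immediately:
        # Assumption: with sharing enabled, the participant-visible folder is used.
        return unlocked_folder_id
    # With SHARE_RESULTS_IMMEDIATELY=false the workflow only sees the locked
    # folder, so every upload that targets the submission folder lands there.
    return locked_folder_id


# Hypothetical ids, for illustration only.
print(submission_folder(False, "syn_unlocked", "syn_locked"))
```

Under this model, a step that uses the submitter folder as its parentid cannot reach the unlocked folder while sharing is disabled, which would match the empty unlocked folder reported earlier.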
Here are some solutions.
1. Unfortunately, Bruce is out of office, so he won't get to fix this. My proposed fix can be found here: https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator/issues/21.
2. We tell participants to just ignore the .zip folder.
3. This solution is a bit more cumbersome, but we can create separate folders completely separate from what the orchestrator creates. (I'm not a huge fan of this workflow, but if we are in a time crunch...)
a. Participant submits to main queue.
b. A CWL tool creates a folder inside of a submission_results folder we create in Synapse and grants access to only the team that submitted.
submission_results/
submission_results/submissionid/
submission_results/submissionid/results.json
submission_results/submissionid/submission_id.logs
submission_results/submissionid2/
...
c. We annotate the main queue submission with this folder id and pass this along to the other queues.
d. We set SHARE_RESULTS_IMMEDIATELY=false, then participants will know nothing about the file structure the orchestrator creates.
e. We would need to write a notification tool that emails participants with the location of their log file and set SUBMITTER_NOTIFICATION_MASK to not send the submission process started email. (We would need to do this for 1. as well)
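Steps b and c of the separate-folder option could be sketched roughly as follows. This is a minimal illustration, not an existing tool: `prepare_results_folder` and the three callables passed into it are hypothetical stand-ins for whatever Synapse calls the real CWL tool would make, and all ids are invented.

```python
from typing import Callable, List, Tuple


def prepare_results_folder(
    submission_id: str,
    team_id: str,
    results_root_id: str,
    create_folder: Callable[[str, str], str],
    grant_read_access: Callable[[str, str], None],
    annotate_submission: Callable[[str, str], None],
) -> str:
    """Create submission_results/<submission_id>/ and record its id.

    The callables are hypothetical placeholders, not a real Synapse API.
    """
    # Step b: one folder per submission, readable only by the submitting team.
    folder_id = create_folder(submission_id, results_root_id)
    grant_read_access(folder_id, team_id)
    # Step c: store the folder id on the main queue submission so the other
    # queues can upload results.json and the logs into it.
    annotate_submission(submission_id, folder_id)
    return folder_id


# In-memory fakes so the sketch runs without a Synapse connection.
calls: List[Tuple[str, str, str]] = []
folder = prepare_results_folder(
    "9700001", "team-1", "syn_root",
    create_folder=lambda name, parent: f"{parent}/{name}",
    grant_read_access=lambda f, t: calls.append(("grant", f, t)),
    annotate_submission=lambda s, f: calls.append(("annotate", s, f)),
)
print(folder)
print(calls)
```

Passing the Synapse operations in as callables keeps the ordering of the steps visible and lets the sketch be exercised with fakes; a real tool would replace them with authenticated calls.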
Thanks @thomasyu888 Is there no way to simply copy the 2 files from the _LOCKED folder into the shared folder ? The workflow orchestrator makes some mention of this here: https://github.com/Sage-Bionetworks/SynapseWorkflowOrchestrator#uploading-results
@Simon-Harris-IBM, so I myself am unsure what that external process looks like. Maybe try setting DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID to your Synapse user id (yours is 3400238).
Maybe that would set both annotations for the lock folder and the non_lock folder. Please try to run a submission with
SHARE_RESULTS_IMMEDIATELY=false
DATA_UNLOCK_SYNAPSE_PRINCIPAL_ID=3400238
Please let me know when the run is complete and I will check it out.
Tested a run using my simonh id:
So there doesn't appear to be any difference from just setting SHARE_RESULTS_IMMEDIATELY=false.
@Simon-Harris-IBM, that's what I suspected. I think this external process was actually done by crawling through the file structure to move files to the appropriate folder. I still think 1. is the best option, but Bruce is out until Wednesday.
How did you get the shared files synapse id?
I also think #1 will be the best solution. Let's wait until Bruce returns to see what he thinks.
I've annotated the dashboards with all the info from the results.json file. This will at least mean participants will not have to look for the results.json file to get their results - they can see them all right there in the dashboard. So the only file we really have to worry about copying over is the _logs.txt file..... See: https://www.synapse.org/#!Synapse:syn21445379/wiki/601551
@Simon-Harris-IBM Can you explain in more detail how you obtained the shared file's Synapse id? The handling of this id is an important part of this.
My guess is that the submitterUploadSynId is being replaced with the locked folder synapse id. Please see here: https://www.synapse.org/#!Synapse:syn21445381/wiki/601587. Notice how your last two runs in the main queue have the same annotation for Log Folder and Admin Folder?
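One way to spot the symptom described here is to compare the two folder annotations per submission. The sketch below assumes each submission is available as a plain dict carrying `submitterUploadSynId` and `adminUploadSynId` keys plus an `id`; that shape, the function name, and the sample ids are assumptions for illustration.

```python
from typing import Dict, Iterable, List


def collapsed_folder_submissions(submissions: Iterable[Dict[str, str]]) -> List[str]:
    """Return ids of submissions whose submitter and admin folders are identical."""
    flagged = []
    for sub in submissions:
        submitter = sub.get("submitterUploadSynId")
        admin = sub.get("adminUploadSynId")
        # Identical ids suggest the submitter folder was replaced by the locked one.
        if submitter is not None and submitter == admin:
            flagged.append(sub["id"])
    return flagged


# Hypothetical rows, for illustration only.
rows = [
    {"id": "1", "submitterUploadSynId": "synA", "adminUploadSynId": "synB"},
    {"id": "2", "submitterUploadSynId": "synC", "adminUploadSynId": "synC"},
]
print(collapsed_folder_submissions(rows))
```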
@thomasyu888 I got the shared file synapse id just by looking at the files tab and scrolling to the bottom and checking the creation time :-)
Hi @thomasyu888 - is there any update from Bruce on this issue ? Thanks -- Simon
@Simon-Harris-IBM , I filed an issue and have an email that is scheduled to send out tomorrow in the AM.
@thomasyu888 don't mean to push, but is there any update? This is one of the items on MIT's hit list and I'd like to get it ticked off if at all possible. Thanks -- Simon
@Simon-Harris-IBM, Bruce saw my email and responded. Not sure of timeline of implementation, but I will let you know as soon as it's done.
@thomasyu888 After applying PR #37 and setting SHARE_RESULTS_IMMEDIATELY=false on ALL the submission queues (fanout & backend) this is now the placement of output & log files:
So, we just need to copy the _log.txt and results.json files from the LOCKED folder to the UNLOCKED user folder.
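The copy step could be sketched roughly like this. It is an assumption-laden illustration rather than the fix that was eventually applied: `copy_shared_files` and the `copy_file` callable are hypothetical, the locked folder listing is passed in as (file id, file name) pairs, and both `_log.txt` and `_logs.txt` are accepted because both spellings appear in this thread.

```python
from typing import Callable, Iterable, List, Tuple

# Both spellings of the log file suffix are used in the discussion.
LOG_SUFFIXES = ("_log.txt", "_logs.txt")


def copy_shared_files(
    locked_files: Iterable[Tuple[str, str]],
    unlocked_folder_id: str,
    copy_file: Callable[[str, str], None],
) -> List[str]:
    """Copy only the log file and results.json; return the names copied.

    copy_file is a hypothetical placeholder for the real Synapse copy call.
    """
    copied = []
    for file_id, name in locked_files:
        if name == "results.json" or name.endswith(LOG_SUFFIXES):
            copy_file(file_id, unlocked_folder_id)
            copied.append(name)
    return copied


# Hypothetical listing of a LOCKED folder, for illustration only.
listing = [
    ("syn1", "9700001_log.txt"),
    ("syn2", "results.json"),
    ("syn3", "predictions.csv"),
    ("syn4", "9700001.zip"),
]
print(copy_shared_files(listing, "syn_unlocked", lambda file_id, dest: None))
```

Selecting files by an explicit allow-list keeps the predictions and the zipped toil logs in the LOCKED folder, which is the layout asked for earlier in the thread.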
Change the inputs in the internal workflow for the upload synid to the tool that gets the folder id from the main queue.
@Simon-Harris-IBM, can you give me a submission id from a main submission queue and its matching internal queue to debug what is going on?
@thomasyu888 I'll get this to you tomorrow Thomas. Synapse is currently under maintenance and not showing the latest submissions in the queue. I was testing a solution just as it went down, so if we're lucky it might be working ok. Will keep you posted. Thanks Simon
Hi @thomasyu888 . Please see submission 9702407 at: https://www.synapse.org/#!Synapse:syn21445381/wiki/601587 The job failed uploading the results.json file. The error is "Cannot find a node with id: 9702407" Logs are at: https://www.synapse.org/#!Synapse:syn21818640 Thanks -- Simon
@Simon-Harris-IBM, please try submitting now. Notice I updated the column header in the dashboard. The log folders associated with submissions in the internal queues are now irrelevant. It's sort of a pain because the submitter folder and admin folder appear the same in the main dashboard you linked above, but they aren't.
@thomasyu888 After the latest round of updates this is now working perfectly....
Closing this issue.
We need to have proper placement of all output files generated by a submission, which include the zipped log files, predictions.csv and results.json.
The files should be placed in such a way that: