galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Remove (cleanup) any partial data left behind from a failed Data Manager execution #8919

Open jennaj opened 5 years ago

jennaj commented 5 years ago

Migrated out of: https://github.com/galaxyproject/galaxy/issues/1471

Issue: Data Managers can make changes during the initial stages of processing and then, if they fail later, leave behind partial/incomplete data.

Example DMs with problems (note: as far as I know, all DMs can do this; it depends on when/how the tool errors). They can also create duplicated data entries based on the same dbkey, which causes tools that use those indexes to fail and is very hard/manual to fix. See below.

  1. Creates loc files first, but if it fails while building the data (for any reason, usually some kind of technical/resource issue) it can leave partial data around, not just the loc. Or at least it did the last two times I ran it (a loc conflict the first time, a Python incompatibility the second). I have records/histories that I can share with admins. https://toolshed.g2.bx.psu.edu/view/iuc/data_manager_gemini_database_downloader/f57426daa04d
  2. Creates loc files first, then can fail when fetching/indexing data: https://toolshed.g2.bx.psu.edu/view/devteam/data_manager_fetch_genome_dbkeys_all_fasta/14eb0fc65c62. I also have records/histories for these.
  3. I have more examples, but they cannot be shared publicly.

Potential solutions:

  1. Isolate the entire job into a working/staging directory and only publish it to permanent tables/data storage destinations if the tool is fully successful.
    • This would involve some sanity checks to make sure the final publish "destinations" won't produce any errors (e.g. a loc entry that already exists, or not enough storage on disk to hold the data). A sketch of this staging pattern follows the list.
  2. Or, keep track of what is done at each step and undo it if the tool fails. At a minimum, produce a diff and, importantly, a running log, in case the job cannot "save" itself gracefully when quitting out on an error. See next.
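
For illustration, here is a minimal sketch of what option 1 ("stage first, publish only on success") could look like inside a Data Manager wrapper. This is not Galaxy's actual data manager framework code: the loc-file layout (tab-separated with the dbkey in the first column), the directory layout, and the stand-in indexing step are all assumptions.

```python
import os
import shutil
import tempfile


def dbkey_in_loc(loc_path, dbkey):
    """True if dbkey already appears in the first column of the loc file."""
    if not os.path.exists(loc_path):
        return False
    with open(loc_path) as loc:
        return any(
            line.split("\t")[0] == dbkey
            for line in loc
            if line.strip() and not line.startswith("#")
        )


def build_and_publish(dbkey, fasta_path, data_root, loc_path):
    staging = tempfile.mkdtemp(prefix=f"dm_{dbkey}_")
    try:
        # 1. Do all of the fragile work (download, indexing) inside the staging directory.
        staged_index = os.path.join(staging, f"{dbkey}.idx")
        with open(staged_index, "w") as out:  # stand-in for the real download/index step
            out.write(f"index built from {fasta_path}\n")

        # 2. Sanity-check the publish destination before touching anything permanent.
        final_dir = os.path.join(data_root, dbkey)
        if os.path.exists(final_dir):
            raise RuntimeError(f"{final_dir} already exists; refusing to overwrite")
        if dbkey_in_loc(loc_path, dbkey):
            raise RuntimeError(f"dbkey {dbkey} is already present in {loc_path}")
        if shutil.disk_usage(data_root).free < os.path.getsize(staged_index):
            raise RuntimeError("not enough free space at the publish destination")

        # 3. Publish: move the data first and append the loc entry last, so a crash
        #    in between never leaves a loc line pointing at missing data.
        os.makedirs(final_dir)
        final_index = os.path.join(final_dir, f"{dbkey}.idx")
        shutil.move(staged_index, final_index)
        with open(loc_path, "a") as loc:
            loc.write(f"{dbkey}\t{dbkey}\t{final_index}\n")
    finally:
        # 4. Success or failure, the staging area is discarded; on failure nothing
        #    permanent was written, so there is nothing to clean up or undo.
        shutil.rmtree(staging, ignore_errors=True)
```

The key property is that every fragile step happens inside the staging directory, and the loc entry is appended only after the data has landed at its final path, so a failure at any point leaves nothing behind except the discarded staging directory.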

What would help the most:

  1. Some kind of "diff" showing exactly what was changed per DM run (a sketch of this follows the list).
    • We used to do this for ANY data change before moving it from a staging area into CVMFS. There was a "dry run" function for the publication step (it produced a diff) -- so if you created indexes and wanted to move them to CVMFS (or anywhere else: staging to final location), you could check exactly what would be updated before committing the changes. This caught a lot of tiny issues/conflicts that, if committed directly, would have required manual cleanup.
    • Duplicated dbkeys are one item that is a huge "gotcha" for users -- they tend to need to completely start over with a fresh instance! Tools fail when there are duplicated dbkeys in any loc file. All of this could work in a local/cloud/Docker instance as well -- give the admin a choice to review "diffs" before committing the changes. Or better, do not allow any DM to create a duplicated dbkey for any index in the first place (check first), and allow an admin to override that (replace or do not replace -- anything but creating a duplicated table entry hanging off the same primary key, the dbkey).
  2. Generate a log that keeps track of what happens during the processing steps (a sketch of this also follows the list).
    • Ideally, this would be a log broken out into a distinct dataset output to the history, paired with the DM job. Please include the tool name + dbkey (when available) in both default dataset names...
    • DM name, version, a starting timestamp
    • Each step (what it is doing) and specifics (file paths, content)
    • Success/fail and if fail, the last step run with error details
    • End timestamp
    • Right now the processing is "blind", at least in the GUI. The information is probably in server logs somewhere, but it is not broken out or attached to the DM job itself. The job either succeeds or fails, and stderr/stdout are not usually very helpful -- there is a lot of guesswork.
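
As a rough illustration of the "dry run"/diff and duplicated-dbkey ideas in item 1, here is a small sketch. The tab-separated loc layout with the dbkey in the first column is an assumption; real data tables have tool-specific columns.

```python
import difflib


def load_loc(loc_path):
    """Read the current loc file, or return an empty list if it does not exist yet."""
    try:
        with open(loc_path) as fh:
            return fh.readlines()
    except FileNotFoundError:
        return []


def dry_run_loc_update(loc_path, new_entries, allow_replace=False):
    """Return a unified diff of how the loc file would look after adding new_entries.

    new_entries is a list of column tuples, dbkey first. Duplicated dbkeys are
    rejected unless the admin explicitly opts into replacing the existing line.
    """
    current = load_loc(loc_path)
    existing = {line.split("\t")[0] for line in current if line.strip() and not line.startswith("#")}
    proposed = list(current)
    for dbkey, *rest in new_entries:
        if dbkey in existing:
            if not allow_replace:
                raise ValueError(f"dbkey {dbkey!r} is already in {loc_path}; refusing to duplicate it")
            proposed = [line for line in proposed if line.split("\t")[0] != dbkey]
        proposed.append("\t".join([dbkey, *rest]) + "\n")
    return "".join(difflib.unified_diff(current, proposed, fromfile=loc_path, tofile=loc_path + " (proposed)"))


# Show the admin exactly what would change before anything is written to disk.
print(dry_run_loc_update("all_fasta.loc", [("hg38", "hg38", "Human (hg38)", "/data/hg38/hg38.fa")]))
```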
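
And a sketch of the per-run log proposed in item 2: a separate, human-readable output recording the DM name/version, dbkey, each step with its details, and start/end timestamps, plus the error on failure. The class, file names, and format are hypothetical, not an existing Galaxy feature.

```python
import sys
import traceback
from datetime import datetime, timezone


class DataManagerLog:
    """Write a simple step-by-step log alongside a Data Manager run."""

    def __init__(self, path, dm_name, dm_version, dbkey=None):
        self.fh = open(path, "w")
        self._write(f"data manager: {dm_name} {dm_version}")
        if dbkey:
            self._write(f"dbkey: {dbkey}")
        self._write(f"started: {datetime.now(timezone.utc).isoformat()}")

    def _write(self, msg):
        self.fh.write(msg + "\n")
        self.fh.flush()  # keep the log readable even if the job dies mid-step

    def step(self, description, **details):
        self._write(f"STEP: {description}")
        for key, value in details.items():
            self._write(f"  {key}: {value}")

    def finish(self, ok=True, error=None):
        self._write(f"result: {'success' if ok else 'FAILED'}")
        if error:
            self._write(f"error: {error}")
        self._write(f"finished: {datetime.now(timezone.utc).isoformat()}")
        self.fh.close()


# Hypothetical usage inside a data manager wrapper:
log = DataManagerLog("run_log.txt", "data_manager_fetch_genome_dbkeys_all_fasta", "x.y.z", dbkey="hg38")
try:
    log.step("download FASTA", url="https://example.org/hg38.fa.gz", target="staging/hg38.fa.gz")
    log.step("write loc entry", loc="all_fasta.loc", dbkey="hg38")
    log.finish(ok=True)
except Exception:
    log.finish(ok=False, error=traceback.format_exc())
    sys.exit(1)  # non-zero exit so Galaxy marks the job as failed
```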

cc @mvdbeek @natefoo

mvdbeek commented 4 years ago

Can you confirm that for

2. Creates loc files first, then can fail when fetching/indexing data: https://toolshed.g2.bx.psu.edu/view/devteam/data_manager_fetch_genome_dbkeys_all_fasta/14eb0fc65c62. I also have records/histories for these.

the job in the history panel is green? I see that this script can never actually fail, so this isn't really a Galaxy issue for this one, but we do have to update the script.

For gemini this is a little trickier: I'm not familiar with it and don't know if it actually returns a proper exit code. If the output in the history is green, it isn't a Galaxy issue either.

I did dig into the code that moves the data and updates the data tables ... if the manager properly fails, I don't see how this can run, because

  1. Isolate the entire job into a working/staging directory and only publish it to permanent tables/data storage destinations if the tool is fully successful.

That is what we do, but if the managers don't signal a problem (a red dataset in the history), we'll move the output.
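
For illustration, here is a sketch of the kind of failure propagation a Data Manager script needs so that Galaxy actually sees the problem. This is not the real fetch_genome_dbkeys_all_fasta code; the commands and paths are placeholders. The point is that every external step is checked and the script exits non-zero on any error, which turns the history dataset red and keeps the output from being moved.

```python
import subprocess
import sys


def fetch_and_index(fasta_url, workdir):
    # check=True raises CalledProcessError on any non-zero return code instead of
    # silently carrying on with partial data.
    subprocess.run(["wget", "-O", f"{workdir}/genome.fa.gz", fasta_url], check=True)
    subprocess.run(["gunzip", "-f", f"{workdir}/genome.fa.gz"], check=True)
    subprocess.run(["samtools", "faidx", f"{workdir}/genome.fa"], check=True)


if __name__ == "__main__":
    try:
        fetch_and_index(sys.argv[1], sys.argv[2])
    except Exception as exc:
        # Report the reason on stderr and exit non-zero: the history dataset turns
        # red and Galaxy does not publish the partial output.
        print(f"data manager failed: {exc}", file=sys.stderr)
        sys.exit(1)
```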

jennaj commented 4 years ago

Can you confirm that for

  2. Creates loc files first, then can fail when fetching/indexing data: https://toolshed.g2.bx.psu.edu/view/devteam/data_manager_fetch_genome_dbkeys_all_fasta/14eb0fc65c62. I also have records/histories for these.

the job in the history panel is green? I see that this script can never actually fail, so this isn't really a Galaxy issue for this one, but we do have to update the script.

Gemini is the last DM that I ran that did this. I'll share the history with you directly. Red dataset, partial locs leftover.