behavior surprise when gdc_dice follows multiple gdc_mirror calls

gsaksena commented 7 years ago

If you run gdc_mirror selecting for sample A, then run it again selecting for sample B, and finally run gdc_dice, then only sample B will get diced. This behavior is unexpected. There may be a related issue with gdc_loadfile.

At the minimum, the documentation should be fixed to make it clear about the scope of what will be diced. In addition, it may or may not make sense to change the behavior to make it easier to use.

gsaksena commented 7 years ago

An issue to consider if changing the behavior is coherency in cases when the data was not all downloaded at once. In FH this was achieved by always making loadfiles based off of the latest sdrf, which in turn pointed to the exact versions of the archives the diced files came from (in most cases the latest version). In the current version of gdctools, coherency is achieved by requiring the files in the loadfile to have been downloaded by the most recent download (which may have resulted in a cache hit and so actually downloaded earlier).

Perhaps dice could check that the scope of files it is being asked to dice in the current run are all up to date, before running. Similarly, perhaps gdc_loadfile could check that 1) the mirrored input files are up to date, and 2) that the diced files are dated later than the input files. Like a makefile. Both gdc_dice and gdc_loadfile could be forced to skip these checks via -f. If no scope is given to gdc_dice or gdc_loadfile, implicit is the accumulation of everything that was mirrorred; if a narrower scope is given, then just that part is operated on; if a wider scope is given, an error is thrown.

gsaksena commented 7 years ago

Code editing plans: 1) shunt everything to a single common date stamp

unconditionally set date stamp to 'ds' 2) change file writing to be control-c safe: write to tmp file first then move
metadata files - json file for mirror, tsv files for dice
data files - mirror files, diced files 3) clean up double-caching issue for diced files (use only file existence and not also metadata existence to determine caching) 4) clean up date stamp detritus - remove ds from parameter list, file paths

gsaksena commented 7 years ago

David and I discussed this. While the plan above seems fine for the independent analyst use case, it breaks important GDAC use cases related to redaction handling: 1) If a file was redacted since the initial mirroring, we do not want it to be included in downstream dicings or loadfiles. 2) We want to preserve past metadata files, both to diff to see changes as well as to use as a freeze list. The existing behavior supports the GDAC redaction handling use cases fine today.

In order for an analyst to get the behavior they want, they need to specify the universe of data they are interested in when they do each mirroring, appending to the commandline or config file each time they add more, and trusting the caching to work so it does not take forever. This is somewhat inconvenient for interactive use, and does not allow certain cohorts to pull fewer data types than others.

I propose the following changes to my previous plans:

1) Instead of unconditionally setting the datestamp to a constant, set it to a constant by default (or explicitly via --date pooled), but have the GDAC use a commandline switch (--date latest) to revert to the old behavior (ie using an actual date). For the analyst, the effect will be appending new data to old to form one big pool; for the GDAC, the metadata will be from a GDC query at a single point in time.

4) Omit this - leave the datestamp in the file paths etc.

Making the analyst use case the default would be a behavior change. My take is that a larger number of users would prefer the simpler model of thinking about the data as a single pool. But, it sounds like this would benefit from Mike's input.

gsaksena commented 7 years ago

modifying branch gsaksena_mirror_mirror_dice

gsaksena commented 7 years ago

Updated gsaksena_mirror_mirror_dice branch with (untested) code implementing my proposal.

The existing behavior for explicit dates and 'latest' are unchanged. The 'new' option for the date code parameter has been added, to attain the current behavior one gets when not specifying a date code at all. The 'pooled' option for the date code has also been added, giving behavior more favorable for interactive command-line analysts. I've made the 'pooled' option the default, so GDAC code would need to be updated to use the 'new' option when mirroring and continue to use 'latest' for dicing and loadfile generation.

We should discuss appropriate tests to add.

In addition to the proposal, I made some changes to when the mirroring metadata is written to disk, which should make the behavior more predictable in the case of download failures or control c. Now, the metadata is flushed to disk after each category is mirrorred, not just after the whole project finishes being mirrored. But, the metadata is not recorded if one or more files in the category failed to mirror. In this way, presence of mirroring metadata indicates a category is ready to dice.

noblem commented 7 years ago

Guys, while this discussion is useful I think what would help even more is drawing a simple state machine diagram. It would not have too many states, but would clarify immensely what we're trying to capture (incompletely, it seems) in prose.

gsaksena commented 7 years ago

I've drafted a sort of flow chart describing how the metadata files are handled... The changes I am proposing are indicated by **'s for new and strikeouts for old.

https://docs.google.com/a/broadinstitute.org/document/d/16Q63btK7UITLoiUn-e9rOXIl3-ISeFhDPnW_KHXVXRY/edit?usp=sharing

noblem commented 7 years ago

A good start, but still just text. As noted earlier, it would be useful to clarify the text into a true state diagram

noblem commented 7 years ago

See #53 for a related issue

broadinstitute / gdctools

behavior surprise when gdc_dice follows multiple gdc_mirror calls #47