broadinstitute / gdctools

Python and UNIX CLI utilities to simplify interaction with the NIH/NCI Genomics Data Commons

Nothing mirrored, but dicing still kicked off #9

Closed noblem closed 7 years ago

noblem commented 7 years ago

Notice that for the last few mirroring attempts (done via cron)

    % grep " new " /xchip/gdac_data/gdc/logs/mirror/gdcMirror.2016_11_0[789]*.log

there were 0 new files downloaded. Despite this, the subsequent dicing attempt, per the lines

  bin/gdc_mirror    $Config
  bin/gdc_dice      $Config

in the new gdac_ingestor wrapper, and the corresponding logs at

    % wc -l /xchip/gdac_data/gdc/logs/dice/gdcDice.2016_11_0[789]*.log
      135828 /xchip/gdac_data/gdc/logs/dice/gdcDice.2016_11_07__01_10_11.log
      135828 /xchip/gdac_data/gdc/logs/dice/gdcDice.2016_11_08__01_09_42.log
      135828 /xchip/gdac_data/gdc/logs/dice/gdcDice.2016_11_09__01_08_16.log

all show that the dicer is attempting to dice stuff ... AND ... there are corresponding entries in the metadata subdirs for each of these dates where "no new data" was mirrored.

So, I suppose that in our testing so far we just avoided issuing a dice run when we saw that no new data was downloaded. But we should be determining from the mirror whether that is the case, and not attempting to dice. I thought we were already doing this, but maybe that kind of avoidance is only happening in the mirror?
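For concreteness, here is a minimal sketch (in Python, since gdctools is a Python package) of how the ingestor wrapper might skip the dice step when the most recent mirror log reports zero new files. The log directory and the "N new ..." line format are assumptions taken from the grep above, and the helper names are hypothetical; this is not the actual gdctools implementation.

    #!/usr/bin/env python
    # Hypothetical wrapper logic: mirror first, then dice only if new files arrived.
    # The mirror-log location and the "<N> new ..." line format are assumptions
    # based on the grep in this thread, not the documented gdctools log format.
    import glob
    import re
    import subprocess
    import sys

    MIRROR_LOG_DIR = "/xchip/gdac_data/gdc/logs/mirror"

    def new_files_reported(log_path):
        """Sum the counts from any '<N> new' lines in a mirror log."""
        total = 0
        with open(log_path) as log:
            for line in log:
                match = re.search(r"(\d+) new ", line)
                if match:
                    total += int(match.group(1))
        return total

    def main(config):
        subprocess.check_call(["bin/gdc_mirror", config])
        # Timestamped log names sort lexicographically, so max() picks the latest.
        latest = max(glob.glob(MIRROR_LOG_DIR + "/gdcMirror.*.log"))
        if new_files_reported(latest) == 0:
            print("Mirror reported 0 new files; skipping gdc_dice.")
            return
        subprocess.check_call(["bin/gdc_dice", config])

    if __name__ == "__main__":
        main(sys.argv[1])

Keeping a check like this in the wrapper, rather than in gdc_dice itself, would also preserve the ability to force a dice run outside the ingestor.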

noblem commented 7 years ago

Another data point: the same thing is observed if one repeatedly runs "make test": no new files will be mirrored, but dicing and loadfile generation will proceed, with a newer, unique version stamp that is only a minute or two later than the previous "make test".

It seems useful to make the default be that no new data means no new dicing, at least for the ingestor. Ditto for loadfile generation and sample reports: within the ingestor proper, why do vacuous work? But outside of the ingestor, it seems reasonable that others would want to generate their own loadfiles or reports, etc., especially for AWG work, making new aggregate cohorts, and so on. Thoughts?

tmdefreitas commented 7 years ago

A couple of things to add to the discussion, since there are some reasons the dicer appears to be a bit dumber than it has to be.

Every time the mirror is run, we make API calls to retrieve metadata about each file. The metadata is not guaranteed to be the same between runs of the mirror even if the set of files is the same (imagine new tags, annotations, etc.), so we save the results of the API calls each time (though we could check and only save new metadata if there are differences**).

The dicer works from this metadata and chooses the latest version available, even if the values are the same as in an old run. Importantly, the dicer doesn't assume that the previous dicing was accurate, so it checks that each file it expects to have diced does in fact exist, and for any file that doesn't, it performs the appropriate dicing action. A dicing run that dices no new files therefore still checks the integrity of the entire diced folder structure, even if it makes no changes. I think there is some value in that check, but I admit the value diminishes as our dicing annotations and process stabilize.

** Though it is nice that saving the metadata for each mirror serves as a record of what the metadata was on a particular date, and shows when the last mirror was performed without having to search the logs.

Does that make sense?
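To make the incremental behavior described above concrete, here is a minimal sketch: walk the expected diced outputs and re-dice only what is missing. The metadata field names, path layout, and the dice_file callable are hypothetical stand-ins, not the actual gdctools internals.

    import os

    def expected_diced_path(entry, diced_root):
        # Hypothetical layout; the real diced tree in gdctools may differ.
        return os.path.join(diced_root, entry["program"], entry["project"],
                            entry["data_category"], entry["file_name"])

    def dice_incremental(metadata, diced_root, dice_file):
        """Verify every expected diced file exists; re-dice only the missing ones."""
        missing = [entry for entry in metadata
                   if not os.path.exists(expected_diced_path(entry, diced_root))]
        # Even when nothing is missing, the scan above has validated the
        # integrity of the entire diced folder structure.
        for entry in missing:
            dice_file(entry, diced_root)
        return len(missing)

A wrapper like the ingestor could then skip loadfile generation and sample reports whenever a run like this reports zero missing files.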