Download-only option - Githubissues

ECP-CANDLE / Benchmarks

MIT License

Download-only option #32

Open j-woz opened 5 years ago

j-woz commented 5 years ago

Allow the user to invoke a Benchmark in download-only mode, which simply downloads the input data if it does not already exist. This is necessary on supercomputers. This mode should not import keras or any other modules not required for data download.

jmohdyusof commented 5 years ago

See, for example, p3b1. The line fpath = fetch_data(gParameters) is basically what you want to run separately from the 'run' command. We could also modify fetch_file to accept a different base Data directory location, which would address your other issue.

def fetch_data(gParameters):
    """ Downloads and decompresses the data if not locally available.
        Since the training data depends on the model definition it is not loaded,
        instead the local path where the raw data resides is returned
    """

    path = gParameters['data_url']
    fpath = candle.fetch_file(path + gParameters['train_data'], 'Pilot3', untar=True)

    return fpath
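To make the "different base Data directory" idea concrete, here is a minimal sketch of a download-if-missing helper with a configurable base directory. It is not candle.fetch_file: the name fetch_to_dir, its parameters, and the ".part" temporary suffix are assumptions made up for illustration, and it uses only the Python standard library (no untar step).

```python
import os
import posixpath
import shutil
import urllib.parse
import urllib.request


def fetch_to_dir(url, subdir, base_dir="Data"):
    """Download `url` into base_dir/subdir unless the file is already there.

    Hypothetical helper, NOT candle.fetch_file. It only illustrates the
    "skip if present, otherwise download" behaviour with a base directory
    that the caller can override.
    """
    target_dir = os.path.join(base_dir, subdir)
    os.makedirs(target_dir, exist_ok=True)

    # Use the last component of the URL path as the local file name.
    fname = posixpath.basename(urllib.parse.urlparse(url).path)
    fpath = os.path.join(target_dir, fname)
    if os.path.exists(fpath):
        return fpath  # already present, nothing to download

    # Write to a temporary name first so an interrupted transfer does not
    # leave a partial file under the final name.
    tmp_path = fpath + ".part"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, fpath)
    return fpath
```

A call such as fetch_to_dir(url, 'Pilot3', base_dir='/scratch/candle/') would then place the file under /scratch/candle/Pilot3 and return immediately on later calls.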

j-woz commented 5 years ago

That sounds good.

jmohdyusof commented 5 years ago

So probably a command like this should work for both tickets:

python benchmark --dl_only --basedir='/scratch/candle/'

j-woz commented 5 years ago

They will read that as Deep Learn only :). How about --data-dir? Will that be a standard flag for all Benchmark invocations? The default would be the current behavior (data directory == Benchmarks/Data).

jmohdyusof commented 5 years ago

Whatever we choose for keywords we can make part of the standard parser, so just decide on ones that don't conflict with other standard (keras/neon/etc) keywords.

--data_dir is fine (we currently use underscores, not dashes, to separate words).

Is --get_data_only clear enough without being too long?

j-woz commented 5 years ago

Yes, those are fine.
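For reference, the two flags agreed on above could be declared roughly like this with the standard argparse module. This is only a sketch: build_parser is a hypothetical name, the help strings are invented, and the real CANDLE standard parser may be organized differently. The default for --data_dir mirrors the current behavior mentioned in the thread.

```python
import argparse


def build_parser():
    """Return a parser with the two proposed flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Benchmark options (sketch)")
    parser.add_argument(
        "--data_dir",
        default="Benchmarks/Data",  # current behavior, per the thread
        help="base directory where input data is stored or downloaded",
    )
    parser.add_argument(
        "--get_data_only",
        action="store_true",  # False unless the flag is given
        help="download the input data if it is missing, then exit",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.data_dir, args.get_data_only)
```

With this layout, python benchmark --get_data_only --data_dir='/scratch/candle/' sets both options, and omitting them keeps today's defaults.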

jmohdyusof commented 5 years ago

How strict is the 'don't import Keras' restriction? We need to be able to read the default_model file to get data locations, as well as import the command line parser, so this implies some sort of split between the initialize_parameters stage, the data load, and the actual run. I think it makes sense to make initialize_parameters a standalone function as well.
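One way the three-stage split could look is sketched below, with Keras imported only inside the run stage so that download-only mode never touches it. Everything here is an assumption for illustration: the DEFAULTS dict stands in for the default_model file, fetch_data is a placeholder that does no real download, and main is a hypothetical entry point.

```python
import argparse

# Stand-in for values that would come from the default_model file.
DEFAULTS = {"data_dir": "Benchmarks/Data"}


def initialize_parameters(argv=None):
    """Stage 1: merge defaults with the command line. No framework imports."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", default=None)
    parser.add_argument("--get_data_only", action="store_true")
    args = parser.parse_args(argv)

    params = dict(DEFAULTS)
    # Command-line values override defaults only when actually supplied.
    params.update({k: v for k, v in vars(args).items() if v is not None})
    return params


def fetch_data(params):
    """Stage 2: placeholder for the download step; returns the data path."""
    return params["data_dir"]


def run(params, fpath):
    """Stage 3: model training. Keras is imported here and nowhere else."""
    import keras  # deferred import, not reached in download-only mode

    # Model construction and training would go here.
    return fpath


def main(argv=None):
    params = initialize_parameters(argv)
    fpath = fetch_data(params)
    if params["get_data_only"]:
        return fpath  # stop before anything imports Keras
    return run(params, fpath)
```

Because the import sits inside run, a download-only invocation works even on a node where Keras is not installed.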