MeteoSwiss / ldcast

Latent diffusion for generative precipitation nowcasting
Apache License 2.0

Can't cache sampler_nowcaster_test #10

Open caglarkucuk opened 1 year ago

caglarkucuk commented 1 year ago

Context

After successfully running `python forecast_demo.py` and `python train_autoenc.py --model_dir="../models/autoenc_train"`, I couldn't get `python train_genforecast.py --model_dir="../models/genforecast_train"` to run because of problems caching the samplers for the test and training sets.

When running `python train_genforecast.py --model_dir="../models/genforecast_train"`:

Expected behaviour

  1. Creates the sampler files for the valid, test, and train datasets and saves them to the cache directory
  2. Trains the forecaster

Actual behaviour

  1. Creates the file ../cache/sampler_nowcaster_valid.pkl
  2. Throws an error while creating the next sampler (complete error message pasted below):
    ~/tmp/0606/ldcast/scripts$ python train_genforecast.py --model_dir="../models/genforecast_train"
    Loading data...
    /home/kucuk/tmp/0606/ldcast/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
      log_scale = np.log10(scale).astype(np.float32)
    Loading cached sampler from ../cache/sampler_nowcaster_valid.pkl.
    No cached sampler found, creating a new one...
    Traceback (most recent call last):
      File "train_genforecast.py", line 129, in <module>
        Fire(main)
      File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
        component_trace = _Fire(component, args, parsed_flag_args, context, name)
      File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
        component, remaining_args = _CallAndUpdateTrace(
      File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
        component = fn(*varargs, **kwargs)
      File "train_genforecast.py", line 125, in main
        train(**config)
      File "train_genforecast.py", line 94, in train
        datamodule = setup_data(
      File "/home/kucuk/tmp/0606/ldcast/scripts/train_nowcaster.py", line 124, in setup_data
        datamodule = split.DataModule(
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 127, in __init__
        self.batch_gen = {
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 128, in <dictcomp>
        split: batch.BatchGenerator(
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/batch.py", line 81, in __init__
        self.sampler = EqualFrequencySampler(
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 30, in __init__
        self.starting_ind = [
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 31, in <listcomp>
        starting_indices_for_centers(
      File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 210, in starting_indices_for_centers
        starting_ind = np.concatenate(
      File "<__array_function__ internals>", line 180, in concatenate
    ValueError: need at least one array to concatenate
  3. This looks like an issue with the indexing of patches in the sampler, though I'm not sure.
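
For reference, the final `ValueError` is the message NumPy gives whenever `np.concatenate` receives an empty sequence, so the list built inside `starting_indices_for_centers` was presumably empty. A minimal sketch that reproduces the message in isolation; the helper `concat_or_empty` is hypothetical and not part of ldcast, and it only illustrates where a guard could sit:

```python
import numpy as np


def concat_or_empty(arrays, dtype=np.int64):
    """Concatenate arrays; return an empty 1-D array if none were given.

    Hypothetical helper for illustration only, not part of ldcast.
    """
    arrays = list(arrays)
    if not arrays:
        return np.empty((0,), dtype=dtype)
    return np.concatenate(arrays)


# Reproduce the error from the traceback with an empty input list.
try:
    np.concatenate([])
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

An empty list at this point would suggest that no valid patches were found in the input data, so a guard like this would hide the symptom rather than fix the cause.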

Additional information

37 directories, 149 files



Could it be something related to version compatibility of packages, e.g., dask or numba? Perhaps I'm missing something in the `data` directory.
@jleinonen please let me know how I can provide further information, and thanks in advance!
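
One way to collect package versions for a report like this is to query the installed distributions directly. A minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the function name and the package list are assumptions based on the libraries named in this thread:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package list is an assumption; extend it as needed.
    for name, version in installed_versions(["numpy", "dask", "numba"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output next to the traceback makes it easier to compare two environments.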

jleinonen commented 1 year ago

Hi @caglarkucuk, you say above that you ran

python train_autoenc.py --model_dir="../models/autoenc_train"

successfully and then that the same command failed. Did you mean to have a different command on the second line?

caglarkucuk commented 1 year ago

Sorry, I pasted the wrong command while creating the issue. I've edited the original post; apologies for the confusion.

jleinonen commented 1 year ago

This is a very strange bug. The LDM training uses the same sampler code as the autoencoder training, so I don't understand why the latter would work but not the former. Could you paste the layout of your data directory and a longer traceback of the error? Also, maybe remove the `sampler_nowcaster_*.pkl` files from the cache directory and run it again to see whether the error recurs?
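
The two requests above (listing the data directory and clearing the cached samplers) can be sketched as follows. The function names are hypothetical. `../cache` appears in the log above, while `../data` is only a guess at where the `data` directory sits relative to `scripts/`:

```python
from pathlib import Path


def list_layout(root):
    """Return the relative paths of all entries under root, sorted."""
    root = Path(root)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*"))


def remove_cached_samplers(cache_dir, pattern="sampler_nowcaster_*.pkl", dry_run=True):
    """Find cached sampler files matching pattern and return their paths.

    Files are deleted only when dry_run is False.
    """
    matches = sorted(Path(cache_dir).glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Both paths are assumptions; adjust them to the local checkout.
    for entry in list_layout("../data"):
        print(entry)
    # Dry run first: prints what would be removed without deleting anything.
    print(remove_cached_samplers("../cache", dry_run=True))
```

Running with `dry_run=True` first shows which files match before anything is deleted.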

wangjn2018 commented 1 year ago

Hi @jleinonen, I hit the same error that @caglarkucuk reported when I ran `python train_autoenc.py --model_dir="../models/autoenc_train"` in the scripts/ directory. I have downloaded the data you provided and put it in the data/ directory. No sampler files were produced in the cache/ directory. Thanks in advance!

caglarkucuk commented 1 year ago

Thanks @jleinonen for the quick response. I updated the original issue to provide further information, based on your suggestions.