AllenInstitute / deepinterpolation


steps_per_epoch': ['Unknown field.'] #82

Closed frikyng closed 2 years ago

frikyng commented 2 years ago

Hi,

I just installed deepinterpolation and want to run the test data set that you provide with the code. However, I get the error below when I run the example python script from the anaconda console. I checked the script for anything wrong that I could have introduced but didn't find anything, and I didn't make any changes to it anyway, since the paths that the documentation says to change manually now seem to be generated automatically. Can you help?

Thanks, Friedrich

(deepinterpolation) C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples>python cli_example_tiny_ephys_training.py
2022-01-31 12:31:14.849236: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-01-31 12:31:14.849335: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:root:train_path has been deprecated and is to be replaced by data_path as generators can be used for training and inference. We are forwarding the value but please update your code.
WARNING:root:pre_post_frame has been deprecated and is to be replaced by pre_frame and post_frame. We are forwarding the value but please update your code.
WARNING:root:train_path has been deprecated and is to be replaced by data_path as generators can be used for training and inference. We are forwarding the value but please update your code.
WARNING:root:pre_post_frame has been deprecated and is to be replaced by pre_frame and post_frame. We are forwarding the value but please update your code.
Traceback (most recent call last):
  File "cli_example_tiny_ephys_training.py", line 84, in <module>
    trainer = Training(input_data=args, args=[])
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\argschema\argschema_parser.py", line 175, in __init__
    result = self.load_schema_with_defaults(self.schema, args)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\argschema\argschema_parser.py", line 276, in load_schema_with_defaults
    result = utils.load(schema, args)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\argschema\utils.py", line 418, in load
    results = schema.load(d)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\marshmallow\schema.py", line 707, in load
    postprocess=True,
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\marshmallow\schema.py", line 867, in _do_load
    raise exc
marshmallow.exceptions.ValidationError: {'generator_params': {'steps_per_epoch': ['Unknown field.']}, 'test_generator_params': {'steps_per_epoch': ['Unknown field.']}}
aamster commented 2 years ago

Hi Friedrich, this argument got moved to training_params but the examples haven't been updated. Please delete the lines

generator_test_param["steps_per_epoch"] = -1

and

generator_param["steps_per_epoch"] = steps_per_epoch

and add a line

training_param['steps_per_epoch'] = steps_per_epoch
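
For reference, a minimal sketch of the relevant lines of the example script after the change (dictionary names follow cli_example_tiny_ephys_training.py; the other keys are omitted here and the value of steps_per_epoch is only illustrative):

steps_per_epoch = 10  # whatever value the example uses

generator_param = {}       # no "steps_per_epoch" key here anymore
generator_test_param = {}  # nor here

training_param = {}
training_param["steps_per_epoch"] = steps_per_epoch  # the argument now lives here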

Please see #79

aamster commented 2 years ago

Duplicate of #79

frikyng commented 2 years ago

Thanks for the info! I have changed the file and the training now works. However, I get another issue: the console displays a broken pipe error after the last batch has been processed. Can you tell what the problem is?

(deepinterpolation) C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples>python cli_example_tiny_ephys_training.py
2022-02-10 16:01:03.876726: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
WARNING:root:train_path has been deprecated and is to be replaced by data_path as generators can be used for training and inference. We are forwarding the value but please update your code.
WARNING:root:pre_post_frame has been deprecated and is to be replaced by pre_frame and post_frame. We are forwarding the value but please update your code.
WARNING:root:train_path has been deprecated and is to be replaced by data_path as generators can be used for training and inference. We are forwarding the value but please update your code.
WARNING:root:pre_post_frame has been deprecated and is to be replaced by pre_frame and post_frame. We are forwarding the value but please update your code.
INFO:Training:wrote C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples\2022_02_10_16_01_training_full_args.json
INFO:Training:wrote C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples\2022_02_10_16_01_training.json
INFO:Training:wrote C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples\2022_02_10_16_01_generator.json
INFO:Training:wrote C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples\2022_02_10_16_01_network.json
INFO:Training:wrote C:\Users\SunLab\Documents\FK\deepinterpolation\deepinterpolation_program_files\examples\2022_02_10_16_01_test_generator.json
2022-02-10 16:01:06.328756: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-10 16:01:06.329530: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2022-02-10 16:01:06.329596: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-02-10 16:01:06.331949: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-Q7NC8E2
2022-02-10 16:01:06.332069: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-Q7NC8E2
2022-02-10 16:01:06.332351: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-10 16:01:06.332768: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
INFO:Training:created objects for training
2022-02-10 16:01:06.605828: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/4
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
2022-02-10 16:01:07.472070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:09.792270: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:12.162427: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:14.485456: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:16.807172: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:19.089091: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:21.388160: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:23.691425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:26.074197: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:28.362816: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:30.641317: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:32.926921: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:35.216118: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:37.500432: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:39.790714: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:01:42.067839: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
10/10 [==============================] - ETA: 0s - loss: 0.5024WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
2022-02-10 16:03:00.258208: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:02.563265: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:04.848672: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:07.133003: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:09.428603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:11.731290: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:14.261508: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:17.216868: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2022-02-10 16:03:20.355943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
Exception in thread Thread-6:
Traceback (most recent call last):
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 748, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\site-packages\tensorflow\python\keras\utils\data_utils.py", line 727, in pool_fn
    initargs=(seqs, None, get_worker_id_queue()))
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\pool.py", line 176, in __init__
    self._repopulate_pool()
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
    w.start()
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\SunLab\.conda\envs\deepinterpolation\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe
jeromelecoq commented 2 years ago

It looks like it lost access to the data mid-way. Is it possible that the files were inaccessible for a bit?

frikyng commented 2 years ago

I wouldn't say so. I am working locally with an SSD and didn't have much else running at the time that could have clogged up the data connection. I don't know what the end of the output is supposed to look like, but the data of this epoch seem to have been processed in full. At least there is a 10/10 in the progress indicator.

jeromelecoq commented 2 years ago

This is the end of the first epoch? How is the validation data set provided?

frikyng commented 2 years ago

Can you elaborate, please? I'm sorry, but I am not familiar with this. What I did was literally just follow the instructions on the main page of this repo, i.e. activate the environment, navigate to the folder with cli_example_tiny_ephys_training.py, and then run the script in the terminal.

jeromelecoq commented 2 years ago

Sure. During the main part of the epoch, when the training happens, TensorFlow accesses the data provided by the training generator. When that is finished (showing 10/10 here), it jumps to the validation data to measure performance. So I was wondering if the issue could be related to the validation dataset or "generator_test_param".
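
To illustrate the order of operations, here is a small self-contained Keras sketch with dummy data (this is not the deepinterpolation code, just an illustration of when each generator is read):

import numpy as np
import tensorflow as tf


class DummySequence(tf.keras.utils.Sequence):
    # Stand-in for the training / validation generators: yields random
    # (input, target) batches so the example runs on its own.
    def __init__(self, n_batches):
        super().__init__()
        self.n_batches = n_batches

    def __len__(self):
        return self.n_batches

    def __getitem__(self, index):
        x = np.random.rand(4, 8).astype("float32")
        return x, x  # denoising-style target: reconstruct the input


model = tf.keras.Sequential([tf.keras.layers.Dense(8)])
model.compile(optimizer="adam", loss="mse")

train_generator = DummySequence(n_batches=10)  # plays the role of the training generator
test_generator = DummySequence(n_batches=2)    # plays the role of the validation generator

# After the 10/10 training batches of an epoch, Keras reads validation_data;
# a failure in that generator shows up right after the progress bar fills.
model.fit(train_generator, validation_data=test_generator, epochs=2)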

frikyng commented 2 years ago

Ah, I see, thanks. You can inspect the file I use to run deepinterpolation: to_inspect_deepinterp_FK.txt. I doubt there is anything fishy in it, though. Could the CUDA version be an issue? I have the most recent version installed on this PC (11.6).

jeromelecoq commented 2 years ago

It could be CUDA. See here for the combinations tested by TensorFlow: https://www.tensorflow.org/install/source#gpu

frikyng commented 2 years ago

Hmm, the PC I am currently working on doesn't have an NVIDIA graphics card, only integrated Intel graphics. I tried to work around it but couldn't find a way. Is it pointless to keep trying to make DeepInterpolation run? Otherwise I'll look for another PC/graphics card.

jeromelecoq commented 2 years ago

Any kind of deep learning work is better run on GPUs; the integrated cards probably only work for very small jobs. Some gaming cards are fairly inexpensive, see here: https://lambdalabs.com/gpu-benchmarks. The A100 is the Rolls-Royce right now, but there is a range of prices. Most of my training was done on much older cards; you can check the methods section of the paper.

frikyng commented 2 years ago

Thanks for the info. I'll check some prices.

frikyng commented 2 years ago

Hey, so I have luckily found a PC with a decent graphics card (GTX 1050) and could finally do my first deep interpolation on some calcium data (with the example data provided). However, when transitioning to my own data I get this error.

ValueError: A `Concatenate` layer requires inputs with matching shapes except for the 
concatenation axis. Received: input_shape=[(None, 64, 98, 1024), (None, 64, 99, 512)]

Do you know what the issue could be? It occurs in network_collection, in local_network_function. Regarding my data, the only difference between your sample and mine is the resolution: 796x512. I have also checked whether the dpi settings of the data interfered with the function and changed the metadata in FIJI. Thanks

jeromelecoq commented 2 years ago

Yes, changing the input size can have an impact on the merging layers. You could try to feed a 1024x512 image in instead of 796x512 (filling in with zeros). I think that should prevent this issue.
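
For illustration, a minimal numpy sketch of that zero-padding, assuming your movie is held as an array of shape (n_frames, 796, 512); the variable names and the placeholder data are only for the example:

import numpy as np

# Placeholder standing in for your own 796x512 movie (hypothetical variable name).
movie = np.zeros((100, 796, 512), dtype=np.float32)

target_height = 1024                       # pad 796 -> 1024
pad_rows = target_height - movie.shape[1]  # 228 rows of zeros

# Zero-pad only the 796 axis; the frame axis and the 512 axis are untouched.
padded = np.pad(movie, ((0, 0), (0, pad_rows), (0, 0)),
                mode="constant", constant_values=0)
print(padded.shape)  # (100, 1024, 512)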