Perilla-lab / TEMNet


error executing the software #3

Open rmarabini opened 2 years ago

rmarabini commented 2 years ago

I tried to use the program without success. In the following I describe what I did.

1) Clone the repository.
2) Create a virtual environment and install the Python modules (requirements.txt).
3) Download the databases (bash ./dataset/download_dataset.sh).
4) Unzip the downloaded databases. Both zip files create the same directory, 'backbone_dataset', and some files have the same name.
5) Execute the augment-images.py script. There are two copies of this script, one in the directory "etc" and the other in "dataset". In the following I describe the errors that appear when using the script located in "dataset":

a) The script assumes the existence of a directory called rcnn_dataset_full, which does not exist. I linked rcnn_dataset_full to backbone_dataset.

b) The execution returns the message:

2022-07-05 10:29:56.092803: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
['mature', 'eccentric', 'immature']
Number of images to process: 3
#############################################################################
Processing image #0

Loading image from ./rcnn_dataset_full/train/mature/mature.png
Traceback (most recent call last):
  File "/home/roberto/scipion3/software/em/TEMNet/./dataset/augment-images.py", line 470, in <module>
    expand_images_crops(crop_size, step_size, TRAIN_PATH, './rcnn_dataset_augmented/train', rewrite= True)
  File "/home/roberto/scipion3/software/em/TEMNet/./dataset/augment-images.py", line 429, in expand_images_crops
    img = load_img(LOAD_PATH)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/keras/utils/image_utils.py", line 393, in load_img
    with open(path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: './rcnn_dataset_full/train/mature/mature.png'

c) Indeed, the file './rcnn_dataset_full/train/mature/mature.png' does not exist.

Any help on how to proceed would be welcome.


By the way, I also tried to run the GUI, but it seems that my glibc version is not compatible with the one assumed by the binary.

[38082] Error loading Python lib '/home/roberto/scipion3/software/em/TEMNet/GUI/TEMNet/libpython3.8.so.1.0': dlopen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /home/roberto/scipion3/software/em/TEMNet/GUI/TEMNet/libpython3.8.so.1.0)
thanks for the help
jsreyl commented 2 years ago

Hi @rmarabini ! Thanks for letting us know about this issue.

I just updated the filenames the bash script assigns to the downloaded datasets; they were swapped previously. Unzipping the files should generate two directories called backbone_dataset/ and rcnn_dataset_full/, each of which contains training and validation folders.

The augment-images.py script assumes you are in the dataset folder and that the datasets were downloaded there; sorry this was not explicit before I updated the instructions in the README. Downloading the datasets and running the image augmentation should now work with:

    cd dataset/
    bash download_dataset.sh
    unzip backbone_dataset.zip
    unzip rcnn_dataset_full.zip
    python3 augment-images.py

Let me know if this helps.


I'm looking into the glibc requirements for the GUI. Which OS are you using?

rmarabini commented 2 years ago

Thanks for the advice, I will try it and let you know the results.

rmarabini commented 2 years ago

Hi,

Thanks for the help. I am using Debian GNU/Linux 10 (buster) as my operating system.

Regards,

Roberto


rmarabini commented 2 years ago

Hi @jsreyl,

I have been able to make progress executing the augment-images.py script, but I get an error when executing these lines of code:

  elif atype=='salt-pepper':
    #Add gaussian noise of mean 0 and stddev 1
    aug_image=image/255
    noise = tf.random.normal(shape=tf.shape(image), mean=0.0, stddev=1.0, dtype=tf.float32)
    aug_image = tf.add(image, noise)
    aug_image=aug_image*255
    aug_x, aug_y, aug_w, aug_h = x, y, w, h

the error states that in line

aug_image = tf.add(image, noise)

we are trying to add a uint8 array (image) and a float32 array (noise).

So the questions are:

1) I think this line is wrong

aug_image = tf.add(image, noise)

and should be

aug_image = tf.add(aug_image, noise)

note that both aug_image and noise are float arrays

2) After

aug_image=aug_image*255

the array aug_image should be converted back to uint8, maybe with

aug_image = aug_image.astype(np.uint8)

Let me know if you think that these two modifications should be included in the code.

jsreyl commented 2 years ago


Thanks, I'll set up a virtual machine with Debian to investigate the library requirements.

jsreyl commented 2 years ago


Thank you for the report. I was able to reproduce this error when running a Python environment with tensorflow > 2.1. TensorFlow 2.1, which is currently required for the training pipeline, does run the augmentation without explicit typecasting to uint8.

For compatibility with tensorflow >= 2.1 (which can be used for inference or backbone pretraining) I just pushed changes to the code (https://github.com/Perilla-lab/TEMNet/commit/cc3ce970e333b5486aee954427318dbae492c04e) casting the image as uint8 as you suggested: `aug_image = tf.cast(aug_image*255, dtype=tf.uint8)`.

Let me know if this helps.
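
For reference, after both changes the whole branch would look roughly like the helper below. This is only a sketch: it assumes `image` arrives as a uint8 array, the box coordinates `x, y, w, h` are passed through unchanged, and the clipping step is an extra safeguard that is not part of the committed fix.

```
import tensorflow as tf

def add_gaussian_noise(image, x, y, w, h):
    """Standalone sketch of the corrected 'salt-pepper' branch."""
    # Work on a float copy normalized to [0, 1]
    aug_image = tf.cast(image, tf.float32) / 255.0
    # Gaussian noise of mean 0 and stddev 1
    noise = tf.random.normal(shape=tf.shape(image), mean=0.0, stddev=1.0, dtype=tf.float32)
    # Add the noise to the float copy (aug_image), not to the uint8 original
    aug_image = tf.add(aug_image, noise)
    # Scale back to [0, 255], clip, and cast to uint8 so downstream code sees the expected dtype
    aug_image = tf.cast(tf.clip_by_value(aug_image, 0.0, 1.0) * 255.0, dtype=tf.uint8)
    # Pixel-level noise does not move the bounding boxes
    return aug_image, x, y, w, h
```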

rmarabini commented 2 years ago

Hi @jsreyl,

Now the augment-images.py script runs successfully ;-) so I moved on to the training procedure. I changed to the scripts/rcnn directory and executed the command

python3 train.py -b temnet -g 1

The output is the following error message:

TensorFlow Version 2.9.1
Number of devices recognized by Mirror Strategy: 2
Training for RPN: False
classes:Dataset: reading data from /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/train
classes:Dataset: reading data from /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/val
/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/keras/optimizers/optimizer_v2/gradient_descent.py:108: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)
Traceback (most recent call last):
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/train.py", line 70, in <module>
    train_model(args.backbone, args.weights, args.gpu)
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/train.py", line 43, in train_model
    rcnn = RCNN(config, 'train')
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 674, in __init__
    self.compile()
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 1669, in compile
    self.keras_model.add_loss(loss)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/keras/engine/base_layer.py", line 1386, in add_loss
    self._graph_network_add_loss(symbolic_loss)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/keras/engine/functional.py", line 870, in _graph_network_add_loss
    self._insert_layers(new_layers, new_nodes)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/keras/engine/functional.py", line 813, in _insert_layers
    layer_set = set(self._self_tracked_trackables)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.10/site-packages/tensorflow/python/training/tracking/data_structures.py", line 677, in __hash__
    raise TypeError("unhashable type: 'ListWrapper'")
TypeError: unhashable type: 'ListWrapper'

Do you have any idea what is going on? Should I install Tensorflow version 2.1 instead of the version 2.9 that I am using right now?

thanks for the help

Roberto

jsreyl commented 2 years ago

Hi @rmarabini ,

Thanks for reporting this issue. Indeed, I have seen that error before; it's caused by differences in the add_loss function of Keras across TensorFlow versions, and fixing it requires a major overhaul of the code. I've been working on it in the tf29 branch so that all the code can run on TensorFlow 2.9, but I have not managed to finish it yet. I believe in one to two weeks I will be able to release a TF 2.9-compatible version of the code. I'll keep you updated on my progress.

My suggestion is to install TensorFlow 2.1 using `pip install -r requirements_tf21.txt` to run the training pipeline. Note, however, that older versions of TensorFlow only supported Python 3.5 to 3.8, so you may have to create an environment with a Python version in that range (which is easy to do with conda: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

Let me know if this helps.

Juan

rmarabini commented 2 years ago

Hi @jsreyl,

Thanks for your help ;-). I installed Python 3.7, protobuf 3.20 and TensorFlow 2.1. This time, when I execute `python3 train.py -b temnet`, I get a different error. In particular, the line

RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.

may be interesting. Do I need a particular NVIDIA driver or CUDA version? I attach the output of the nvidia-smi command so you can see the driver and CUDA library that I have.

Cheers,

Roberto

TensorFlow Version 2.1.0
Number of devices recognized by Mirror Strategy: 1
Training for RPN: False
classes:Dataset: reading data from /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/train
classes:Dataset: reading data from /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/val
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    train_model(args.backbone, args.weights, args.gpu)
  File "train.py", line 43, in train_model
    rcnn = RCNN(config, 'train')
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 668, in __init__
    self.keras_model = self.build_entire_model(mode)
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 1449, in build_entire_model
    P2, P3, P4, P5, P6 = self.build_feature_maps(input_image)
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 1234, in build_feature_maps
    train_bn=self.config.TRAIN_BATCH_NORMALIZATION)
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 1189, in build_backbone
    x = KL.Conv2D(8, (13, 13), padding="same", name='conv1', use_bias=True)(input_tensor)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 748, in __call__
    self._maybe_build(inputs)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2116, in _maybe_build
    self.build(input_shapes)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/layers/convolutional.py", line 158, in build
    dtype=self.dtype)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 446, in add_weight
    caching_device=caching_device)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/training/tracking/base.py", line 744, in _add_variable_with_custom_getter
    kwargs_for_getter)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 142, in make_variable
    shape=variable_shape if variable_shape else None)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 258, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 219, in _variable_v1_call
    shape=shape)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 197, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/variable_scope.py", line 2596, in default_variable_creator
    shape=shape)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/variables.py", line 262, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1411, in __init__
    distribute_strategy=distribute_strategy)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py", line 1542, in _init_from_args
    initial_value() if init_from_fn else initial_value,
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer_utils.py", line 122, in <lambda>
    init_val = lambda: initializer(shape, dtype=dtype)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/init_ops_v2.py", line 425, in __call__
    return self._random_generator.random_uniform(shape, -limit, limit, dtype)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/init_ops_v2.py", line 788, in random_uniform
    shape=shape, minval=minval, maxval=maxval, dtype=dtype, seed=self.seed)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/ops/random_ops.py", line 265, in random_uniform
    minval = ops.convert_to_tensor(minval, dtype=dtype, name="min")
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 258, in constant
    allow_broadcast=True)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 266, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 96, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
RuntimeError: /job:localhost/replica:0/task:0/device:GPU:0 unknown device.

(temnet-env) roberto@clark7:~/scipion3/software/em/TEMNet/scripts/rcnn$ nvidia-smi
Fri Jul  8 15:25:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:5E:00.0 Off |                  N/A |
| 30%   39C    P8     8W / 350W |      4MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:AF:00.0 Off |                  N/A |
| 31%   42C    P8    31W / 350W |    116MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

jsreyl commented 2 years ago

Hi @rmarabini ,

Yes, TensorFlow 2.1 requires CUDA 10.2 and cuDNN 7.6 to work; there should be no problem with the driver. Let me know if this helps :).
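
If the GPU still shows up as unknown after installing those versions, a quick sanity check from inside the environment is to ask TensorFlow directly which devices it can see. This is just a diagnostic snippet, not part of the TEMNet scripts:

```
import tensorflow as tf  # TensorFlow 2.1 inside the temnet-env environment

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
# An empty list here usually points to a CUDA/cuDNN version mismatch
# rather than a problem with the NVIDIA driver itself.
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))
```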

rmarabini commented 2 years ago

Hi @jsreyl,

I just want to thank you for your help. Unfortunately, after playing for a couple of days with virtual environments and different versions, I have decided to give up :-(.

Just for the record, here is the error message that I get when I try to train the net:


(temnet-env) roberto@clark7:~/scipion3/software/em/TEMNet/scripts/rcnn$ time python3 train.py -b temnet
TensorFlow Version 2.1.0
Number of devices recognized by Mirror Strategy:  2
Training for RPN: False
classes:Dataset: reading data from  /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/train
classes:Dataset: reading data from  /home/roberto/scipion3/software/em/TEMNet/dataset/rcnn_dataset_augmented/val
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    train_model(args.backbone, args.weights, args.gpu)
  File "train.py", line 43, in train_model
    rcnn = RCNN(config, 'train')
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 674, in __init__
    self.compile()
  File "/home/roberto/scipion3/software/em/TEMNet/scripts/rcnn/model.py", line 1669, in compile
    self.keras_model.add_loss(loss)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 1081, in add_loss
    self._graph_network_add_loss(symbolic_loss)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 1484, in _graph_network_add_loss
    self._insert_layers(new_layers, new_nodes)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/network.py", line 1439, in _insert_layers
    layer_set = set(self._layers)
  File "/home/roberto/miniconda/envs/temnet-env/lib/python3.7/site-packages/tensorflow_core/python/training/tracking/data_structures.py", line 598, in __hash__
    raise TypeError("unhashable type: 'ListWrapper'")
TypeError: unhashable type: 'ListWrapper'

jsreyl commented 2 years ago

Hi @rmarabini, thanks for reporting the error. That should not be happening in a TensorFlow 2.1 environment. I just updated the code and tested it for training and prediction; it should work now.

We are working on containerizing the application so that the environment setup is not left to the user and the code can be run readily, instead of all the hassle you've been going through.

Let me know if this helps.