DeepTrackAI / DeepTrack2


cuDNN Error in MAGIK Colab #132

Closed XinyueZhang831 closed 2 years ago

XinyueZhang831 commented 2 years ago

Hi @JesusPinedaC ,

Thank you for answering my last question!

I am currently struggling with another issue in the MAGIK Colab notebook. I am not sure whether this is because the GPU is not sufficient to process the training or because some packages are the wrong version.

The Colab works fine if I don't use the GPU, but then it can only predict a few frames (I tried 100 frames and the normal RAM crashed). So I switched to the GPU and tried it with 10 frames, but it did not work and returned the following message.

It would be great if you have any suggestions!

Creating graph edges... 100%|██████████| 1/1 [00:11<00:00, 11.90s/it]

InternalError                             Traceback (most recent call last)
in
      1 pred, gt, scores, graph = dt.models.gnns.get_predictions(
----> 2 test_nodesdf, ["centroid"], model, variables
      3 )

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   7162 def raise_from_not_ok_status(e, name):
   7163   e.message += (" name: " + name if name is not None else "")
-> 7164   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   7165
   7166

InternalError: Exception encountered when calling layer "layer_normalization" (type LayerNormalization).

cuDNN launch failure : input shape ([1,4045,64,1]) [Op:FusedBatchNormV3]

Call arguments received by layer "layer_normalization" (type LayerNormalization):
  • inputs=tf.Tensor(shape=(1, 4045, 64), dtype=float32)

XinyueZhang831 commented 2 years ago

Also, I have a question about the algorithm behind the MAGIK GNN model. Because I am dealing with particles that disappear and reappear, will MAGIK learn this and link the particles correctly? Should I generate some data to train it myself instead of using the pretrained model?

Thank you!

XinyueZhang831 commented 2 years ago

I just ran the code again; it seems the Colab GPU is not sufficient and cannot handle even 40 frames.

JesusPinedaC commented 2 years ago

Hello!

Thank you for your message!

I cannot reproduce the problem you describe, either in Colab or locally. The notebook runs normally, and the Colab GPU is able to process the entire video without running into memory issues.

Two questions:

  1. Have you tried terminating and restarting the Colab runtime? Try running the example again in this notebook.

  2. Are you running the notebook on the same test data as the original example?

XinyueZhang831 commented 2 years ago

Hi @JesusPinedaC !

I just ran the code in your notebook, and the example works well. But when I use my own data, it shows the following error. I think this is because the GPU is not sufficient?

Creating graph edges... 100%|██████████| 1/1 [07:30<00:00, 450.06s/it]

ResourceExhaustedError                    Traceback (most recent call last)
in
      1 pred, gt, scores, graph = dt.models.gnns.get_predictions(
----> 2 test_nodesdf, ["centroid"], model, variables
      3 )

3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   7184 def raise_from_not_ok_status(e, name):
   7185   e.message += (" name: " + name if name is not None else "")
-> 7186   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
   7187
   7188

ResourceExhaustedError: Exception encountered when calling layer "edge_ide2" (type Dense).

OOM when allocating tensor with shape[1,8401639,96] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]

Call arguments received:
  • inputs=tf.Tensor(shape=(1, 8401639, 64), dtype=float32)

JesusPinedaC commented 2 years ago

> Also, I have a question about the algorithm behind the MAGIK GNN model. Because I am dealing with particles that disappear and reappear, will MAGIK learn this and link the particles correctly? Should I generate some data to train it myself instead of using the pretrained model?
>
> Thank you!

The pre-trained model should exhibit gap-closing capabilities within a 3-frame time window.

This model was trained on cell data using only x/y centroids (to favor generalizability). Here, MAGIK can rely on the directionality of the cell movement to account for detection blinking.

However, since every dataset is different, it is best to train with custom data if possible, mainly if 1) you aim to reconnect detections that blink for longer than three frames in a row, or 2) there is information in addition to the x/y centroids that MAGIK can use as node features to boost its gap-closing capabilities (for example, particle intensity or morphological features).
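To make the second point concrete, here is a minimal sketch of what passing an extra node feature could look like. The "intensity" column, its normalization, and the extended feature list are assumptions for illustration only; the pre-trained example model uses only the x/y centroids, so a model retrained with the same feature set would be needed for this call to make sense.

```python
# Hypothetical sketch: normalize an existing intensity measurement and include
# it as an extra node feature. Requires a model trained with the same feature
# set (the pre-trained example model was not).
test_nodesdf["intensity"] = (
    test_nodesdf["intensity"] - test_nodesdf["intensity"].min()
) / (test_nodesdf["intensity"].max() - test_nodesdf["intensity"].min())

pred, gt, scores, graph = dt.models.gnns.get_predictions(
    test_nodesdf, ["centroid", "intensity"], model, variables
)
```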

JesusPinedaC commented 2 years ago

> Hi @JesusPinedaC !
>
> I just ran the code in your notebook, and the example works well. But when I use my own data, it shows the following error. I think this is because the GPU is not sufficient?
>
> Creating graph edges... 100%|██████████| 1/1 [07:30<00:00, 450.06s/it]
>
> ResourceExhaustedError                    Traceback (most recent call last)
> in
>       1 pred, gt, scores, graph = dt.models.gnns.get_predictions(
> ----> 2 test_nodesdf, ["centroid"], model, variables
>       3 )
>
> 3 frames
> /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
>    7184 def raise_from_not_ok_status(e, name):
>    7185   e.message += (" name: " + name if name is not None else "")
> -> 7186   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
>    7187
>    7188
>
> ResourceExhaustedError: Exception encountered when calling layer "edge_ide2" (type Dense).
>
> OOM when allocating tensor with shape[1,8401639,96] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]
>
> Call arguments received:
>   • inputs=tf.Tensor(shape=(1, 8401639, 64), dtype=float32)

Yes, it is definitely a memory problem.

You are trying to process 8401639 detections.

Do these detections correspond to a single video or are they multiple videos stacked in the dataframe?
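Before building the graph, it can help to check how large the problem actually is. A minimal sketch (the dataframe name `test_nodesdf` comes from the example notebook; the "frame" column name is an assumption and may differ in your data):

```python
# Inspect the size of the detection table before building the graph edges.
# Assumes a "frame" column; adjust the name to match your own dataframe.
n_detections = len(test_nodesdf)
per_frame = test_nodesdf.groupby("frame").size()

print(f"{n_detections} detections over {per_frame.shape[0]} frames")
print(f"detections per frame: mean {per_frame.mean():.1f}, max {per_frame.max()}")
```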

XinyueZhang831 commented 2 years ago

> > Also, I have a question about the algorithm behind the MAGIK GNN model. Because I am dealing with particles that disappear and reappear, will MAGIK learn this and link the particles correctly? Should I generate some data to train it myself instead of using the pretrained model? Thank you!
>
> The pre-trained model should exhibit gap-closing capabilities within a 3-frame time window.
>
> This model was trained on cell data using only x/y centroids (to favor generalizability). Here, MAGIK can rely on the directionality of the cell movement to account for detection blinking.
>
> However, since every dataset is different, it is best to train with custom data if possible, mainly if 1) you aim to reconnect detections that blink for longer than three frames in a row, or 2) there is information in addition to the x/y centroids that MAGIK can use as node features to boost its gap-closing capabilities (for example, particle intensity or morphological features).

Thank you! This is very useful information!

XinyueZhang831 commented 2 years ago

> > Hi @JesusPinedaC ! I just ran the code in your notebook, and the example works well. But when I use my own data, it shows the following error. I think this is because the GPU is not sufficient?
> >
> > Creating graph edges... 100%|██████████| 1/1 [07:30<00:00, 450.06s/it]
> >
> > ResourceExhaustedError                    Traceback (most recent call last)
> > in
> >       1 pred, gt, scores, graph = dt.models.gnns.get_predictions(
> > ----> 2 test_nodesdf, ["centroid"], model, variables
> >       3 )
> >
> > 3 frames
> > /usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
> >    7184 def raise_from_not_ok_status(e, name):
> >    7185   e.message += (" name: " + name if name is not None else "")
> > -> 7186   raise core._status_to_exception(e) from None  # pylint: disable=protected-access
> >    7187
> >    7188
> >
> > ResourceExhaustedError: Exception encountered when calling layer "edge_ide2" (type Dense).
> >
> > OOM when allocating tensor with shape[1,8401639,96] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:BiasAdd]
> >
> > Call arguments received:
> >   • inputs=tf.Tensor(shape=(1, 8401639, 64), dtype=float32)
>
> Yes, it is definitely a memory problem.
>
> You are trying to process 8401639 detections.
>
> Do these detections correspond to a single video or are they multiple videos stacked in the dataframe?

Yes, I am working on one video now, which has 984 frames.

JesusPinedaC commented 2 years ago

The limitation here is purely computational. Colab's GPU cannot handle this number of detections.

Fortunately, there is a way around the problem.

  1. One option is to analyze the video in temporal windows (e.g., 5 or 10 frames). It is preferable if the windows overlap: the prediction for each edge within the overlapping region is then the product of the predictions (1, "linked", or 0, "unlinked") obtained in each time window, so for a true connection both windows must classify the edge as "linked" (see the sketch at the end of this comment).

This strategy may still cause memory problems due to the density of your videos. In this case, I suggest the following:

  1. Combine spatial and temporal windows. Divide each frame into smaller subregions with fewer detections. Accordingly, whether an edge is classified as a "connection" will depend on the predictions from each time window and spatial region.

This would be a nice feature for DeepTrack/MAGIK. Please feel free to push your solution; we will gladly review it!

I will work on a solution in the next few weeks. As soon as it's incorporated, I'll let you know.
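To illustrate the first suggestion, here is a rough sketch of overlapping temporal windows. It assumes the detections dataframe has a "frame" column and delegates the per-window model call to a hypothetical helper `predict_window_edges`, which would wrap `dt.models.gnns.get_predictions` on the sub-dataframe and return a mapping from edges (pairs of detection indices) to 0/1 predictions; the window length and overlap are illustrative values.

```python
from collections import defaultdict

WINDOW = 10   # frames per temporal window (tune to what fits in GPU memory)
OVERLAP = 5   # frames shared by consecutive windows

def combine_window_predictions(nodes_df, predict_window_edges):
    """Run MAGIK on overlapping temporal windows and combine edge predictions.

    `predict_window_edges(window_df)` is a hypothetical helper returning
    {(i, j): 0 or 1} keyed by the original detection indices of each edge.
    """
    combined = defaultdict(lambda: 1)  # product over windows; starts at 1 ("linked")

    first, last = int(nodes_df["frame"].min()), int(nodes_df["frame"].max())
    start = first
    while start <= last:
        window_df = nodes_df[
            (nodes_df["frame"] >= start) & (nodes_df["frame"] < start + WINDOW)
        ]
        if len(window_df) > 1:
            for edge, linked in predict_window_edges(window_df).items():
                combined[edge] *= int(linked)  # a true connection must be "linked"
                                               # in every window that contains it
        start += WINDOW - OVERLAP

    return dict(combined)
```

The same combination rule extends to the denser case by also tiling each frame spatially (filtering on the centroid columns) before each per-window call.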

XinyueZhang831 commented 2 years ago

> The limitation here is purely computational. Colab's GPU cannot handle this number of detections.
>
> Fortunately, there is a way around the problem.
>
> 1. One option is to analyze the video in temporal windows (e.g., 5 or 10 frames). It is preferable if the windows overlap: the prediction for each edge within the overlapping region is then the product of the predictions (1, "linked", or 0, "unlinked") obtained in each time window, so for a true connection both windows must classify the edge as "linked."
>
> This strategy may still cause memory problems due to the density of your videos. In this case, I suggest the following:
>
> 1. Combine spatial and temporal windows. Divide each frame into smaller subregions with fewer detections. Accordingly, whether an edge is classified as a "connection" will depend on the predictions from each time window and spatial region.
>
> This would be a nice feature for DeepTrack/MAGIK. Please feel free to push your solution; we will gladly review it!
>
> I will work on a solution in the next few weeks. As soon as it's incorporated, I'll let you know.

Hi,

Thank you for the information; it is very helpful to know!

Thank you again!

JesusPinedaC commented 2 years ago

Do not hesitate to contact us again if you need help with any stage of the implementation!