Closed Paragjain10 closed 2 years ago
Hello, happy to hear that you are interested in our method :)
I think the output of nvidia-smi changed slightly, so our simple way to detect free GPUs broke. I pushed an update; please let me know if it works now.
Thank you for your response. I made the changes and now I am facing this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Default MaxPoolingOp only supports NHWC on device type CPU [[{{node unet_down_0_to_1/MaxPool}}]]
Let me know what should be done.
INFO:main:setting CUDA_VISIBLE_DEVICES to device []
It cannot find a free GPU. That can happen if you have only one GPU and your desktop environment is running on it.
In principle that is not a problem: you can just set CUDA_VISIBLE_DEVICES before you start the program,
either via export CUDA_VISIBLE_DEVICES=0
or by prepending CUDA_VISIBLE_DEVICES=0 to your command.
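For anyone launching the pipeline from a Python script instead of a shell, prepending the variable can be sketched roughly as below. This is only an illustration: the helper name run_with_visible_gpu is made up here, and the child command is a stand-in that merely echoes the variable, not the real run_ppp.py call.

```python
import os
import subprocess
import sys


def run_with_visible_gpu(cmd, device="0"):
    """Run cmd with CUDA_VISIBLE_DEVICES set only for the child process."""
    # Copy the current environment and override a single variable; this is
    # what prefixing a shell command with CUDA_VISIBLE_DEVICES=0 does.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=device)
    return subprocess.run(cmd, env=env, check=True,
                          capture_output=True, text=True)


# Stand-in child command (hypothetical): prints the value it sees.
child = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_visible_gpu(child, device="0")
seen_device = result.stdout.strip()
print(seen_device)
```

The variable has to be in place before the process initializes CUDA, which is why it is passed through the environment of the child rather than set afterwards.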
That fixed the previous error.
Now my GPU is running out of memory:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[212,32,48,48] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[node gradients/decoder_layer_2_0/Conv2D_grad/Conv2DBackpropInput (defined at /anaconda3/envs/Parag_GreenAI/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Let me know if you need the entire log file! What can be done in this case?
Well, what kind of GPU do you have?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 207...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8    25W / 215W |     56MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1001      G   /usr/lib/xorg/Xorg                 38MiB |
|    0   N/A  N/A      1058      G   /usr/bin/gnome-shell               15MiB |
+-----------------------------------------------------------------------------+
Hm, the memory might be a bit too small. You can try decreasing some parameters (in the config.toml file). I would start with:
- fewer feature maps: model.num_fmaps (maybe try 30) and autoencoder.num_fmaps (maybe 32, 48, 64)
- a smaller patch size, e.g. 25 (patchshape and autoencoder.XX_input_shape)
- a smaller train_input_shape
Hello @abred, I have completed my training and my results have been generated. My aim is to recreate the results published in the paper:
According to my understanding, avAP coco is calculated over IoU thresholds [0.5:0.95] with a step size of 0.05. Is that correct? I would also like to know how I can recreate the average APdsb value mentioned in the paper: what range of thresholds is used, and with what step size?
Hi, great to hear that it worked :)
I think you have an older version of the paper; we added some details in the ECCV (and the newer arXiv) version.
AP_coco is precision, TP/(TP+FP), counted over instances and averaged over the IoU thresholds [0.5:0.05:0.95].
AP_dsb is TP/(TP+FP+FN), averaged over the same range for wormbodies.
(I just pushed a small update to https://github.com/Kainmueller-Lab/evaluate-instance-segmentation)
The evaluation code outputs AP_dsb. If you want AP_coco you can either manually average the precision values or change
ap = 1.*(apTP) / max(1, apTP + apFN + apFP)
to ap = 1.*(apTP) / max(1, apTP + apFP)
in evaluate-instance-segmentation/evaluate.py
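The difference between the two metrics can be sketched as follows. The per-threshold counts are made-up numbers for illustration only, and the function names are not taken from evaluate.py; only the two formulas and the max(1, ...) guard mirror the lines quoted above.

```python
def ap_dsb(tp, fp, fn):
    """TP / (TP + FP + FN), guarded against division by zero."""
    return tp / max(1, tp + fp + fn)


def ap_coco(tp, fp):
    """Precision: TP / (TP + FP), guarded against division by zero."""
    return tp / max(1, tp + fp)


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Thresholds [start:step:stop], inclusive on both ends."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


thresholds = iou_thresholds()

# Hypothetical (TP, FP, FN) counts per threshold: matches get rarer as the
# IoU threshold rises. These are NOT results from the thread or the paper.
counts = {th: (90 - 5 * i, 10 + 5 * i, 10 + 5 * i)
          for i, th in enumerate(thresholds)}

av_dsb = sum(ap_dsb(*counts[th]) for th in thresholds) / len(thresholds)
av_coco = sum(ap_coco(*counts[th][:2]) for th in thresholds) / len(thresholds)
print(f"avAP_dsb={av_dsb:.4f}  avAP_coco={av_coco:.4f}")
```

Because AP_dsb also penalizes missed instances (FN), it can never exceed AP_coco for the same counts, which is one reason numbers reported under the two names are not directly comparable.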
Thank you for your reply! My results report AP averaged over [0.25:0.05:0.95], and the mean of these AP values is 0.6567. Following your previous comment, where AP_dsb is averaged over [0.5:0.05:0.95], I calculated it manually for my results and got 0.5597. This is not the same as the value mentioned in the latest version of the paper.
So I tried running the evaluation file with the changes made, but a few arguments need to be passed. During the earlier evaluation these parameters were supplied internally by run_ppp.py, so I am not sure what to pass if I want to run the evaluation file on its own.
Can you tell me what should be given in these arguments?
In general, if you want to repeat a step you can just delete the respective files and execute run_ppp.py again. For evaluation that is the [val,test]/evaluated folder.
That dataset was not my focus, but iirc, we did cross validation on it (to be as close as possible to the related work)
I think you have to add
[cross_validate]
checkpoints = [550000, 600000, 650000, 700000]
patch_threshold = [0.5, 0.6, 0.7]
fc_threshold = [0.5, 0.6, 0.7]
to the config file of your experiment (maybe change checkpoints depending on how long you have trained) and then execute run_ppp with --do cross_validate.
(just pushed a small fix for cross_validate)
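As a rough idea of how large that search is, the sketch below enumerates the grid spanned by the three lists. It assumes that every combination of checkpoint, patch_threshold and fc_threshold is evaluated; how run_ppp actually iterates over them is not shown in this thread.

```python
from itertools import product

# Values copied from the [cross_validate] block above.
cross_validate = {
    "checkpoints": [550000, 600000, 650000, 700000],
    "patch_threshold": [0.5, 0.6, 0.7],
    "fc_threshold": [0.5, 0.6, 0.7],
}

# Full Cartesian product of the three parameter lists (an assumption).
grid = list(product(cross_validate["checkpoints"],
                    cross_validate["patch_threshold"],
                    cross_validate["fc_threshold"]))

print(len(grid), "combinations, e.g.", grid[0])
```

Each extra checkpoint multiplies the work by the number of threshold pairs, so starting with a coarse checkpoint list keeps the run time manageable.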
@abred I cross-validated to check the results and looks like the average AP_dsb is 0.5607 at checkpoint 400000, which is the maximum iteration for which I have trained the network. Initially, I was getting the best checkpoint as 320000 and now it looks like I am getting better results at 400000. Do you think retraining the network for more iterations would help me get better results (like the ones in the paper)?
hmm 0.56 is a long way from our results. You could try to train a bit longer (700k tended to be a good length usually), but I am not sure it will get 0.1 better, that's a lot. So you are using the setup10 setup? Could you please send me the config file of your experiment?
I am using setup08. This is my config file:
[general]
# error 40
# warning 30
# info 20
# debug 10
logging = 20
debug = false
overwrite = false
[data]
train_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/train'
val_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/test'
test_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/test'
voxel_size = [1, 1]
input_format = 'zarr'
raw_key = "volumes/raw_bf"
gt_key = "volumes/gt_instances"
one_instance_per_channel_gt = "volumes/gt_labels"
num_channels = 1
[model]
train_net_name = 'train_net'
test_net_name = 'test_net'
train_input_shape = [ 256, 256,]
test_input_shape = [ 512, 512,]
patchshape = [ 1, 41, 41,]
patchstride = [ 1, 1, 1,]
num_fmaps = 30
max_num_inst = 2
fmap_inc_factors = [ 2, 2, 2, 2,]
fmap_dec_factors = [ 1, 1, 1, 1,]
downsample_factors = [ [ 2, 2,], [ 2, 2,], [ 2, 2,], [ 2, 2,],]
activation = 'relu'
padding = 'valid'
kernel_size = 3
num_repetitions = 2
# upsampling = 'trans_conv' or 'resize_conv', prefer resize_conv?
upsampling = 'resize_conv'
overlapping_inst = true
code_units = 252
autoencoder_chkpt = "this"
[optimizer]
optimizer = 'Adam'
lr = 0.0001
[preprocessing]
clipmax = 1500
[training]
batch_size = 1
num_gpus = 1
num_workers = 10
cache_size = 40
max_iterations = 500000
checkpoints = 20000
snapshots = 2000
profiling = 500
train_code = true
[training.sampling]
min_masked = 0.001
min_masked_overlap = 0.0001
overlap_min_dist = 0
overlap_max_dist = 15
probability_overlap = 0.5
probability_fg = 0.5
[training.augmentation.elastic]
control_point_spacing = [10, 10]
jitter_sigma = [1, 1]
rotation_min = -45
rotation_max = 45
[training.augmentation.intensity]
scale = [0.9, 1.1]
shift = [-0.1, 0.1]
[training.augmentation.simple]
# mirror = [0, 1, 2]
# tranpose = [0, 1, 2]
[prediction]
output_format = 'zarr'
aff_key = 'volumes/pred_affs'
code_key = 'volumes/pred_code'
fg_key = 'volumes/pred_numinst'
fg_thresh = 0.5
decode_batch_size = 1024
[validation]
[cross_validate]
checkpoints = [320000, 340000, 360000, 380000, 400000, 420000, 440000, 460000, 480000, 500000]
patch_threshold = [0.5, 0.6, 0.7]
fc_threshold = [0.5, 0.6, 0.7]
[testing]
num_workers = 5
[vote_instances]
patch_threshold = 0.9
fc_threshold = 0.5
cuda = true
blockwise = false
num_workers = 8
chunksize = [92, 92, 92]
select_patches_for_sparse_data = true
save_no_intermediates = true
output_format = 'hdf'
parallel = false
includeSinglePatchCCS = true
sample = 1.0
removeIntersection = true
mws = false
isbiHack = false
mask_fg_border = false
graphToInst = false
skipLookup = false
skipConsensus = false
skipRanking = false
skipThinCover = true
affinity_graph_voting = false
affinity_graph_voting_selected = false
termAfterThinCover = false
fg_thresh_vi = 0.1
consensus_interleaved_cnt = false
consensus_norm_prob_product = true
consensus_prob_product = true
consensus_norm_aff = true
vi_bg_use_inv_th = false
vi_bg_use_half_th = false
vi_bg_use_less_than_th = true
rank_norm_patch_score = true
rank_int_counter = false
patch_graph_norm_aff = true
blockwise_old_stitch_fn = false
only_bb = false
# overlap = [ 0, 0, 5,]
flip_cons_arr_axes = false
return_intermediates = false
# aff_graph = "/path/to/file"
# selected_patch_pairs = "/path/to/file"
# crop_z_s = 100
# crop_y_s = 100
# crop_x_s = 100
# crop_z_e = 150
# crop_y_e = 150
# crop_x_e = 150
[evaluation]
num_workers = 1
res_key = 'vote_instances'
metric = 'confusion_matrix.th_0_5.AP'
[postprocessing]
remove_small_comps = 600
[postprocessing.watershed]
output_format = 'hdf'
[visualize]
samples_to_visualize = ['A10', 'C17']
show_patches = true
[autoencoder]
train_net_name = 'train_net'
test_net_name = 'test_net'
train_input_shape = [1, 41, 41]
test_input_shape = [1, 41, 41]
patchshape = [1, 41, 41]
patchstride = [1, 1, 1]
# network_type = 'conv' or 'dense'
network_type = 'conv'
activation = 'relu'
code_activation = 'sigmoid'
# dense
encoder_units = [500, 1000]
decoder_units = [1000, 500]
# conv
num_fmaps = [32, 48, 64]
downsample_factors = [[2, 2], [2, 2], [2, 2]]
upsampling = 'resize_conv'
kernel_size = 3
num_repetitions = 2
padding = 'same'
# if network_type = conv
# code_method = 'global_average_pool' or 'dense' or 'conv'?
code_method = 'conv1x1'
# code_method = 'global_average_pool'
# code_method = 'dense'
code_units = 252
regularizer = 'l2'
regularizer_weight = 1e-4
loss_fn = 'mse'
# upsampling = 'trans_conv' or 'resize_conv', prefer resize_conv?
overlapping_inst = false
OK, setup08 is ppp+dec.
Two things, for your current setup, please change your config like this (that is, change those values, keep the rest)
[vote_instances]
vi_bg_use_inv_th = true
vi_bg_use_less_than_th = false
fg_thresh_vi = -1.0
skipThinCover = false
mws = true
includeSinglePatchCCS = false
[validation]
params = ["patch_threshold", "fc_threshold"]
[evaluation]
num_workers = 1
res_key = "vote_instances"
metric = "confusion_matrix.avAP"
print_f_factor_perc_gt_0_8 = false
use_linear_sum_assignment = false
foreground_only = false
then delete the val/evaluated and val/instanced folders and run cross_validate again (it should recompute the labelling and evaluation).
Additionally you could try training a new experiment where you additionally change these values:
[autoencoder]
overlapping_inst = true
code_method = 'conv1x1_b'
[training.augmentation.elastic]
control_point_spacing = [ 40, 40,]
jitter_sigma = [ 2, 2,]
rotation_min = 0
rotation_max = 90
subsample = 2
[training.sampling]
min_masked = 0.002
min_masked_overlap = 0.002
overlap_min_dist = 0
overlap_max_dist = 15
probability_overlap = 0.5
probability_fg = 0.5
[optimizer]
lr = 5e-5
@abred Of the two things you suggested, I tried the cross-validate part. The results were evaluated, but the program did not finish with exit code 0:
Also, I now have three folders in Val/evaluate for every checkpoint: patch_threshold_0_5, patch_threshold_0_6, patch_threshold_0_7.
If the code had exited successfully, would I have got a final best checkpoint with its average AP, or do I have to compute it manually from these folders?
Having multiple folders is good, that is a sign that it's working. And yes, you would get the best checkpoint with the best parameter set at the end.
Wrt the error: the assertion is some kind of sanity check. Could you please print v[0] and samples?
I am grateful for your help @abred. In the results that I achieved, avAP is 0.695.
Results published in the paper:
I manually put together a table and calculated the mean avAP for every checkpoint, as seen below. Also, is the displayed avAP value (0.695) the average over (0_5, 0_6, 0_7), or the best of the values for 0_5, 0_6, 0_7? I guess the table below will help in understanding my point.
I have trained the network for 400k iterations. Do you think training for more iterations will help improve the results and bring them closer to the results of the paper?
You are very welcome, I am sorry for all the trouble.
Somehow [evaluation] got lost in my post above (first part of suggestions); I have edited it now. We optimized for avAP, whereas you have been optimizing for AP_0_5 so far, so this should fix it.
avAP is the best one (av(eraged) over the IoU thresholds).
Unfortunately you cannot compute it this way. This dataset is a bit messy, which you can guess from the first part (AP_coco) of the table. Everyone has been using different metrics, which makes it really hard to compare it across methods.
They also didn't use a train/val/test split. What we did is: try to use the same 50:50 train/val split, split val again into two halves (2*25), and then cross-validate on those halves to make sure that we don't optimize on the test set. We take the first 25 as validation and the second 25 as test, and then the other way around. This also means that we might select different parameters/checkpoints in the two rounds, which you can see in the first line of your output: for the first round it selected checkpoint 340k and 0.7, and for the second one 320k and 0.6.
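The two-round selection described above can be sketched as follows. The scores are invented placeholders, the function names are hypothetical, and averaging the two held-out scores at the end is an assumption rather than something stated in this thread.

```python
def best_setting(scores):
    """Return the setting (dict key) with the highest score."""
    return max(scores, key=scores.get)


def two_fold_select(scores_a, scores_b):
    """Select on one half, report on the other, then swap the roles."""
    pick_a = best_setting(scores_a)   # round 1: half A is validation
    pick_b = best_setting(scores_b)   # round 2: half B is validation
    held_out_b = scores_b[pick_a]     # round 1: half B is test
    held_out_a = scores_a[pick_b]     # round 2: half A is test
    # Assumption: the final number is the mean of both held-out scores.
    return pick_a, pick_b, (held_out_a + held_out_b) / 2


# Hypothetical avAP per (checkpoint, patch_threshold) on each half.
half_a = {(320000, 0.6): 0.66, (340000, 0.7): 0.68}
half_b = {(320000, 0.6): 0.67, (340000, 0.7): 0.65}

pick_a, pick_b, mean_held_out = two_fold_select(half_a, half_b)
print(pick_a, pick_b, round(mean_held_out, 3))
```

The setting chosen in each round is never scored on the half it was chosen on, which is why the two rounds can end up with different checkpoints and thresholds.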
So, I should update the config file according to the latest changes and cross_validate again?
Yes, you just have to modify the values in the evaluation block in the config like this:
[evaluation]
num_workers = 1
res_key = "vote_instances"
metric = "confusion_matrix.avAP"
print_f_factor_perc_gt_0_8 = false
use_linear_sum_assignment = false
foreground_only = false
and run run_ppp again (I think you don't have to delete anything, only the text output will change)
@abred I made the changes and cross_validated again, there are no changes in my results. They are exactly the same as the previous post.
What should be done in this case?
Did the first line in your screenshot change to INFO:__main__:confusion_matrix.avAP? If not, maybe delete the evaluated folder and run it again.
@abred I deleted the val/evaluated folder and ran it again, yet got the same result:
hmm something seems to be wrong with your config, could you post it again, please? And make sure that you use the path to the same config file in your run_ppp command (and not some older version).
I am using this config file:
[general]
logging = 20
debug = false
overwrite = false
[data]
train_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/train'
val_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/test'
test_data = '/home/student2/Desktop/Parag_masterthesis/traintest_data/test'
voxel_size = [1, 1]
input_format = 'zarr'
raw_key = "volumes/raw_bf"
gt_key = "volumes/gt_instances"
one_instance_per_channel_gt = "volumes/gt_labels"
num_channels = 1
[model]
train_net_name = 'train_net'
test_net_name = 'test_net'
train_input_shape = [ 256, 256,]
test_input_shape = [ 512, 512,]
patchshape = [ 1, 41, 41,]
patchstride = [ 1, 1, 1,]
num_fmaps = 30
max_num_inst = 2
fmap_inc_factors = [ 2, 2, 2, 2,]
fmap_dec_factors = [ 1, 1, 1, 1,]
downsample_factors = [ [ 2, 2,], [ 2, 2,], [ 2, 2,], [ 2, 2,],]
activation = 'relu'
padding = 'valid'
kernel_size = 3
num_repetitions = 2
upsampling = 'resize_conv'
overlapping_inst = true
code_units = 252
autoencoder_chkpt = "this"
[optimizer]
optimizer = 'Adam'
lr = 0.0001
[preprocessing]
clipmax = 1500
[training]
batch_size = 1
num_gpus = 1
num_workers = 10
cache_size = 40
max_iterations = 400000
checkpoints = 20000
snapshots = 2000
profiling = 500
train_code = true
[training.sampling]
min_masked = 0.001
min_masked_overlap = 0.0001
overlap_min_dist = 0
overlap_max_dist = 15
probability_overlap = 0.5
probability_fg = 0.5
[training.augmentation.elastic]
control_point_spacing = [10, 10]
jitter_sigma = [1, 1]
rotation_min = -45
rotation_max = 45
[training.augmentation.intensity]
scale = [0.9, 1.1]
shift = [-0.1, 0.1]
[training.augmentation.simple]
[prediction]
output_format = 'zarr'
aff_key = 'volumes/pred_affs'
code_key = 'volumes/pred_code'
fg_key = 'volumes/pred_numinst'
fg_thresh = 0.5
decode_batch_size = 1024
[validation]
params = ["patch_threshold", "fc_threshold"]
[cross_validate]
checkpoints = [320000, 340000, 360000, 380000, 400000]
patch_threshold = [0.5, 0.6, 0.7]
fc_threshold = [0.5, 0.6, 0.7]
[testing]
num_workers = 5
[vote_instances]
patch_threshold = 0.9
fc_threshold = 0.5
cuda = true
blockwise = false
num_workers = 8
chunksize = [92, 92, 92]
select_patches_for_sparse_data = true
save_no_intermediates = true
output_format = 'hdf'
parallel = false
includeSinglePatchCCS = false
sample = 1.0
removeIntersection = true
mws = true
isbiHack = false
mask_fg_border = false
graphToInst = false
skipLookup = false
skipConsensus = false
skipRanking = false
skipThinCover = false
affinity_graph_voting = false
affinity_graph_voting_selected = false
termAfterThinCover = false
fg_thresh_vi = -0.1
consensus_interleaved_cnt = false
consensus_norm_prob_product = true
consensus_prob_product = true
consensus_norm_aff = true
vi_bg_use_inv_th = false
vi_bg_use_half_th = true
vi_bg_use_less_than_th = false
rank_norm_patch_score = true
rank_int_counter = false
patch_graph_norm_aff = true
blockwise_old_stitch_fn = false
only_bb = false
flip_cons_arr_axes = false
return_intermediates = false
[evaluation]
num_workers = 1
res_key = 'vote_instances'
metric = 'confusion_matrix.th_0_5.AP'
print_f_factor_perc_gt_0_8 = false
use_linear_sum_assignment = false
foreground_only = false
[postprocessing]
remove_small_comps = 600
[postprocessing.watershed]
output_format = 'hdf'
[visualize]
samples_to_visualize = ['A10', 'C17']
show_patches = true
[autoencoder]
train_net_name = 'train_net'
test_net_name = 'test_net'
train_input_shape = [1, 41, 41]
test_input_shape = [1, 41, 41]
patchshape = [1, 41, 41]
patchstride = [1, 1, 1]
network_type = 'conv'
activation = 'relu'
code_activation = 'sigmoid'
encoder_units = [500, 1000]
decoder_units = [1000, 500]
num_fmaps = [32, 48, 64]
downsample_factors = [[2, 2], [2, 2], [2, 2]]
upsampling = 'resize_conv'
kernel_size = 3
num_repetitions = 2
padding = 'same'
code_method = 'conv1x1'
code_units = 252
regularizer = 'l2'
regularizer_weight = 1e-4
loss_fn = 'mse'
overlapping_inst = false
Yes, please change [evaluation].metric to metric = "confusion_matrix.avAP" as above.
I am so sorry for the mistake. The results are now changed to:
The new result is better, but it is still not as good as the one in the paper: the avAP here is 0.6989, while in the paper it is 0.727. Do you think this is good enough, or can better results be achieved? @abred
I mean, we got 0.727 so it is possible ;) Your results are close enough to assume that everything is working as expected now, that's good. I would say there are two options now, train a bit longer (maybe 700k) or use the parameters I posted above to train a completely new setup. Both might lead to a bit better results. However, due to your GPU memory constraints your networks are also slightly smaller, this is probably also one reason that the results are a bit worse.
Okay, thank you so much for your constant help. I am going to train this network on my own dataset as well, so I guess training longer (700k) and using those parameters could be done during that training.
No problem. What kind of data do you have, if I may ask? Might also be the case that for different data slightly different parameters are optimal.
cool, sounds interesting :) Then I wouldn't spend too much time trying to optimize it for the worm data, your data looks quite different. But maybe still try it with the updated parameters I have posted.
Yes, I agree. Should I train it for 700k iterations with the updated parameters at first itself? Or first, try to train it for 400k-500k iterations?
How long does the training take? If it is quick, it is probably easier to just train it and then run cross_validate over a larger range of checkpoints (maybe with larger steps at first). (Btw, in case it is not clear: to continue training you just have to change the max_iterations value in the config and then execute run_ppp with -d train again.)
Yes, okay this sounds perfect.
Now I am working with my own dataset. Besides getting it into the right format for the network, what else needs to be changed in the code? My dataset has different classes, described below:
class_ids = {1: 'Vaccinium_Singularized', 2: 'Vaccinium_Target object', 3: 'Vaccinium_Occluded', 4: 'Vaccinium_2nd_Step_Occluded', 5: 'Waste', }
The mask encodes both ids: after dividing the mask intensity by 4, for each nonzero pixel the left digit is the class id and the right digit is the instance id. @abred
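One possible reading of that encoding is sketched below. It assumes integer division by 4 and a two-digit result (tens digit = class id, ones digit = instance id); both are guesses from the description, and decode_mask_value is a hypothetical helper, not part of PatchPerPix.

```python
def decode_mask_value(intensity):
    """Split one mask intensity into (class_id, instance_id).

    Assumes intensity // 4 is a two-digit number whose tens digit is the
    class id and whose ones digit is the instance id. Returns None for
    background pixels.
    """
    value = intensity // 4
    if value == 0:
        return None
    class_id, instance_id = divmod(value, 10)
    return class_id, instance_id


# Made-up intensities: background, then class 1, class 2 and class 5 pixels.
decoded = [decode_mask_value(v) for v in (0, 44, 92, 208)]
print(decoded)
```

Under this reading, instance ids are only unique within a class, so a label image for instance segmentation would need each (class_id, instance_id) pair remapped to its own integer.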
Hi,
in the current form PatchPerPix does not do semantic segmentation, but instance segmentation (separates instances but does not assign class identities to pixels/instances). One relatively straightforward way to get semantic labels is to add another channel to the network (or train a small separate one) that predicts the semantic label per pixel and then do majority voting over all pixels in an instance at the end of postprocessing.
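The majority-voting step could look roughly like the sketch below. It is not part of PatchPerPix: the function name, the nested-list image representation and the toy inputs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(instance_labels, semantic_labels, background=0):
    """Assign each instance the most frequent semantic label of its pixels.

    Both inputs are 2D nested lists of the same shape: one holds the
    instance id per pixel, the other the predicted class id per pixel.
    """
    votes = defaultdict(Counter)
    for inst_row, sem_row in zip(instance_labels, semantic_labels):
        for inst, sem in zip(inst_row, sem_row):
            if inst != background:
                votes[inst][sem] += 1
    # most_common(1) returns the top label; ties go to the label seen first.
    return {inst: counter.most_common(1)[0][0]
            for inst, counter in votes.items()}


# Toy 3x3 example (made-up values): two instances, noisy per-pixel classes.
instances = [[0, 1, 1],
             [2, 2, 1],
             [2, 2, 0]]
semantic = [[0, 3, 3],
            [5, 5, 1],
            [5, 3, 0]]

instance_classes = majority_vote(instances, semantic)
print(instance_classes)
```

Voting per instance smooths out isolated misclassified pixels, so the semantic branch does not need to be pixel-perfect for the final class assignment to be stable.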
Hello, I am a student working on my thesis at the University of Bremen, and I am using your method for it. I followed all the steps mentioned in the repository, but when I run the code this error pops up:
Traceback (most recent call last):
  File "/home/student2/Desktop/Parag_masterthesis/PatchPerPix/PatchPerPix_experiments/run_ppp.py", line 1588, in <module>
    main()
  File "/home/student2/Desktop/Parag_masterthesis/PatchPerPix/PatchPerPix_experiments/run_ppp.py", line 1377, in main
    quantity=config['training']['num_gpus'])
  File "/home/student2/Desktop/Parag_masterthesis/PatchPerPix/PatchPerPix/util/selectGPU.py", line 27, in selectGPU
    pid = ln.split()[1]
IndexError: list index out of range
Has anybody faced the same issue? Can you help me with this? Thank you!