996237275 · closed 4 months ago
Hi.
The one known problem so far is that kmeans and diffusion condensation require a decent amount of RAM (e.g., on Slurm I need to set `--mem=20G` for the job to run successfully). If you are using a job scheduler like Slurm, you may need to request more memory.
If that is not the issue, please see the suggestions below.
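For reference, a minimal Slurm job-script sketch along these lines (only `--mem=20G` comes from this thread; the script name, GPU request, and partition are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH --mem=20G        # request 20 GB of RAM, as suggested above
#SBATCH --gres=gpu:1     # one GPU (assumption; a single A100 is reported to be enough)

# hypothetical invocation; substitute the actual generate_* script and its arguments
python generate_kmeans.py
```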
I use a single A100 for the experiment, so maybe that is not the problem? I do not find `--mem=20G` in `generate_xx.py`. 1. Yes, I updated the code already. 2.
Oh, I might have said something confusing. `--mem=20G` is a setting for running a Slurm job on a server: it requests 20 GB of RAM for the job so that the script can run successfully. If you don't have enough RAM, it may be a problem. But in most cases, if you are running on a server without job allocation, you should have more than enough RAM.
A single A100 should be more than enough.
Regarding your screenshot:
What if you do not use the `--rerun` argument? I recently found that to be more helpful.
Actually, I tried `--rerun` already; however, it does not work. When using `generate_diffusion.py`, it looks like it also gets into a deadlock?
Unfortunately I don't really understand the root cause of the problem.
So far, the setting that works on my end is the `--rerun` flag. If this still does not work, the following might be helpful: export `MKL_THREADING_LAYER=GNU` before running `generate_diffusion.py` or `generate_kmeans.py`.
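Concretely, the workaround above can be applied like this (a sketch; the variable and script names come from the thread, and the comment about why this helps reflects common MKL behavior rather than anything confirmed here):

```shell
# Force MKL to use the GNU (libgomp) threading layer, which is a common
# fix for hangs when MKL's default threading conflicts with other
# OpenMP runtimes loaded in the same process.
export MKL_THREADING_LAYER=GNU
echo "$MKL_THREADING_LAYER"

# Then run the affected script in the same shell, e.g.:
# python generate_diffusion.py
```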
I set `export MKL_THREADING_LAYER=GNU` on Linux, but it still deadlocks.
Thanks for trying these out. At this moment I am basically clueless. Sorry for not being able to be more helpful.
I am still suspecting it's a RAM problem but I don't have a solid proof.
I have updated the code for generating the kmeans and diffusion condensation results. I believe it may be good now?
Hello! I finished training the encoder on the retina dataset; however, something went wrong when I tried to use `generate_kmeans` and `generate_diffusion`. I tried the tips you mentioned about the 'deadlock', but it still does not work. The only script that works is `generate_baseline.py`.