996237275 · closed 4 months ago
Hi.
The one known problem so far is that kmeans and diffusion condensation require a decent amount of RAM (e.g., on Slurm I need to set `--mem=20G` for the job to run successfully). If you are using a job scheduler like Slurm, you may need to request more memory.
If that is not the issue, please see the suggestions below.
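For reference, a minimal Slurm job-script sketch along these lines (only `--mem=20G` comes from this thread; the script name, GPU request, and partition are assumptions for illustration):

```shell
#!/bin/bash
#SBATCH --mem=20G        # request 20 GB of RAM, as suggested above
#SBATCH --gres=gpu:1     # one GPU (assumption; a single A100 is reported to be enough)

# hypothetical invocation; substitute the actual generate_* script and its arguments
python generate_kmeans.py
```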
I use a single A100 for the experiment, so maybe that is not the problem? I do not find `--mem=20G` in `generate_xx.py`. 1. Yes, I updated the code already. 2.
Oh, I might have said something confusing. `--mem=20G` is a setting for running a Slurm job on a server: it requests 20 GB of RAM for the job so that the script can run successfully. If you don't have enough RAM, it may be a problem. But in most cases, if you are running on a server without job allocation, you should have more than enough RAM.
A single A100 should be more than enough.
Regarding your screenshot:
What if you do not use the `--rerun` argument? I recently found that to be more helpful.
Actually, I tried `--rerun` already; however, it does not work. When using `generate_diffusion.py`, it looks like it also gets into a deadlock?
Unfortunately I don't really understand the root cause of the problem.
So far, the setting that works on my end is the `--rerun` flag. If this still does not work, the following might be helpful: export `MKL_THREADING_LAYER=GNU` before running `generate_diffusion.py` or `generate_kmeans.py`.
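Concretely, the workaround above can be applied like this (a sketch; the variable and script names come from the thread, and the comment about why this helps reflects common MKL behavior rather than anything confirmed here):

```shell
# Force MKL to use the GNU (libgomp) threading layer, which is a common
# fix for hangs when MKL's default threading conflicts with other
# OpenMP runtimes loaded in the same process.
export MKL_THREADING_LAYER=GNU
echo "$MKL_THREADING_LAYER"

# Then run the affected script in the same shell, e.g.:
# python generate_diffusion.py
```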
I set `export MKL_THREADING_LAYER=GNU` on Linux, but it still deadlocks.
Thanks for trying these out. At this moment I am basically clueless. Sorry for not being able to be more helpful.
I am still suspecting it's a RAM problem but I don't have a solid proof.
I have updated the code for generating the kmeans and diffusion condensation results. I believe it may be good now?
Hello! I finished training the encoder on the retina dataset; however, something went wrong when I tried to use `generate_kmeans` and `generate_diffusion`. I tried the tips you mentioned about the 'deadlock', but it still does not work. The only script that works is `generate_baseline.py`.