PeterJackNaylor / DRFNS

This repository contains the code necessary in order to reproduce the work contained in the submitted paper: "Segmentation of Nuclei in Histopathology Images by deep regression of the distance map".
MIT License

About The Memory Usage #4

Closed: 6zhc closed this issue 4 years ago

6zhc commented 5 years ago

I cloned your code and have already built the environment for it. After I ran the command nextflow run realdataset.nf --epoch 80 -c nextflow.config -resume , it got stuck at the createTFrecord step (CreateRecords in the log) and memory usage increased quickly. After almost 4 hours it still had not reached the next step, and memory usage had grown to more than 100 GB. Has something gone wrong? I am using the dataset downloaded by download_data.sh.

Here is the output from .nextflow.log:

Jun-08 17:56:28.512 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 8; name: Mean (2); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/7d/effe8d8bb6ae2ca9759f15bdf5fc37]
Jun-08 17:57:51.916 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 3; name: BinToDistance; status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/2d/7e883ab52d392bc43a2d66e7ace529]
Jun-08 17:57:51.925 [Actor Thread 9] DEBUG nextflow.Session - <<< barrier arrive (process: BinToDistance)
Jun-08 17:57:51.935 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Jun-08 17:57:51.936 [Task submitter] INFO  nextflow.Session - [7b/cd94e7] Submitted process > Mean (3)
Jun-08 17:57:51.942 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Jun-08 17:57:51.942 [Task submitter] INFO  nextflow.Session - [92/a582bb] Submitted process > CreateRecords (7)
Jun-08 17:57:51.989 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Jun-08 17:57:51.996 [Task submitter] INFO  nextflow.Session - [37/377809] Submitted process > CreateRecords (8)
Jun-08 17:57:52.001 [Task submitter] DEBUG nextflow.executor.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
Jun-08 17:57:52.001 [Task submitter] INFO  nextflow.Session - [92/fb4f38] Submitted process > CreateRecords (9)
Jun-08 17:57:58.117 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 12; name: Mean (3); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/7b/cd94e76e959e18810079869a009ab5]
Jun-08 17:57:58.120 [Actor Thread 12] DEBUG nextflow.Session - <<< barrier arrive (process: Mean)
Jun-08 17:58:27.133 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 9; name: CreateRecords (4); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/a8/d819b03f23e5e5a14159f2d75b451b]
Jun-08 17:58:45.317 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 10; name: CreateRecords (5); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/13/6919013f9786a3028655f001b32011]
Jun-08 17:59:14.192 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 7; name: CreateRecords (3); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/06/fc7304daaed31a92fe99d666f2ac4b]
Jun-08 17:59:58.729 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 6; name: CreateRecords (2); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/3d/f822a1403d234f74db5ca522df647c]
Jun-08 18:01:07.643 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 15; name: CreateRecords (9); status: COMPLETED; exit: 0; error: -; workDir: /home2/zhc/DRFNS/work/92/fb4f389fe4812545a69e9a76e65636]
Jun-08 18:01:20.865 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 4 -- pending tasks are shown below
~> TaskHandler[id: 5; name: CreateRecords (1); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22]
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
~> TaskHandler[id: 14; name: CreateRecords (8); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/37/3778092909ab3a281847576c508b71]
.... (nearly identical entries omitted; I did not copy them)
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
Jun-08 21:56:23.329 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 3 -- pending tasks are shown below
~> TaskHandler[id: 5; name: CreateRecords (1); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22]
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
Jun-08 22:01:23.370 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 3 -- pending tasks are shown below
~> TaskHandler[id: 5; name: CreateRecords (1); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22]
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
Jun-08 22:06:23.403 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 3 -- pending tasks are shown below
~> TaskHandler[id: 5; name: CreateRecords (1); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22]
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
Jun-08 22:11:23.440 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 3 -- pending tasks are shown below
~> TaskHandler[id: 5; name: CreateRecords (1); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22]
~> TaskHandler[id: 11; name: CreateRecords (6); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/5b/17c8268c4a072cde81e020b34f8a8a]
~> TaskHandler[id: 13; name: CreateRecords (7); status: RUNNING; exit: -; error: -; workDir: /home2/zhc/DRFNS/work/92/a582bbfc6e9fe260697f0fd54ad95b]
PeterJackNaylor commented 5 years ago

Hey, what hardware are you running this on? I may be wrong, but Nextflow is probably launching the processes on your local machine. The following scenario would be consistent with what you describe: if Nextflow is running on a local machine with, say, 8 CPUs, it can submit up to 8 processes at once, which may be too much for the machine, exhausting the RAM and forcing it to swap. If that is the case, you could try limiting the number of submitted processes to 1.

6zhc commented 5 years ago

I don't think so. The log shows that only three processes are running, and they have been running for a long time. CreateRecords should have 9 processes, and the other six have already finished.
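The claim about which tasks are still running can be checked directly from the log. The following is a minimal sketch, assuming the TaskHandler line format shown in the log excerpt above; the function name running_tasks is hypothetical and not part of Nextflow or DRFNS.

```shell
# List the distinct task names that appear with status RUNNING in a
# Nextflow log. Assumes lines of the form
#   TaskHandler[id: N; name: <task name>; status: RUNNING; ...]
# as seen in the excerpt above.
running_tasks() {
    local logfile="${1:-.nextflow.log}"
    sed -n 's/.*TaskHandler\[id: [0-9]*; name: \([^;]*\); status: RUNNING;.*/\1/p' \
        "$logfile" | sort -u
}

# Example (from the directory where nextflow was launched):
#   running_tasks .nextflow.log
```

Note that the monitor repeats the pending list every few minutes, so a task that finished later in the run still shows up here; the most recent "tasks to be completed" entry in the log is the one that reflects the current state.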

PeterJackNaylor commented 5 years ago

Yes, but only the creation of the training record is long and memory-intensive (the others are for validation and test and are relatively small). Is there any swapping going on, and what hardware are you running on? I believe your issue is purely a Nextflow/hardware configuration problem.
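One way to answer the swap question on a Linux server is to read the swap counters from /proc/meminfo. This is a rough sketch under that assumption; the helper name swap_used_kb is hypothetical.

```shell
# Print the amount of swap currently in use, in kB, computed as
# SwapTotal - SwapFree. Reads /proc/meminfo by default (Linux only);
# a different file in the same format can be passed for testing.
swap_used_kb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ {t = $2} /^SwapFree:/ {f = $2} END {print t - f}' "$meminfo"
}

# Example:
#   swap_used_kb        # 0 means nothing has been swapped out
```

A value that keeps growing while the CreateRecords tasks run would suggest the machine has run out of RAM and started swapping.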

6zhc commented 5 years ago

I am running it on a server with 4 CPUs (12 processors each), 4 GTX 1080 GPUs and 256 GB of memory, so I think the hardware is good enough. I have no idea how to change the config, since Nextflow is completely new to me; I just used your configuration, which might be the cause. Could you tell me how to set it?

PeterJackNaylor commented 5 years ago

OK, your hardware is indeed good enough. This is a single local machine, and with such specs it shouldn't stall. Could you add the following lines to your nextflow.config: profiles { local { process.executor = 'local' executor.queueSize = 1 } } This should force Nextflow to submit only one process at a time, which will hopefully work. If you were running on a cluster (such as SGE, PBS or Slurm), you could instead set process.executor to the matching scheduler, for example process.executor = 'sge'.
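For readability, here is the suggested profile laid out over several lines, as it might appear in nextflow.config. This is only a sketch of the snippet from the comment above. A profile generally takes effect only when it is selected with the -profile option, so the run command in the comment is an assumption about how it would be activated here. If nextflow.config already contains a profiles block, it is probably safer to add the local entry inside that block than to add a second one.

```groovy
// Activate with (assumed invocation, adapted from the original command):
//   nextflow run realdataset.nf --epoch 80 -c nextflow.config -profile local -resume
profiles {
    local {
        // Run tasks on the machine where nextflow itself is running.
        process.executor = 'local'
        // Let the executor handle at most one task at a time.
        executor.queueSize = 1
    }
}
```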

If this fails, I would connect to the server, change directory to one of the hanging processes (for instance /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22), and run the command: bash .command.sh This should run the task directly and show you any Python errors or other code-specific errors.
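The manual re-run described above could be wrapped as follows. This is a minimal sketch: the function name rerun_task is hypothetical, and it assumes the pipeline has been stopped first (otherwise the task would run twice and use twice the memory) and that the shell has the same Python environment the pipeline uses.

```shell
# Re-run a single Nextflow task by hand so its errors are printed to the
# terminal. Takes the task's work directory as its only argument.
rerun_task() {
    local workdir="$1"
    if [ ! -f "$workdir/.command.sh" ]; then
        echo "no .command.sh found in: $workdir" >&2
        return 1
    fi
    # Use a subshell so the caller's working directory is left unchanged.
    (
        cd "$workdir" || exit 1
        bash .command.sh
    )
}

# Example, using the work directory quoted in the comment above:
#   rerun_task /home2/zhc/DRFNS/work/4a/265bb911fd52b33cb3f49a2f6e5b22
```

Nextflow also keeps the task's captured output in the same directory (.command.log and .command.err), which can be worth reading before re-running anything.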