NVIDIA-Genomics-Research / AtacWorks

Deep learning based processing of Atac-seq data
https://clara-parabricks.github.io/AtacWorks/

No space left on device error during inference #133

Closed · zchiang closed this issue 4 years ago

zchiang commented 4 years ago

Hi,

I've been running AtacWorks on my university's cluster, where not every job gets the same GPU. Occasionally, I run into the following error during the inference step:

Inference ##########---------- [200/412]
Inference ##########---------- [210/412]
Process Process-2:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/n/holylfs/LABS/buenrostro_lab/Users/zchiang/projects/AtacWorks/main.py", line 127, in save_to_bedgraph
    df_to_bedGraph(batch_bg, outfile)
  File "/n/holylfs/LABS/buenrostro_lab/Users/zchiang/projects/AtacWorks/claragenomics/io/bedgraphio.py", line 88, in df_to_bedGraph
    df.to_csv(outfile, sep='\t', header=False, index=False)
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/site-packages/pandas/core/generic.py", line 3229, in to_csv
    formatter.save()
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 202, in save
    self._save()
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 324, in _save
    self._save_chunk(start_i, end_i)
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 356, in _save_chunk
    libwriters.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 69, in pandas._libs.writers.write_csv_rows
OSError: [Errno 28] No space left on device
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/n/holylfs/LABS/buenrostro_lab/Users/zchiang/projects/AtacWorks/main.py", line 220, in writer
    pool.starmap(save_to_bedgraph, map_args)
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/pool.py", line 296, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/n/home01/zchiang/.conda/envs/AtacWorks/lib/python3.6/multiprocessing/pool.py", line 670, in get
    raise self._value
OSError: [Errno 28] No space left on device
Inference ###########--------- [220/412]
Inference ###########--------- [230/412]

Two questions:

1) Is this an issue with the GPU or the CPU? And is there a parameter I can adjust to avoid it?
2) Would it be possible to stop inference from running when it encounters an error like this?

Please let me know if you need any other info.

ntadimeti commented 4 years ago

@zchiang I would need a bit more info about the GPUs that are assigned to the job when this error happens. Would you be able to share the logs?

And we should be able to kill inference when such errors pop up. I'll look into it.
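
This isn't how main.py handles it today, just a rough sketch of the fail-fast idea, based on the writer/pool.starmap structure visible in the traceback above; the function arguments and logging here are illustrative:

import os
import signal
import sys
from multiprocessing import Pool


def writer(save_to_bedgraph, map_args, num_workers, infer_pid):
    """Write bedGraph chunks; abort the whole run if any chunk fails to write."""
    try:
        with Pool(num_workers) as pool:
            pool.starmap(save_to_bedgraph, map_args)
    except OSError as err:  # e.g. [Errno 28] No space left on device
        print("Writer failed ({}); stopping inference.".format(err), file=sys.stderr)
        os.kill(infer_pid, signal.SIGTERM)  # tell the inference process to stop
        sys.exit(1)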

zchiang commented 4 years ago

Here's an example of one of the nvidia-smi outputs before anything starts running.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   41C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In practice, the cluster has a mix of Tesla K20m, K40m, K80, and V100 GPUs, and I can't control which job gets which one. I wrote a super simple script that detects which GPU each job gets and adjusts the inference --bs parameter accordingly: 128 for the K-series cards and 256 for the V100 (a sketch of that kind of wrapper is shown after the command below). The inference command I've been running is:

(python $root_dir/main.py --infer \
        --infer_files $out_dir/$prefix/$prefix.h5 \
        --intervals_file $intervals \
        --sizes_file $ref_dir/mm10.chrom.sizes \
        --weights_path $saved_model_dir/$model \
        --out_home $out_dir/$prefix \
        --result_fname 1M_pretrained \
        --model resnet \
        --nblocks 5 \
        --nfilt 15 \
        --width 51 \
        --dil 8 \
        --nblocks_cla 2 \
        --nfilt_cla 15 \
        --width_cla 51 \
        --dil_cla 8 \
        --task both \
        --num_workers 4 \
        --gen_bigwig \
        --pad 5000 \
        --bs $bs)>&2
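
For reference, a minimal sketch of that GPU-detection wrapper, assuming nvidia-smi is on the PATH; the script name, batch-size mapping, and helper names are illustrative, not part of AtacWorks:

import subprocess


def pick_batch_size(default=128):
    """Return an inference batch size based on the GPU the job landed on."""
    # Query the first GPU's model name, e.g. "Tesla K80" or "Tesla V100-SXM2-16GB".
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
    ).decode()
    gpu_name = out.strip().splitlines()[0]
    # Larger batches fit on the V100; fall back to the smaller size on the Kepler cards.
    return 256 if "V100" in gpu_name else default


if __name__ == "__main__":
    # Print the value so a job script can capture it, e.g. bs=$(python pick_bs.py)
    print(pick_batch_size())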
zchiang commented 4 years ago

Oh, I should add that each individual job on the cluster also gets 4 CPUs and 24 GB of memory.

ntadimeti commented 4 years ago

@zchiang It looks like a CPU-side problem, since it's crashing at the df.to_csv step, which runs on the CPU. Could you tell me whether this problem is intermittent or completely reproducible? To me it doesn't seem to have much to do with the GPUs on the cluster at all; it looks like your disk is running out of space. Can you verify whether that's the case?
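
For instance, something like this quick check (a sketch; the path is a placeholder for whatever directory you pass as --out_home) would show whether the filesystem holding the output is full:

import shutil

# Replace with the directory passed to --out_home (where the bedGraph/bigWig output goes).
out_dir = "/path/to/out_dir"

usage = shutil.disk_usage(out_dir)
print("free: {:.1f} GiB of {:.1f} GiB".format(usage.free / 2**30, usage.total / 2**30))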

ntadimeti commented 4 years ago

Hi @zchiang, just checking in to see if you are still facing this problem. Please let me know if it's resolved; I'd like to close this issue. If you're still facing it, I'd be happy to look into it with logs and a repro dataset.

ntadimeti commented 4 years ago

Closing due to inactivity. We can re-open in the future if necessary.