Open CoolCurl opened 1 year ago
I am also interested in this, @CoolCurl. Did you find a way?
Hi, sorry, this isn't easy to address in the current implementation, though it is something I will think about adding in the future. Here is what I have done in the past: I gather the indices of the workers that didn't complete by checking whether their output files exist, then I resubmit each of those workers separately, one job per node.
import os

import numpy as np
import pandas as pd

# Assumes cnmf_obj, total_workers, name, cnmfdir, cnmfout and jname
# are already defined earlier in the notebook.

def worker_filter(iterable, worker_index, total_workers):
    return (p for i, p in enumerate(iterable) if (i - worker_index) % total_workers == 0)

def load_df_from_npz(filename):
    with np.load(filename, allow_pickle=True) as f:
        obj = pd.DataFrame(**f)
    return obj

# Identify the indices of the workers with missing output and store them in missing
missing = []
run_params = load_df_from_npz(cnmf_obj.paths['nmf_replicate_parameters'])
for worker_i in range(total_workers):
    jobs_for_this_worker = worker_filter(range(len(run_params)), worker_i, total_workers)
    for idx in jobs_for_this_worker:
        p = run_params.iloc[idx, :]
        outfn = cnmf_obj.paths['iter_spectra'] % (p['n_components'], p['iter'])
        if not os.path.exists(outfn):
            print(worker_i, outfn)
            # Record each worker only once, even if several of its outputs are missing,
            # so the same worker is not submitted twice
            if worker_i not in missing:
                missing.append(worker_i)

# Submit the individual missing jobs to a single node each.
basecmd = "export OMP_NUM_THREADS=6; cnmf factorize --name {name} --output-dir {outdir} --total-workers {tw} --worker-index {i}"
q = 'medium'
for i in missing:
    cmd = basecmd.format(name=name, outdir=cnmfdir, i=i, tw=total_workers)
    e = os.path.join(cnmfout, '{j}.{i}.err.txt').format(i=i, j=jname)
    o = os.path.join(cnmfout, '{j}.{i}.out.txt').format(i=i, j=jname)
    bsub_cmd = 'bsub -q {q} -J {j} -o {o} -e {e} "{cmd}"'.format(q=q, j=jname, e=e, o=o, cmd=cmd)
    print(bsub_cmd)
    # Submit the job in jupyter notebook using !
    !{bsub_cmd}
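If you want to try the detection step without a cNMF object or a cluster, here is a minimal standalone sketch of the same check. The names `find_incomplete_workers`, `expected_outputs` and the `exists` argument are made up for illustration and are not part of cnmf. It assumes the list of expected output paths is in the same order as the rows of the run parameters, so that the round-robin assignment matches what `worker_filter` does above.

```python
import os


def worker_filter(iterable, worker_index, total_workers):
    """Yield the items assigned to one worker by round-robin position."""
    return (p for i, p in enumerate(iterable)
            if (i - worker_index) % total_workers == 0)


def find_incomplete_workers(expected_outputs, total_workers, exists=os.path.exists):
    """Return the sorted worker indices that have at least one missing output.

    expected_outputs: output paths in run-parameter order (an assumption).
    exists: hypothetical hook for the existence check, swappable for testing.
    """
    incomplete = set()
    for worker_i in range(total_workers):
        for path in worker_filter(expected_outputs, worker_i, total_workers):
            if not exists(path):
                # One missing file is enough to mark this worker for resubmission
                incomplete.add(worker_i)
                break
    return sorted(incomplete)


# Example with made-up file names: 6 outputs spread over 3 workers,
# where only the files belonging to worker 2 (positions 2 and 5) are missing.
paths = ['f%d' % i for i in range(6)]
present = {'f0', 'f1', 'f3', 'f4'}
print(find_incomplete_workers(paths, 3, exists=lambda p: p in present))
```

Each worker index comes back at most once, so every incomplete worker would be resubmitted exactly once.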
If my process is killed before it has completed an iteration, but some of the files have already been saved, can I write some code in cnmf.py to continue my work?