Angryrou / UDAO2022

MIT License

for 1st release #17

Open Angryrou opened 1 year ago

Angryrou commented 1 year ago

Additional TODOs @QiFAN-lix

Angryrou commented 1 year ago

@QiFAN-lix please update your progress on the todo items above. Also please let me know when you can finish coding so that I can start working on it.

QiFAN-lix commented 1 year ago

Hi @Angryrou, I will finish coding by this weekend, and I will let you know then. P.S. I will add the tag after I commit the code for the code review.

QiFAN-lix commented 1 year ago

Hi @Angryrou, sorry for the late reply; I have been feeling quite unwell this weekend with muscle pain and a fever. I have added examples for HCF and GPR (already tested on ercilla), which I think is enough for you to start a code review. Please feel free to let me know if you have any questions, and I will update the NN later.

QiFAN-lix commented 1 year ago

Hi @Angryrou, I updated the example with the NN model.

In the NN model, I provided two options for users to define their own functions for objectives and constraints: one is an NN and the other is HCF. However, I failed to train the NN with the training data in the example; its loss stays the same. The aim of training this NN model is to make sure all three models solve the same optimization problem, which GPR already does.

As the model should be provided by the users, the example currently uses HCF by default. It would be great if you could fix the NN model quickly; otherwise, we can use another simple NN model in the example.
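A minimal sketch of one way to check why a loss stays flat, assuming a standard PyTorch training step; the model, optimizer, and data below are placeholders, not the repository's actual code:

import torch

def debug_training_step(model, loss_fn, optimizer, x, y):
    """One training step with checks for the usual causes of a flat loss."""
    optimizer.zero_grad()
    pred = model(x)
    # A silent broadcast (e.g. (N,) targets vs (N, 1) predictions) is a common bug.
    assert pred.shape == y.shape, f"shape mismatch: {pred.shape} vs {y.shape}"
    loss = loss_fn(pred, y)
    loss.backward()
    # If every gradient norm is ~0, no learning rate will move the loss.
    for name, p in model.named_parameters():
        norm = p.grad.norm().item() if p.grad is not None else float("nan")
        print(f"{name}: grad norm = {norm:.3e}")
    optimizer.step()
    return loss.item()

# Example usage with placeholder data and a tiny regression model.
model = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
x, y = torch.rand(8, 2), torch.rand(8, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
print(debug_training_step(model, torch.nn.MSELoss(), opt, x, y))

The later NN run in this thread also warns about a target size of torch.Size([8]) versus an input size of torch.Size([8, 1]) in MSELoss, so a shape check like the one above is worth running first.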

Thanks for your effort and please feel free to let me know if you have any comments.

Angryrou commented 1 year ago

Ok, I will look into it. Can you tag an ICDE-paper-v0.1 for reproducing the results? I will be coding tomorrow @QiFAN-lix

Angryrou commented 1 year ago

I have tagged it.

Angryrou commented 1 year ago

I am updating the main issues for which I cannot make a quick fix in the current version (still updating).

QiFAN-lix commented 1 year ago

Hi @Angryrou , thanks for your effort and update! I will work on them after receiving all your updates.

Angryrou commented 1 year ago

Our terms for the variables are inconsistent: we use knob, variable, and configuration for the same or different things. Since our work originally targets knob tuning in Spark, let us use only knob and configuration (conf for short), as in our ICDE paper. In most cases, I will also use knobs to refer to a conf.

@QiFAN-lix please do not use the term variable for knob, to avoid inconsistency.

QiFAN-lix commented 1 year ago

Hi @Angryrou, to clarify, could you please tell me which items I need to modify and when I should do them? I know that one important issue is refactoring the MOGD, which may affect the design of other parts. Besides that, which independent items can I work on in parallel for now without messing up the git commits?

My aim is to finish the code review as soon as possible, under a design we both agree on. If we cannot finish by this weekend, we can try to finish next week. To me, next week is a reachable deadline.

Please feel free to let me know if you have any comments.

Angryrou commented 1 year ago

I will let you know when I sort it out.

Angryrou commented 1 year ago

A potential error to fix when running:

[screenshot of the error]
Angryrou commented 1 year ago

Hi Qi, I tested the functionality of the examples. They mostly look good to me. You just need to fix the assertion error I mentioned above.

Please also finish the README yourself so that Arnab and I can further read and check it (and modify it if needed). Do not forget to provide the PO figures as mentioned in

QiFAN-lix commented 1 year ago

Hi @Angryrou, did you check whether the results are from PF-AP or PF-AS? BTW, did you modify any code after the review? Currently, I am not clear on what you will update for the review.

Angryrou commented 1 year ago

I followed the commands you gave above, which led to the warning message. The configuration file should be at

You can check my two commits in the git history. I only changed some formatting and reorganized the file structure.

Angryrou commented 1 year ago

@sinharnab please give some inputs from your code review.

QiFAN-lix commented 1 year ago

Hi @Angryrou, please change the pf_option in the .json file to pf-ap. In the example, by default, it asserts the PF-AP results.
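A minimal sketch of the change being described, assuming pf_option sits at the top level of the config file (its exact nesting in the repository's configs may differ):

import json

# Hypothetical helper: flip pf_option to "pf-ap" in the example config.
cfg_path = "examples/optimization/heuristic_closed_form/configs/2d/pf_mogd.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["pf_option"] = "pf-ap"  # the example asserts PF-AP results by default

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

Editing the value directly in the .json file by hand achieves the same thing.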

Angryrou commented 1 year ago

I observed the same errors after changing it to pf-ap:

(udao2) [chenghao@ercilla UDAO2022]$ python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/2d/pf_mogd.json
/home/chenghao/miniconda3/envs/udao2/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 2], strides() = [2, 1]
param.sizes() = [16, 2], strides() = [1, 16] (Triggered internally at  ../torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Pareto solutions of wl_None:
[[  0.      50.    ]
 [ 19.748   30.537 ]
 [ 23.0612  26.6653]
 [ 24.7432  26.5858]
 [ 24.9808  26.4452]
 [ 26.6404  25.5601]
 [ 37.4084  20.4521]
 [136.       4.    ]]
Variables of wl_None:
[[5.   3.  ]
 [2.79 2.77]
 [2.67 2.42]
 [2.57 2.47]
 [2.56 2.46]
 [2.49 2.4 ]
 [2.11 2.  ]
 [0.   0.  ]]
Time cost of wl_None:
0.4915339946746826
/home/chenghao/UDAO2022/examples/optimization/heuristic_closed_form/pf.py:55: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  assert (po_vars == np.array(
Traceback (most recent call last):
  File "/home/chenghao/UDAO2022/examples/optimization/heuristic_closed_form/pf.py", line 55, in <module>
    assert (po_vars == np.array(
AttributeError: 'bool' object has no attribute 'all'
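For reference, a minimal sketch of why this assertion fails and one shape-safe way to write it; the arrays below are illustrative, not the example's actual reference values:

import numpy as np

po_vars = np.array([[5.0, 3.0], [2.79, 2.77]])   # solver output (illustrative values)
expected = np.array([[5.0, 3.0], [2.79, 2.77]])  # hard-coded reference (illustrative)

# If the two arrays had different shapes, `po_vars == expected` could not broadcast:
# older NumPy emits the "elementwise comparison failed" DeprecationWarning and
# returns the plain Python bool False, and calling .all() on a bool raises exactly
# the AttributeError shown above. A shape-aware comparison avoids both problems:
assert po_vars.shape == expected.shape and np.allclose(po_vars, expected), \
    "Pareto-optimal variables differ from the reference results"
print("assertion passed")

So either aligning the reference values with what the current config produces, or switching the example to a shape check plus np.allclose, would make the assertion fail cleanly instead of crashing.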
Angryrou commented 1 year ago

@QiFAN-lix

Could you please show us the correct output (in a screenshot) from a run on your own machine?

Could you please modify the .json files to align with the examples so that the assertion passes?

QiFAN-lix commented 1 year ago

Hi @Angryrou, I cloned the latest version and tested it on ercilla with the following results after setting 'pf-ap' in the json file; it works fine.

[screenshot of the results]
Angryrou commented 1 year ago

Can you attach the results together with the commands you ran? @QiFAN-lix

Angryrou commented 1 year ago

Could you please modify the .json files to align with the examples so that the assertion passes?

@QiFAN-lix would you please fix this first so that other users (like Arnab) can test it directly?

@sinharnab Would you please run the example commands and let us know how everything works on your side?

QiFAN-lix commented 1 year ago

HCF:

[screenshots of the HCF results]

NN:

[screenshot of the NN results]
QiFAN-lix commented 1 year ago

@Angryrou @sinharnab please git pull the latest code.

Angryrou commented 1 year ago

Works like a charm.

Could you please add a parameter named verbose to control whether to print the intermediate results, and set verbose=False by default? Thanks.
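A minimal sketch of the kind of flag being requested, assuming a solver class that currently prints its intermediate state (the class and method names below are illustrative, not the repository's actual MOGD API):

class ExampleSolver:
    """Illustrative solver wrapper; not the actual MOGD class."""

    def __init__(self, verbose: bool = False):
        # Quiet by default; callers opt in to the intermediate printouts.
        self.verbose = verbose

    def _log(self, msg: str) -> None:
        if self.verbose:
            print(msg)

    def solve(self, n_iters: int = 3) -> str:
        for i in range(n_iters):
            # ... one optimization step would go here ...
            self._log(f"the number of iteration is {i}")
        return "done"

# Default stays silent; verbose=True restores the current output.
ExampleSolver().solve()
ExampleSolver(verbose=True).solve()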

QiFAN-lix commented 1 year ago

Hi @Angryrou, did you use the same conda env as UDAO2022?

If you are available now, it would be quicker to discuss over Zoom.

sinharnab commented 1 year ago

Can the package platypus-opt==1.0.4 be listed in the requirements.txt?

sinharnab commented 1 year ago

@Angryrou @sinharnab please git pull the latest code.

I did, and running HCF with

(UDAO2022) MBPro-Arnab:UDAO2022 arnab$ /opt/anaconda3/envs/UDAO2022/bin/python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/2d/pf_mogd.json

gives me the following:

(UDAO2022) MBPro-Arnab:UDAO2022 arnab$ /opt/anaconda3/envs/UDAO2022/bin/python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/2d/pf_mogd.json
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 2], strides() = [2, 1]
param.sizes() = [16, 2], strides() = [1, 16] (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
the number of iteration is 0
the cells are: [{'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(27.), tensor(50.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(27.), tensor(50.)]}]
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 2], strides() = [2, 1]
param.sizes() = [16, 2], strides() = [1, 16] (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
the number of iteration is 0
the cells are: [{'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(27.), tensor(50.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(27.), tensor(50.)]}]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/examples/optimization/heuristic_closed_form/pf.py", line 42, in <module>
    po_objs_list, po_vars_list, jobIds, time_cost_list = moo.solve(moo_algo, solver, add_params)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/generic_moo.py", line 140, in solve
    po_objs, po_vars = pf.solve(wl_id, accurate, alpha, self.var_ranges, self.var_types, precision_list, n_probes, n_grids=n_grids, max_iters=max_iters, anchor_option=anchor_option)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/progressive_frontier.py", line 79, in solve
    po_objs, po_vars = self.solve_pf_ap(wl_id, accurate, alpha, self.obj_names, var_bounds, self.opt_obj_ind, var_types, precision_list, n_grids, max_iters, anchor_option=anchor_option)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/progressive_frontier.py", line 305, in solve_pf_ap
    ret_list = self.mogd.constraint_so_parallel(wl_id, obj=obj_names[opt_obj_ind], opt_obj_ind=opt_obj_ind,
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/solver/mogd.py", line 399, in constraint_so_parallel
    with Pool(processes=self.process) as pool:
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
sinharnab commented 1 year ago

@Angryrou @sinharnab please git pull the latest code.

Running NN and similar problem:

(UDAO2022) MBPro-Arnab:UDAO2022 arnab$ /opt/anaconda3/envs/UDAO2022/bin/python examples/optimization/neural_network/pf.py -c examples/optimization/neural_network/configs/pf_mogd.json
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/nn/modules/loss.py:530: UserWarning: Using a target size (torch.Size([8])) that is different to the input size (torch.Size([8, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 2], strides() = [2, 1]
param.sizes() = [16, 2], strides() = [1, 16] (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
the number of iteration is 0
the cells are: [{'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(27.), tensor(50.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(27.), tensor(50.)]}]
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/nn/modules/loss.py:530: UserWarning: Using a target size (torch.Size([8])) that is different to the input size (torch.Size([8, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
/opt/anaconda3/envs/UDAO2022/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 2], strides() = [2, 1]
param.sizes() = [16, 2], strides() = [1, 16] (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
the number of iteration is 0
the cells are: [{'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(0.), tensor(68.)], 'obj_2': [tensor(27.), tensor(50.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(4.), tensor(27.)]}, {'obj_1': [tensor(68.), tensor(136.)], 'obj_2': [tensor(27.), tensor(50.)]}]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/examples/optimization/neural_network/pf.py", line 44, in <module>
    po_objs_list, po_vars_list, jobIds, time_cost_list = moo.solve(moo_algo, solver, add_params)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/generic_moo.py", line 140, in solve
    po_objs, po_vars = pf.solve(wl_id, accurate, alpha, self.var_ranges, self.var_types, precision_list, n_probes, n_grids=n_grids, max_iters=max_iters, anchor_option=anchor_option)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/progressive_frontier.py", line 79, in solve
    po_objs, po_vars = self.solve_pf_ap(wl_id, accurate, alpha, self.obj_names, var_bounds, self.opt_obj_ind, var_types, precision_list, n_grids, max_iters, anchor_option=anchor_option)
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/moo/progressive_frontier.py", line 305, in solve_pf_ap
    ret_list = self.mogd.constraint_so_parallel(wl_id, obj=obj_names[opt_obj_ind], opt_obj_ind=opt_obj_ind,
  File "/Users/arnab/Documents/work/repos/EcolePoly/UDAO2022/optimization/solver/mogd.py", line 399, in constraint_so_parallel
    with Pool(processes=self.process) as pool:
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/opt/anaconda3/envs/UDAO2022/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
QiFAN-lix commented 1 year ago

Hi @sinharnab, did you run it on ercilla or on your laptop? It seems to be a multiprocessing issue.

Angryrou commented 1 year ago

@sinharnab You should test it on Ercilla.

@QiFAN-lix please also mention in the README that our PF-AP currently does not support OSX.
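For the README note, the root cause is that macOS defaults to the spawn start method (Python 3.8+), so a script that creates a multiprocessing.Pool at import time re-executes itself in every worker and hits the bootstrapping error above. A minimal sketch of the usual workaround, assuming the example's entry code can live under a main guard (pf.py's actual structure may differ):

import multiprocessing as mp

def solve_one(cell_id: int) -> int:
    # Stand-in for one constrained single-objective solve per cell.
    return cell_id * cell_id

def main() -> None:
    # With the "spawn" start method, each worker re-imports this module, so the
    # Pool must only be created under the main guard to avoid the
    # "bootstrapping phase" RuntimeError reported above.
    with mp.Pool(processes=2) as pool:
        print(pool.map(solve_one, range(4)))

if __name__ == "__main__":
    main()

Alternatively, documenting the OSX limitation in the README, as suggested above, also works.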

QiFAN-lix commented 1 year ago

Hi @Angryrou and @sinharnab, I updated the 3D examples in the README. Please feel free to let me know if you have any suggestions. Now I am doing a final check on the code to make it cleaner, and I will double-check the README again.

BTW, when will we package our MOO code?

P.S. I sent you the email over an hour ago, but it just notified me that it failed to send.

Angryrou commented 1 year ago

I checked your code. It looks good to me.

You can go ahead to package the MOO code yourself.

QiFAN-lix commented 1 year ago

Hi @Angryrou, thanks for your feedback! Could you please share your experience with the earlier internal code release? I remember Yanlei mentioned that you previously did an internal code release to Luciano as well.

Hi @sinharnab, do you have any comments? If there are no more, could you please help me with the code packaging? I am not clear on how to package it.

Angryrou commented 1 year ago

@QiFAN-lix you can check here for my release to Luciano.

I prepared a clear README with runnable examples for him. We also had one follow-up discussion for some clarification.

sinharnab commented 1 year ago

Hi @Angryrou and @sinharnab, I updated the 3D examples in the README. Please feel free to let me know if you have any suggestions. Now I am doing a final check on the code to make it cleaner, and I will double-check the README again.

BTW, when will we package our MOO code?

P.S. I sent you the email over an hour ago, but it just notified me that it failed to send.

Hi Qi,

I executed on Ercilla. I have errors for PF-AP (with MOGD) 3D.

(UDAO2022-release) [arnab@node18 UDAO2022]$ python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/3d/pf_mogd.json
/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 3], strides() = [3, 1]
param.sizes() = [16, 3], strides() = [1, 16] (Triggered internally at ../torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
    send(conn, destination_pid)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/socket.py", line 544, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/pool.py", line 513, in _handle_workers
    cls._maintain_pool(ctx, Process, processes, pool, inqueue,
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/pool.py", line 337, in _maintain_pool
    Pool._repopulate_pool_static(ctx, Process, processes, pool,
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/multiprocessing/popen_fork.py", line 65, in _launch
    child_r, parent_w = os.pipe()
OSError: [Errno 24] Too many open files

sinharnab commented 1 year ago

Hi @Angryrou, thanks for your feedback! Could you please share your experience with the earlier internal code release? I remember Yanlei mentioned that you previously did an internal code release to Luciano as well.

Hi @sinharnab, do you have any comments? If there are no more, could you please help me with the code packaging? I am not clear on how to package it.

The 3D examples ran successfully on my laptop. I had only one error on Ercilla. Concerning the packaging, how do you want it to be, @Angryrou? If I remember correctly, Yanlei said she is OK with a repository.

QiFAN-lix commented 1 year ago

Hi @sinharnab, the 3D pf works fine on node 19.

[screenshot of the results]

Could you check the following command?

[screenshot of the command]
Angryrou commented 1 year ago

@sinharnab Yes, a repository is good enough for the first and internal release.

Yanlei suggested considering the optimization module as an independent package, but with example models so that other people know how to plug in a model, custom objectives, and parameters to use this package.

Currently, our optimization module is independent of the trace module and the model module. For me, the current branch is fine for the first release.

What do you guys think?
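If the optimization module is later packaged on its own, a minimal sketch of what that could look like; the package name, version, and dependency list below are placeholders, not decisions from this thread:

# setup.py (placeholder metadata; the actual name and version would be decided at release time)
from setuptools import setup, find_packages

setup(
    name="udao-optimization",
    version="0.1.0",
    description="Multi-objective optimization (MOO) module with pluggable user models",
    packages=find_packages(include=["optimization", "optimization.*"]),
    python_requires=">=3.9",
    install_requires=[
        "numpy",
        "torch",
        "platypus-opt==1.0.4",  # flagged earlier as missing from requirements.txt
    ],
)

With such a file at the repository root, pip install -e . gives an editable install, and the examples can then import the package the same way external users would.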

Angryrou commented 1 year ago

[Errno 24] Too many open files is an OS error. To fix it, you need to tune an OS parameter with root privilege (you can google the error).

I did it already for node 19. Please focus on node 19 for the test at the moment. @sinharnab
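For anyone hitting this, a minimal sketch (not from the repository) of how to inspect the per-process open-file limit from Python; ulimit -n shows the same soft limit in the shell, and raising the hard limit itself still requires root, as noted above:

import resource

# Soft limit = what the process may use now; hard limit = the ceiling a
# non-root user can raise the soft limit up to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Each worker in a multiprocessing.Pool consumes pipes/file descriptors, so a
# small soft limit can trigger "[Errno 24] Too many open files" on large pools.
if hard != resource.RLIM_INFINITY and soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f"raised soft limit to {hard}")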

QiFAN-lix commented 1 year ago

Hi @sinharnab, did you pass the test running on Ercilla?

Hi @Angryrou, I agree with releasing the current repository with the optimization branch.

Angryrou commented 1 year ago

Please go ahead to make a tag and release to Luciano and Yanlei.

sinharnab commented 1 year ago

python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/3d/pf_mogd.json

I confirm it works properly on node19. @QiFAN-lix @Angryrou

(UDAO2022-release) [arnab@node19 UDAO2022]$ python examples/optimization/heuristic_closed_form/pf.py -c examples/optimization/heuristic_closed_form/configs/3d/pf_mogd.json
/home/arnab/miniconda3/envs/UDAO2022-release/lib/python3.9/site-packages/torch/autograd/__init__.py:173: UserWarning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance.
grad.sizes() = [16, 3], strides() = [3, 1]
param.sizes() = [16, 3], strides() = [1, 16] (Triggered internally at ../torch/csrc/autograd/functions/accumulate_grad.h:193.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Pareto solutions of wl_None:
[[ 0.          2.87662058  7.77752971]
 [ 0.          0.74579052  9.90835976]
 [ 8.91249466  0.77241617  5.16924715]
 [ 8.91249466  1.96074879  3.98091435]
 [ 9.48999977  0.66689998  2.84310007]
 [11.96000004  0.97759998  0.0624    ]
 [13.          0.          0.        ]]
Variables of wl_None:
[[0.   0.27 0.39]
 [0.   0.07 0.39]
 [0.4  0.87 0.19]
 [0.4  0.67 0.19]
 [0.27 0.81 1.  ]
 [0.08 0.06 1.  ]
 [0.   0.24 1.  ]]
Time cost of wl_None:
24.9175443649292
Test successfully!

QiFAN-lix commented 1 year ago

Please go ahead to make a tag and release to Luciano and Yanlei.

Hi @Angryrou, could you share Luciano's email address so that I can share the repository with him?

Angryrou commented 1 year ago

ldpalma@amazon.fr

Luciano is already in the repository. You should create a tag and email everyone about the release.