LightForm-group / non-repo-issues

A repo to hold any project tasks that don't have their own repo, or aren't related to code.

Yuchen Zheng matflow help #21

Open gcapes opened 3 months ago

gcapes commented 3 months ago

Yuchen is already using Matflow on the CSF to simulate the compression of Al crystals. When the volume elements are compressed by 60%, the simulation fails with an error (which others get too) that results from the deformed shape of the volume element. Remeshing is a potential solution. The proposal is to apply the load incrementally, but when he tries this, the task simply repeats the same simulation instead of continuing where it left off. So this looks like a Matflow usage problem: how to use the output from the previous step to continue the simulation.
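To illustrate why each loading step has to start from the previous deformed state, here is a minimal arithmetic sketch. It assumes "compressed by 60%" means the final length is 40% of the initial length, and that the load is split into equal multiplicative steps; the function name and the choice of three steps are hypothetical and not taken from the workflow.

```python
def per_step_stretch(total_stretch, n_steps):
    """Stretch factor per step so that n_steps applied in sequence give total_stretch."""
    return total_stretch ** (1.0 / n_steps)


total_stretch = 0.4  # assumption: 60% compression leaves 40% of the original length
n_steps = 3          # hypothetical number of increments

factor = per_step_stretch(total_stretch, n_steps)

# Continuing from the previous state: the stretch accumulates step by step.
length = 1.0
for step in range(1, n_steps + 1):
    length *= factor
    print(f"step {step}: relative length {length:.4f}")

# Restarting from the undeformed state each time: every run ends at `factor`
# (roughly 0.74 for three steps), so the target of 0.4 is never reached.
print(f"restart each time: relative length {factor:.4f}")
```

If the second task re-reads the undeformed volume element, it behaves like the last line: it repeats the first increment rather than accumulating deformation.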

Actions:

gcapes commented 3 months ago

Hi @YuchenZZheng, I've just run the example_problem.yaml workflow that you sent me, and haven't found any errors. Should I get errors, or would I need to inspect the output files to realise the simulation has aborted? I might just not be looking in the right place.

YuchenZZheng commented 3 months ago

I suppose you can use the 'matflow-dev show -f' command to see if there's an error, or look at "execute/task_2_simulate_VE_loading_damask/e_0/r_0/stderr.log" in the output folder.
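For checking several tasks at once, a small script can list every non-empty stderr.log in the output folder. This is only a sketch: it assumes the directory layout matches the path quoted above (execute/<task>/e_<n>/r_<n>/stderr.log), and the function name is hypothetical. A non-empty file is not necessarily a failure, since informational messages can also be written to stderr.

```python
from pathlib import Path


def find_nonempty_stderr_logs(workflow_dir):
    """Return stderr.log files under execute/ that contain at least one byte."""
    execute_dir = Path(workflow_dir) / "execute"
    # Assumed layout: execute/<task name>/e_<element>/r_<run>/stderr.log
    logs = sorted(execute_dir.glob("*/e_*/r_*/stderr.log"))
    return [path for path in logs if path.stat().st_size > 0]
```

Calling `find_nonempty_stderr_logs("path/to/workflow")` returns a list of paths to open and read by hand.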

gcapes commented 3 months ago

Perfect, thanks! Not sure why I didn't see that earlier :man_shrugging:

gcapes commented 3 months ago

With the caveat that I've really no idea what I'm looking at 😄, it seems that the damask_post_processing step modifies an hdf5 file but by default doesn't save it to the artifacts directory (it only exists in the execute directory). The subsequent post-processing and plotting can still use the modified hdf5 file because they're in the same task.

The only hdf5 files I've found are

./execute/task_2_simulate_VE_loading_damask/e_0/r_0/geom_load.hdf5 
./execute/task_4_simulate_VE_loading_damask_2/e_0/r_0/geom_load.hdf5

and I think they're created by <<script:damask/write_geom.py>> in the simulate_VE_loading_damask task schema.
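To see at a glance whether any hdf5 file ended up outside the execute directory, the search can be scripted. The sketch below assumes only that the workflow folder contains top-level directories such as execute and artifacts, as described in this thread; the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def hdf5_files_by_top_dir(workflow_dir):
    """Group every *.hdf5 file under workflow_dir by its top-level directory."""
    root = Path(workflow_dir)
    grouped = defaultdict(list)
    for path in sorted(root.rglob("*.hdf5")):
        rel = path.relative_to(root)
        # rel.parts[0] is e.g. "execute" or "artifacts"
        grouped[rel.parts[0]].append(rel.as_posix())
    return dict(grouped)
```

If the result has an "execute" key but no "artifacts" key, the file was not saved as an artifact.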

I'm not sure, but it might be that in order to access this in the next task, it needs to be saved. I think your second loading task is using the geom.vti file as the input. Given I don't really understand whether a VE_response output is the same thing as a volume_element input, you might have some success changing the default save_files: false to save_files: true on whichever of the output file parsers is creating the input file you need for the next task.

https://github.com/hpcflow/matflow-new/blob/aplowman/develop/matflow/data/template_components/task_schemas.yaml#L497

gcapes commented 1 month ago

Hi @YuchenZZheng, Did you get this sorted in the end?

YuchenZZheng commented 1 month ago

Yes, I did. Thank you for the help.

gcapes commented 1 month ago

Would you be able to explain the fix?

YuchenZZheng commented 1 month ago

Sorry, I thought we were talking about getting the latest version of MatFlow to work. To be honest, the new version didn't solve any of my previous problems.

gcapes commented 1 month ago

I've just tried to run this example again with my newly installed matflow-full-env and matflow version on CSF3, and get this error now:

$ matflow go example_problem.yaml 
/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed
from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be 
removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
ERROR matflow.persistence: batch update exception!
Traceback (most recent call last):
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/bin/matflow", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/cli.py", line 161, in make_and_submit_workflow
    out = app.make_and_submit_workflow(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/app.py", line 280, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/app.py", line 1403, in _make_and_submit_workflow
    submitted_js = wk.submit(
                   ^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/workflow.py", line 2330, in submit
    exceptions, submitted_js = self._submit(
                               ^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/log.py", line 25, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/workflow.py", line 2237, in _submit
    new_sub = self._add_submission(tasks=tasks, JS_parallelism=JS_parallelism)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/log.py", line 25, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/workflow.py", line 2590, in _add_submission
    jobscripts=self.resolve_jobscripts(tasks),
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/log.py", line 25, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/workflow.py", line 2620, in resolve_jobscripts
    js, element_deps = self._resolve_singular_jobscripts(tasks)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/log.py", line 25, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/workflow.py", line 2663, in _resolve_singular_jobscripts
    res, res_hash, res_map, EAR_map = generate_EAR_resource_map(task, loop_idx_i)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/log.py", line 25, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/submission/jobscript.py", line 59, in generate_EAR_resource_map
    res_hash = run.resources.get_jobscript_hash()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/element.py", line 253, in get_jobscript_hash
    dct["scheduler_args"]["options"] = _hash_dict(scheduler_args["options"])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/iusers01/support/mbexegc2/yuchen-zheng/.venv/lib/python3.11/site-packages/hpcflow/sdk/core/element.py", line 241, in _hash_dict
    keys, vals = zip(*d.items())
                      ^^^^^^^
AttributeError: 'list' object has no attribute 'items'

YuchenZZheng commented 1 month ago

Hi Gerard, it might be because of the format of the resources block. Please try changing it to:

resources:
  any:
    scheduler: sge
    scheduler_args:
      shebang_args: --login
      options:
        -l: short
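
The traceback above fails on `zip(*d.items())`, which only works when `options` is a mapping. The following sketch reproduces that behaviour in isolation. It is not hpcflow's code: `split_options` is a hypothetical stand-in, and the list form shown is a guess at what the earlier YAML parsed to.

```python
def split_options(options):
    """Unpack keys and values the same way as the failing line in the traceback."""
    keys, vals = zip(*options.items())
    return keys, vals


as_mapping = {"-l": "short"}  # what the YAML block above parses to
as_list = ["-l short"]        # hypothetical list form of the same option

print(split_options(as_mapping))

try:
    split_options(as_list)
except AttributeError as exc:
    # Same message as the last line of the traceback
    print(exc)
```

Writing `options` as key/value pairs, rather than as a YAML list with leading dashes, keeps it a mapping.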

gcapes commented 1 month ago

Thanks Yuchen - it's now running.

gcapes commented 1 month ago

Ok, so I get an error in the output from the simulate_VE_loading_damask task, which I'll look at when I'm at home (I've saved a copy of the workflow directory to look at later).

gcapes commented 1 month ago

I've sent the output to Adam to get his thoughts.

gcapes commented 1 month ago

This looks like a damask error rather than a matflow error. Might be best to ask Joao for input?

JQFonseca commented 1 month ago

I have lost track of what error this is. Is it this: https://github.com/LightForm-group/non-repo-issues/issues/21#issuecomment-2247535241

gcapes commented 1 month ago

No, it's this in the stderr.log file

INFO:    Detected Singularity user configuration directory

 ┌─────────────────────────────────────────────────────────────────────┐

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        error                                        │
 │                        950                                          │
 ├─────────────────────────────────────────────────────────────────────┤
 │ max number of cut back exceeded, terminating                        │
 │                                                                     │
 └─────────────────────────────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        error                                        │
 │                        950                                          │
 ├─────────────────────────────────────────────────────────────────────┤
 │ max number of cut back exceeded, terminating                        │
 │                                                                     │
 └─────────────────────────────────────────────────────────────────────┘
 │                        error                                        │
 │                        950                                          │
 ├─────────────────────────────────────────────────────────────────────┤
 │ max number of cut back exceeded, terminating                        │
 │                                                                     │
 └─────────────────────────────────────────────────────────────────────┘

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        error                                        │
 │                        950                                          │
 ├─────────────────────────────────────────────────────────────────────┤
 │ max number of cut back exceeded, terminating                        │
 │                                                                     │
 └─────────────────────────────────────────────────────────────────────┘
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL IEEE_INEXACT_FLAG
STOP 1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL IEEE_INEXACT_FLAG
STOP 1
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_OVERFLOW_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL IEEE_INEXACT_FLAG
STOP 1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL IEEE_INEXACT_FLAG
STOP 1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[51711,1],2]
  Exit code:    1
--------------------------------------------------------------------------
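
Because the four MPI processes write their error boxes at the same time, the output above is interleaved and hard to read. A short script can pull the error number and message out of the boxes and count them. This is a sketch based only on the box layout visible in this log (a line reading "error", then a numeric code, then the message); the function name is hypothetical and other DAMASK versions may format errors differently.

```python
from collections import Counter


def count_boxed_errors(log_text):
    """Count (code, message) pairs found in box-drawn error blocks."""
    inner = []
    for line in log_text.splitlines():
        stripped = line.strip()
        # Keep only content rows of a box: they start and end with a vertical bar.
        if stripped.startswith("│") and stripped.endswith("│"):
            inner.append(stripped.strip("│").strip())

    counts = Counter()
    for i, text in enumerate(inner):
        # Expected row order inside a box: "error", numeric code, message.
        if text == "error" and i + 2 < len(inner) and inner[i + 1].isdigit():
            counts[(int(inner[i + 1]), inner[i + 2])] += 1
    return counts
```

Passing the contents of stderr.log to `count_boxed_errors` gives one entry per distinct error, with the number of processes that reported it.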