GEOS-ESM / swell

Workflow system for coupled data assimilation applications
https://geos-esm.github.io/swell/
Apache License 2.0
15 stars 3 forks source link

(SE) A design choice: Using absolute vs. relative paths #340

Closed Dooruk closed 1 month ago

Dooruk commented 4 months ago

While trying to run tier1 tests with the 3dvar_atmos suite, I received the following error:

FATAL from PE     0: get_ascii_file_num_lines: File /discover/nobackup/gmao_ci/swell/tier1/9003594169/3dvar_atmos/swell-3dvar_atmos-9003594169/stage/fv3-jedi/geos_atmosphere/fv3fil does not exist.

Image              PC                Routine            Line        Source
libifcoremt.so.5   00002B2AA45C928B  tracebackqq_          Unknown  Unknown
libfms.so          00002B2A9608E51E  mpp_mod_mp_mpp_er     Unknown  Unknown

I noticed the length of the missing path is 128 characters and that is likely the culprit. I've had this issue with SOCA (MOM6) as well, it's a limitation of the fms handling .nml files. After talking to @rtodling he mentioned some newer fms versions do support 256/512 characters, however there is some discrepancy between JCSDA and GEOS versions in terms of fms, perhaps ours is newer, @mathomp4?

While designing the Swell tasks, the idea was to be able to execute them from any location, hence we started using absolute paths. However, it's also possible to just use os.chdir command inside the scripts to ensure we are in the proper directory instead of using absolute paths (similar to how it's done in slurm *.sh scripts).

I'm not sure if there are any reasons to avoid relative paths but I'm happy to hear any opposing thoughts?

mathomp4 commented 4 months ago

Hmm. Well, I suppose it wouldn't be impossible to increase that string's length. I mean, I see the 128 in there, but it's also there in current FMS as well.

I could up the length, as long as it doesn't affect work by @climbfuji and @cmgas. I don't think it would...

I've already released 2 FMS's this week, so I could do a third!

ETA: My guess is this wouldn't affect GEOS proper since our input.nml behavior is sort of what FMS expects (even now). So moving from 128 to 1024 shouldn't matter since our file name will be input.nml. I'll try now to confirm.

climbfuji commented 4 months ago

As said above, newer versions of fms support longer strings. Maybe we should spend the effort on updating to fms@2023.04 instead of backporting all these bugfixes/updates?

mathomp4 commented 4 months ago

Well, in this case, even the latest FMS might have issues from my reading of the code. From their main:

    character(len=128) :: filename
...
    filename='input_'//trim(pelist_name)//'.nml'
    inquire(FILE=filename, EXIST=file_exist)
    if (.not. file_exist ) then
       if (present(alt_input_nml_path)) then
          filename = alt_input_nml_path
       else
          filename = 'input.nml'
       end if
    endif

To my reading, if the alt_input_nml_path is longer than 128 characters, this will crush it.

climbfuji commented 4 months ago

Well, in this case, even the latest FMS might have issues from my reading of the code. From their main:

    character(len=128) :: filename
...
    filename='input_'//trim(pelist_name)//'.nml'
    inquire(FILE=filename, EXIST=file_exist)
    if (.not. file_exist ) then
       if (present(alt_input_nml_path)) then
          filename = alt_input_nml_path
       else
          filename = 'input.nml'
       end if
    endif

To my reading, if the alt_input_nml_path is longer than 128 characters, this will crush it.

Wow. In https://github.com/NOAA-GFDL/FMS/pull/1003 we changed the restart length to 256, but apparently there are more of these 128 character limits!

mathomp4 commented 4 months ago

If you are increasing lengths, I recommend 1024 (aka ESMF_MAXPATHLEN). In MAPL/GEOS, we found in some cases that 256 wasn't enough. Deeeeeeeep paths. :)

climbfuji commented 4 months ago

https://github.com/NOAA-GFDL/FMS/issues/1518

Dooruk commented 1 month ago

Well, now this issue is impacting current SWELL Tier 1 tests. Because Tier1 test folders on gmao_ci have gone up another digit from 10 to 11 10012490739 and some of the file paths are getting caught by the 128 character limit.

Wonder if the latest FMS 2023.04 update in JEDI would resolve this issue?

Edit: Looks like FMS 2022.03 has a path length update with FMS_IO, so hopefully that should resolve this issue.

https://github.com/NOAA-GFDL/FMS/releases?page=2

ashiklom commented 1 month ago

What's the timeline on that FMS update? I would really like to have some working integration tests for Swell, because those are holding up other larger development efforts I want to start with Swell.

Or, are there any blocking issues, or issues that would help get that in faster? #345? #360?

Dooruk commented 1 month ago

I have a draft PR in our jedi_bundle wrapper. It's draft because I was able to build soca with the latest JEDI updates which includes FMS merge. fv3-jedi does not build, due to some library object linking issue. We identified an issue with @mathomp4 but there is still something wrong.

I will try one of these CMakelists.txt options from here next:

https://github.com/NOAA-EMC/GDASApp/pull/1221/files

asewnath commented 1 month ago

@ashiklom #345 is waiting on the FMS PR so that BuildJedi issues are resolved

Dooruk commented 1 month ago

I think I figured it out, fv3-jedi is build, I will update the draft PR.

Dooruk commented 1 month ago

Update, I tested our CI workflow with a recent build that uses FMS version 2023.04 and the path length issue unfortunately persists. We have an open issue on FMS but even if they make changes that would be in a future version.

So the options are:

1) We deal with the 128 character limit. 2) We could modify with string sizes in our local FMS 2023.04 build source files. This is obviously complicated as there is no single 'max_path_length` variable, requires jcsda folder access on Discover, and time. 3) ??

mathomp4 commented 1 month ago

Could you "workaround" with symlinks say? Refer to files local to where something is running but all that is is a symlink to a file somewhere else?

Though I suppose even that wouldn't work if the symlink itself is more than 128 chars long...

Dooruk commented 1 month ago

You are right @mathomp4. Option 3 is to change these following key/value pairs taht are used by FMS to local cycle directory:

https://github.com/GEOS-ESM/swell/blob/0ea42f6cd471600af1d9527ee24556d81ad2c7dc/src/swell/configuration/jedi/interfaces/geos_atmosphere/model/geometry.yaml#L2-L3

Similar to how we handle geos_ocean:

https://github.com/GEOS-ESM/swell/blob/0ea42f6cd471600af1d9527ee24556d81ad2c7dc/src/swell/configuration/jedi/interfaces/geos_ocean/model/geometry.yaml#L1-L2

Stage task for atmosphere puts static files in the experiment directory, that way it's one and done.

Stage task for ocean requires cycle time as an input and links static files at each cycle and puts them in the cycle directory. That way it's possible to point to local files.

ashiklom commented 1 month ago

I like the the idea of symbolic links. My suggestion might be to combine symbolic links, relative file paths, and the cwd argument to subprocess.run. That also has other positive externalities, like making it easier to inspect swell tasks as relatively self-contained execution units (e.g., from inside a given task directory, you have everything you need without needing to navigate to parent directories or other parts of the file system).

For linking to static files, a possible solution is to cleverly manipulate relevant environment variables (e.g., LD_LIBRARY_PATH, INCLUDE_PATH) specifically in the context of task execution. E.g., Rather than having a single set of swell modules for the entire swell workflow, we load modules (or otherwise modify the environment) individually for every task.

Dooruk commented 1 month ago

More updates on this:

Ok, figured out how to use relative paths, see below PR.