MetalKnight / woss-ns3

WOSS is a framework that permits the integration of any underwater channel simulator that expects environmental data and provides a channel realization. WOSS integrates the Bellhop ray-tracing program. Thanks to its automation the user only has to specify the location in the world and the time where the simulation should take place.
https://woss.dei.unipd.it

"NetCDF: HDF error" in mobile network scenarios #43

Closed emanuelegiona closed 1 year ago

emanuelegiona commented 1 year ago

Dear colleagues,

I am working on mobile network simulation scenarios leveraging WOSS through this integration module.

Simulation scenario

Using the example as base, I modeled my scenario equipping all nodes with WossWaypointMobilityModel, with a set of 4 nodes having fixed location and a single node roaming through the network.

Fixed nodes location example (CSV)

node_id, latitude, longitude, depth
0, 43.8291, 9.56412, 0.00791293
1, 43.8289, 9.56461, 0.00978492
2, 43.829, 9.56498, 0.0115909
3, 43.8289, 9.56476, 0

Roaming node waypoints file example (CSV):

time_seconds, latitude, longitude, depth
0, 43.8289, 9.56476, 0
10.1001, 43.829, 9.56475, 0
20.2994, 43.829, 9.56473, 0
30.3991, 43.829, 9.56472, 0
40.4995, 43.829, 9.5647, 0
50.6004, 43.829, 9.56469, 0
(...)

Geodesic coordinates from these files are fed to the mobility model via CreateVectorFromCoordZ(), after creating woss::CoordZ objects from the corresponding fields. Depth is passed both to the CoordZ constructor and via CoordZ::setDepth() after each object is created.
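For readers following along, the CSV-to-coordinate step can be sketched in isolation. This is a minimal sketch only: `Waypoint` and `parseWaypointRow` are hypothetical names that do not exist in WOSS or ns-3, and the range checks are an assumption added here to catch malformed rows before they reach the mobility model. In the actual script, the parsed fields would be used to build a woss::CoordZ and passed on to CreateVectorFromCoordZ().

```cpp
#include <sstream>
#include <string>

// Hypothetical plain struct mirroring one row of the waypoints CSV.
// In the real script these fields would be used to build a woss::CoordZ.
struct Waypoint {
  double timeSeconds = 0.0;
  double latitude = 0.0;
  double longitude = 0.0;
  double depth = 0.0;
};

// Parses one "time_seconds, latitude, longitude, depth" row into `out`.
// Returns false for header rows, truncated rows, or out-of-range values,
// leaving `out` untouched in that case.
bool parseWaypointRow(const std::string& row, Waypoint& out) {
  std::istringstream in(row);
  Waypoint wp;
  char c1 = 0, c2 = 0, c3 = 0;
  if (!(in >> wp.timeSeconds >> c1 >> wp.latitude >> c2 >> wp.longitude >>
        c3 >> wp.depth)) {
    return false;  // a field was missing or not numeric
  }
  if (c1 != ',' || c2 != ',' || c3 != ',') return false;

  // Range checks are an assumption of this sketch, not WOSS behaviour.
  if (wp.timeSeconds < 0.0 || wp.depth < 0.0) return false;
  if (wp.latitude < -90.0 || wp.latitude > 90.0) return false;
  if (wp.longitude < -180.0 || wp.longitude > 180.0) return false;

  out = wp;
  return true;
}
```

Calling `parseWaypointRow("10.1001, 43.829, 9.56475, 0", wp)` fills `wp` and returns true, while the header row and the `(...)` placeholder line are rejected.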

WOSS configuration

According to the advice on "high mobility" scenarios (1)(2), two WOSS configurations can be devised:

  1. ResDb + no memory optimization: the WossHelper attributes ResDbFilePath and ResDbFileName are properly defined, and the WossPropModel instance used for WossChannel is created with default attribute values.

  2. No ResDb + memory optimization: following the advice mentioned above, the ResDbFilePath and ResDbFileName attributes of WossHelper are left at their default values, whereas WossPropModel's MemoryOptimization attribute is set to true.

Outcome: error not appearing

Upon executing my simulation, I noticed that the "NetCDF: HDF error" does not appear only if the network is completely static and configured accordingly: i.e. WOSS configuration 1, with the roaming node having just 2 waypoints, one at time 0 and the other at time N, both at the same position (e.g. waypoint 0), effectively making it a static node as well.

Outcome: error consistently appearing

The error instead consistently shows up whenever the following simulation setups are executed:

Activating all WOSS-related debug options via the WossHelper interface does not shed more light on this error, which is only accompanied by the exception source "ncVar.cpp line:1626".

Simulations during which the error occurs actually run for some time first, and then crash upon this error appearing. In order to rule out possible invalid locations of the roaming node, the "effectively static node" setup has been tested across multiple different locations for both WOSS configuration 1 and 2. WOSS configuration 2 was thus identified as problematic, whereas WOSS configuration 1 did not pose issues in the "effectively static node" case.

System setup

All libraries are installed as per instructions, passing all tests.

MetalKnight commented 1 year ago

Hi @emanuelegiona, WOSS 1.12.3 fixes some issues with coordinate conversions, as reported in this changelog. Please try to update both WOSS and woss-ns3 to the latest 1.12.5. After that, if you are still facing the issue:

  1. provide a simple .cpp example so that we can try to reproduce it;

  2. provide the output of the simulator (standard output redirected to a file) with every single debug option in the helper active; this should help in pinpointing the exact call before the NetCDF error;

  3. finally, debug the ns3 example via GDB (check the ns3 wiki on how to do this) and, when the program stops due to the error, run the backtrace command and report the output here.

Thanks

regards

emanuelegiona commented 1 year ago

Thanks @MetalKnight for the quick reply. I updated to WOSS 1.12.5, as well as its ns-3 integration module, however the problem has not been solved. I am still working on the 3.33 simulator version and using the 2020 GEBCO global grid.

If anything, the outcome has worsened; indeed, when executing simulations in the completely static scenario (characterized by fixed roaming node position and no "high-mobility" WOSS configuration), the "NetCDF: HDF error" is thrown, whereas it was working with WOSS 1.12.0 before.

Please find a simulation script acting as Minimum Reproducible Example at this link, as well as the log files you requested at steps 2 and 3:

MetalKnight commented 1 year ago

thanks @emanuelegiona we will check and get back.

In the meantime, could you please:

  4. provide system specs (distro, kernel, gcc version);

  5. check whether the sediment V1 DBs have the same issue. Be aware that you will also need to change SedimentDbDeck41DbType to 0.

emanuelegiona commented 1 year ago

Thanks for looking into it, your help is much appreciated.

Below you can find the additional information you asked for:

  1. Distro: Ubuntu 18.04 LTS (also due to acoustic toolbox gfortran requirements)
     Kernel: 5.4.0-150-generic
     GCC: 7.5.0

  2. Simulation outcomes when using sediment V1:
     a. WOSS configuration "no high-mobility" + roaming node fixed at waypoint 0 position: working
     b. WOSS configuration "no high-mobility" + roaming node: NetCDF: HDF error (ncVar.cpp:1650)
     c. WOSS configuration "high-mobility" + roaming node: NetCDF: HDF error (ncVar.cpp:1650)

Please find attached GDB backtrace files for scenarios in which the error occurs:

P.S. For the sake of complete transparency, the following code changes have been applied to the MRE: a new CLI option --wossSedimVersion has been added, reflecting a similarly named Experiment member field which is used in the following way:

if(m_wossSedimVersion == 1)
  {
    m_wossHelper->SetAttribute("SedimDbCoordFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_coordinates.nc"));
    m_wossHelper->SetAttribute("SedimDbMarsdenFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_marsden_square.nc"));
    m_wossHelper->SetAttribute("SedimDbMarsdenOneFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_marsden_one_degree.nc"));
    m_wossHelper->SetAttribute("SedimentDbDeck41DbType", IntegerValue (0)); // DECK41 V1 database data format
  }
  else if(m_wossSedimVersion == 2)
  {
    m_wossHelper->SetAttribute("SedimDbCoordFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_V2_coordinates.nc"));
    m_wossHelper->SetAttribute("SedimDbMarsdenFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_V2_marsden_square.nc"));
    m_wossHelper->SetAttribute("SedimDbMarsdenOneFilePath", StringValue (m_wossDbsPath + "/seafloor_sediment/DECK41_V2_marsden_one_degree.nc"));
    m_wossHelper->SetAttribute("SedimentDbDeck41DbType", IntegerValue (1)); // DECK41 V2 database data format
  }
  else
  {
    NS_FATAL_ERROR("Experiment::InitialSetup: invalid WOSS sediment version [1 or 2]");
    return;
  }

MetalKnight commented 1 year ago

related to https://github.com/Unidata/netcdf-cxx4/issues/127

emanuelegiona commented 1 year ago

I see you encountered this very same error before and reported on it as well; apologies for not adding to that open issue myself.

However, is there any temporary known workaround you are using and, in case there is none, would introducing a throttling mechanism be helpful, in your opinion? This so-called throttling mechanism would either be implemented by:

  1. keeping a rough counter to the NetCDF calls, reset once a sleep() function is executed upon reaching a certain limit; or

  2. wrapping the failing NetCDF call to handle the exception, thus invoking sleep() and afterwards retrying the execution of the same NetCDF call.

In both solutions, the sleep() duration should be chosen to introduce as little overhead as possible, otherwise simulations that are long-running or large (in terms of nodes) are going to be affected too much.

Additionally, solution 1 depends on an arbitrary limit on NetCDF calls, which you estimated at around 500k; this might be system-dependent. Moreover, the limit should be configurable so that throttling can be turned off, avoiding future code changes once the NetCDF team solves this error.
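To make solution 1 concrete, here is a minimal sketch of such a counter. `CallThrottle` is a hypothetical helper, not part of WOSS or netcdf-cxx4; treating a limit of 0 as "throttling disabled" is an assumption of this sketch, and whether pausing helps at all with this error remains an open question.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>

// Hypothetical helper (not part of WOSS): counts calls and pauses once every
// `limit` calls. A limit of 0 disables throttling entirely.
class CallThrottle {
 public:
  CallThrottle(std::size_t limit, std::chrono::milliseconds pause)
      : m_limit(limit), m_pause(pause) {}

  // Invoke once before each NetCDF read. Returns true if this call paused.
  bool tick() {
    if (m_limit == 0) return false;          // throttling switched off
    if (++m_count < m_limit) return false;   // still below the limit
    m_count = 0;                             // reset the rough counter
    std::this_thread::sleep_for(m_pause);
    return true;
  }

  std::size_t count() const { return m_count; }

 private:
  std::size_t m_limit;
  std::chrono::milliseconds m_pause;
  std::size_t m_count = 0;
};
```

With `CallThrottle throttle(500000, std::chrono::milliseconds(10));`, calling `throttle.tick()` before each read would pause for 10 ms once every 500k calls; both values are placeholders that would need tuning per system.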

Solution 2 instead does not require the extra effort needed for the previous one, also benefiting the entire WOSS codebase for a more robust interaction with NetCDF. Once a solution on the NetCDF part is implemented, the exception handling code would become obsolete for this edge case, but still be useful in the wake of other cases of exceptions. In the latter implementation, the sleep() duration might be computed similarly to backoffs in communication protocols, gradually increasing at each attempt and until a fixed number of maximum attempts. Once the maximum attempts are reached, execution might crash or provide a safe default value (i.e. simple 3-ray model with flat seabed surface).
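A minimal sketch of what solution 2 could look like follows. `retryWithBackoff` is a hypothetical helper, not an existing WOSS or netcdf-cxx4 function; it assumes the failing call throws something derived from std::exception, and it assumes that retrying the same call can succeed at all, which is not established for this error.

```cpp
#include <chrono>
#include <exception>
#include <thread>

// Hypothetical helper (not part of WOSS or netcdf-cxx4): runs `call` up to
// `maxAttempts` times, doubling the sleep after each failure, and returns
// `fallback` if every attempt throws.
template <typename T, typename Fn>
T retryWithBackoff(Fn call, int maxAttempts,
                   std::chrono::milliseconds firstDelay, T fallback) {
  std::chrono::milliseconds delay = firstDelay;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    try {
      return call();  // success: hand the value straight back
    } catch (const std::exception&) {
      if (attempt == maxAttempts) break;   // out of attempts, use fallback
      std::this_thread::sleep_for(delay);  // back off before retrying
      delay *= 2;                          // exponential growth of the delay
    }
  }
  return fallback;
}
```

The wrapped callable would be a lambda around the failing NetCDF read, and `fallback` plays the role of the safe default value mentioned above. Replacing `return fallback;` with a rethrow would give the "crash after the maximum attempts" variant instead.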

MetalKnight commented 1 year ago

@emanuelegiona the problem is that we don't know whether:

  1. this is a NetCDF4 issue, or

  2. this is an HDF5 library issue.

We don't know why the library is throwing an HDF error, so we really don't know whether the simulation can continue after it: even if we catch the error, it is still possible that every subsequent getVar() call will fail. I don't know if a sleep() could help. If this is a logic issue, I don't see how it would.

The first option here is to try building the latest HDF5 and NetCDF4, meaning:

  1. download the latest HDF5 (1.14.2) and build it with the same instructions;

  2. download the latest NetCDF4-C (4.9.2) and build it with the same instructions;

  3. rebuild and relink NetCDF4-C++ against the two newly installed libraries;

  4. rebuild WOSS (preferably 1.12.5) against the latest NetCDF4 library;

  5. rebuild woss-ns3 (preferably 1.12.5) against the latest NetCDF4 library;

and finally check if the issue is still present.

By the way, I encourage you to move to WOSS 1.12.5 which handles coordinates conversion properly.

MetalKnight commented 1 year ago

Hi @emanuelegiona, on my Ubuntu 22.04 machine with gcc 11.4.0 and the recommended libraries (WOSS, woss-ns3, NetCDF4, HDF5, NetCDF4 C++, etc.), and using your example (tweaked to use the standard UAN PHY and run with no command-line arguments), the issue was reproduced. We will check what happens with the latest HDF5 and NetCDF4-C.

MetalKnight commented 1 year ago

@emanuelegiona I can't seem to reproduce the issue with:

How to install.

relaunch the test.

Let me know your results. thanks

emanuelegiona commented 1 year ago

Upgrading such libraries appears to be fixing the crashes.

My tests did not crash in either case: roaming node with the no-high-mobility configuration (identical setup to your execution, with no further CLI arguments) and roaming node with the high-mobility configuration.

Thanks for looking into it.

P.S. On an unrelated note: is there any way to turn off BellhopWoss::checkDepthOffsets() warnings from the ns-3 interface? Even when turning all WOSS debug options OFF, they are still shown. Sorry if this is not the appropriate place to discuss it.

MetalKnight commented 1 year ago

Thanks for confirming this, I will close the issue. The WOSS website has already been updated with the new recommended libraries and installation instructions. That warning is always printed since it tells you that something is not properly configured with the DepthOffset in your test scenario.

I'll see what I can do in the next WOSS release. cheers