ethan-pickering / dnosearch_nature_cs_data

Data for Nature CS publication for discovering extreme events.

Request for clarification on LAMP dataset and code modification #1

Open xietaoluo opened 1 year ago

xietaoluo commented 1 year ago

Hello! I recently obtained the LAMP dataset you published, and I noticed that you also provided code that includes data processing and modifications for optimizing LAMP experiments. While studying the data handling and the code, I ran into some trouble: I don't understand the meaning and content of each feature in the dataset, and I am not sure how to modify the code correctly so that it applies to my own dataset. Could you provide more detailed information about the dataset (for example, the meaning and labels of each feature) as well as specific steps for modifying the code? Thank you!

batsteve commented 1 year ago

First, the suffix -10-40 means that the LAMP wave episodes were constructed from an n=10 dimensional reduced order model (ROM), and each lasted T=40 s in the Eulerian frame. Please see my previous paper in Ocean Engineering (https://doi.org/10.1016/j.oceaneng.2022.112633) for details. In particular, the Ocean Engineering paper expands on both the naval architecture problem and a closely related surrogate modelling approach based on Gaussian Process Regression.

The first three files, DD-10-40.txt, TT-10-40.txt, and VV-10-40.txt, describe the ROM for the wave episodes (the input side of the ML). DD is an ordered list of the Karhunen-Loève eigenvalues, VV is an ordered list of the Karhunen-Loève basis functions, and TT is the grid of time values corresponding to those modes. These files are needed for converting between the ROM and the coefficient representation, which matters for the DeepONet framework.
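
As a rough illustration of that conversion, here is a minimal sketch. It is not the code used for the paper. It assumes that VV stores one basis function per column, and that an episode is the sum of the modes weighted by the coefficient times the square root of the eigenvalue; both the array layout and that scaling convention are assumptions that should be checked against the Ocean Engineering paper. The function names are hypothetical.

```python
import numpy as np


def reconstruct_episode(coeffs, eigvals, modes):
    """Map KL coefficients to a wave time series.

    Assumed layout (not confirmed by the data description):
      coeffs  : (n,)     KL coefficients (one design vector)
      eigvals : (n,)     KL eigenvalues, e.g. from DD-10-40.txt
      modes   : (n_t, n) one basis function per column, e.g. from VV-10-40.txt
    Assumed convention:
      x(t) = sum_i coeffs[i] * sqrt(eigvals[i]) * modes[:, i]
    """
    coeffs = np.asarray(coeffs, dtype=float)
    eigvals = np.asarray(eigvals, dtype=float)
    modes = np.asarray(modes, dtype=float)
    return modes @ (np.sqrt(eigvals) * coeffs)


def project_episode(series, eigvals, modes):
    """Recover KL coefficients from a time series by least squares.

    Least squares avoids assuming that the stored modes are orthonormal.
    """
    eigvals = np.asarray(eigvals, dtype=float)
    modes = np.asarray(modes, dtype=float)
    scaled = modes * np.sqrt(eigvals)  # scale each column by sqrt(eigenvalue)
    coeffs, *_ = np.linalg.lstsq(scaled, np.asarray(series, dtype=float), rcond=None)
    return coeffs
```

With the real files, the arrays would presumably come from `np.loadtxt("DD-10-40.txt")` and `np.loadtxt("VV-10-40.txt")`, possibly with a transpose if the modes turn out to be stored row-wise.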

The output files each have 3000 rows, corresponding to 3000 precomputed simulations. kl-2d-10-40-design.txt is the set of 10D design vectors (the coefficients) corresponding to the precomputed data. kl-2d-10-40-isgood.txt is a sanity check to make sure that the vessel didn't capsize during each simulation. It can be ignored. kl-2d-10-40-pitch.txt is a set of time series that give the pitch angle (in degrees) of the vessel as it passes through the wave episode. We didn't use pitch in "Discovering..." because the pdf of steady state pitch values is closer to Gaussian (and therefore less interesting). kl-2d-10-40-vmbg.txt is a set of time series that give the vessel-integrated Vertical Bending Moment (in Newton-meters) as the vessel passes through the wave episode.
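
For pairing the design vectors with their output time series, a minimal sketch might look like the following. It assumes each file loads as an array with one simulation per row, and that a nonzero isgood entry marks a valid run; the encoding of the flag is an assumption, and since the flag can be ignored the filtering step is optional. The function names are hypothetical.

```python
import numpy as np


def keep_good_runs(design, isgood, outputs):
    """Check row alignment and drop simulations flagged as bad.

    Assumed layout (one simulation per row):
      design  : (n_sims, 10)   e.g. from kl-2d-10-40-design.txt
      isgood  : (n_sims,)      e.g. from kl-2d-10-40-isgood.txt, nonzero = valid (assumed)
      outputs : (n_sims, n_t)  e.g. from kl-2d-10-40-vmbg.txt
    """
    design = np.asarray(design)
    outputs = np.asarray(outputs)
    mask = np.asarray(isgood).reshape(-1) != 0
    if not (mask.shape[0] == design.shape[0] == outputs.shape[0]):
        raise ValueError("design, isgood and outputs must have the same number of rows")
    return design[mask], outputs[mask]


def peak_magnitude(outputs):
    """Largest absolute value of each time series (one number per simulation)."""
    return np.max(np.abs(np.asarray(outputs, dtype=float)), axis=1)
```

With the real files, each argument would presumably come from `np.loadtxt` on the corresponding file name.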

kl-2d-10-40-tt.txt gives the time spacing for the output Lagrangian time series. Because of the trimming I've already done to match the Eulerian and Lagrangian frames, this is vestigial and can be ignored. If you want time values for plotting, just use the (constant) spacing.
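
For plotting, a time axis with constant spacing can be built directly. In this sketch the spacing `dt` is a placeholder that you need to supply yourself; no value is given here, and the function name is hypothetical.

```python
import numpy as np


def time_axis(n_samples, dt, t0=0.0):
    """Uniformly spaced time values: t0, t0 + dt, t0 + 2*dt, ...

    dt is the (constant) sample spacing in seconds; its value is not
    specified in this thread and must be supplied by the user.
    """
    return t0 + dt * np.arange(n_samples)
```

For a single output series `y`, `time_axis(len(y), dt)` gives matching x-values for a plot.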

Further, mc-vbm-bins.txt and mc-vbm-hist.txt give the bin centers and (normalized) histogram densities for the true steady state values of the same quantity from kl-2d-10-40-vmbg.txt. Here, "true" means long-time steady state simulations using the same simulation software and parameters.
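
To compare your own samples against this reference histogram, something like the sketch below may help. It assumes the bin centers are uniformly spaced and that the densities are normalized to integrate to one; both are assumptions about the files, and the function names are hypothetical.

```python
import numpy as np


def density_mass(bin_centers, density):
    """Approximate integral of a histogram density (about 1 if normalized).

    Assumes uniformly spaced bin centers.
    """
    centers = np.asarray(bin_centers, dtype=float)
    width = centers[1] - centers[0]
    return float(np.sum(np.asarray(density, dtype=float)) * width)


def sample_density(samples, bin_centers):
    """Histogram samples onto the same bins as a reference density.

    Samples outside the binned range are dropped by np.histogram, so the
    result is normalized over the binned range only.
    """
    centers = np.asarray(bin_centers, dtype=float)
    width = centers[1] - centers[0]
    # Rebuild bin edges from uniformly spaced centers.
    edges = np.concatenate([centers - width / 2.0, [centers[-1] + width / 2.0]])
    density, _ = np.histogram(np.asarray(samples, dtype=float), bins=edges, density=True)
    return density
```

The reference arrays would presumably come from `np.loadtxt("mc-vbm-bins.txt")` and `np.loadtxt("mc-vbm-hist.txt")`. Plotting both densities on a logarithmic y-axis makes differences in the tails easier to see.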

Finally, 10-40-sigma-n-list.txt is a list of the aleatoric uncertainty parameters \sigma_n for each output mode computed using Gaussian Process Regression. This is a sanity check to compare the uncertainty values produced by different ML techniques.

Ethan, can you add this (more or less verbatim, maybe) to the readme?