OpenDrift / opendrift

Open source framework for ocean trajectory modelling
https://opendrift.github.io
GNU General Public License v2.0
250 stars 120 forks source link

Suggest changing status history into non-masked array always showing actual status #1420

Open nordam opened 1 month ago

nordam commented 1 month ago

At present, the array o.history['status'] is a masked array, and for a given particle, the status is always 0 (active), except in the timestep in which it became deactivated, where the status will be for example 1 for seafloor or 2 for stranded. For all later times, the status is -999 (and also masked).

When plotting a mass balance, it is convenient to know the status of a particle also at times after it was deactivated. It would therefore be handy if the status array actually held the status at all times. The status array doesn't have to be a masked array either, as any status other than 0 means the particle is not active. It could also a status number (-1, for example) representing particles that are not seeded yet.

One can of course create the array I describe post hoc, like so:

# Hack the status array to show actual status
firstlast = np.ma.notmasked_edges(o.history['status'], axis=1)
first = firstlast[0][1]
last  = firstlast[1][1]
status = o.history['status'].data
for i in range(Np):
    status[i,last[i]:] = status[i,last[i]]
status[status == -999] = -1

but is there any reason not to create the array like this in the first place?

knutfrode commented 1 month ago

Yes, this is a bit awkward, but for legacy reasons:

Today, an Xarray Dataset would be an obvious choice for the datamodel. However, at the birth of OpenDrift, Xarray was not very mature, and a masked array was chosen. In the inner datamodel, (self.environment, self.elements as available inside the update() method in any module), these are arrays corresponding to only the elements presently active. Thus the model developer does not need to worry about the bookeeping of which particles have (not yet) been seeded and deactivated etc. This helps keeping the physics/processes of the modules clean and readable - by separating this from the dirty bookeeping (in Basemodel, Basereader etc)..

And only one timestep is kept in memory (though with some caching of 100 timesteps for performance), and flushed to disk after each timestep. But for performance reason, only active elements are written/appended to disk, thus NaN (or -999) is stored for elements that have been deactivated (not only for status, but for all other properties*). The internal import-method compensatses by backfilling data, but yes, this is not ideal when using 3rd-pard applications for analysis.

We hope to make a new inner datamodel based on Xarray Dataset quite soon, and then this issue will be eliminated.

gauteh commented 1 month ago

@knutfrode Do you still have the experiement from our Mjølfjell-Opendrift-Workshop? I think you created an example with xarray that flushes data to disk on demand and handled by xarray?

knutfrode commented 1 month ago

I had almost forgotten that test!

Could it be this test-script? Here I am writing to zarr. https://gist.github.com/knutfrode/20c9a55d564b934ba0a783f476a65ef3