SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

What do simulation tools do when the variables of a data generator or the data generators of an output have different shapes? #77

Closed jonrkarr closed 3 years ago

jonrkarr commented 3 years ago

Because data generators can involve variables from multiple tasks, the variables which contribute to a data generator can have different shapes. Similarly, outputs can involve data generators that have different shapes.

For multiple reasons, this can be the intended behavior. For example, if simulations run until a stop condition, the length of a simulation can be stochastic. This can lead to variables and data generators with different shapes.

I think a sensible behavior for reports is to NaN-pad results. Similarly, data generators can be calculated up to the smallest size of their variables. All other values can be NaN.

Toward consistent behavior across simulation tools, what do other tools do in these cases?

matthiaskoenig commented 3 years ago

In my opinion, the calculation should fail. Variables of a data generator must have identical shapes (or shapes which are just missing certain dimensions, in which case the values are repeated in the calculation to fill up the complete shape). This is similar to multiplying a scalar with a vector, but in higher dimensions. I fail the calculation if the shapes are not consistent.
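A minimal sketch of this "fail unless the shapes are consistent" rule. The function name `combined_shape` is hypothetical, and the choice to align shapes from the right (as NumPy broadcasting does) is an assumption; the comment above does not say which dimensions may be missing.

```python
def combined_shape(shape_a, shape_b):
    """Return the shape of an element-wise result, or raise ValueError.

    Assumption: the shorter shape may only omit leading dimensions, so the
    two shapes must agree when aligned from the right.
    """
    if len(shape_a) >= len(shape_b):
        longer, shorter = tuple(shape_a), tuple(shape_b)
    else:
        longer, shorter = tuple(shape_b), tuple(shape_a)
    offset = len(longer) - len(shorter)
    if longer[offset:] != shorter:
        raise ValueError(f"inconsistent shapes: {shape_a} vs {shape_b}")
    return longer


# A scalar (shape ()) combines with anything, like scalar * vector.
print(combined_shape((5,), ()))
# A vector of length 4 is repeated along the missing first dimension.
print(combined_shape((3, 4), (4,)))
# Two time courses of different lengths are rejected.
try:
    combined_shape((5,), (6,))
except ValueError as error:
    print("failed:", error)
```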

jonrkarr commented 3 years ago

What should happen if the length of a simulation is stochastic? I think this is a case where different shapes are legitimate. In such cases, NaN-padding seems reasonable to me.

matthiaskoenig commented 3 years ago

What I am doing in such cases is storing the original results, e.g., for a repeated stochastic simulation, a list of timecourses with different lengths. I then perform an interpolation step on the time to get nice data matrices with equal length and equal timepoints. This introduces the overhead of interpolation and can result in a large number of timepoints (Nt * Nrepeat).
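A rough sketch of such an interpolation step, under stated assumptions: the common grid is taken to be the union of all recorded time points (which is what makes it grow toward Nt * Nrepeat), interpolation is linear, and values outside a run's time range are held at the nearest end value. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Linearly interpolate (times, values) at t; times must be increasing."""
    if t <= times[0]:
        return values[0]  # assumption: hold the first value before the start
    if t >= times[-1]:
        return values[-1]  # assumption: hold the last value after the end
    i = bisect_right(times, t)  # times[i - 1] <= t < times[i]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def to_common_grid(runs):
    """runs: list of (times, values) pairs. Returns (grid, matrix)."""
    grid = sorted({t for times, _ in runs for t in times})
    matrix = [[interpolate(times, values, t) for t in grid]
              for times, values in runs]
    return grid, matrix


runs = [
    ([0.0, 1.0, 2.0], [0.0, 10.0, 20.0]),
    ([0.0, 0.5, 2.0], [0.0, 4.0, 16.0]),
]
grid, matrix = to_common_grid(runs)
print(grid)    # every row of `matrix` now has len(grid) entries
print(matrix)
```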

It is not defined by SED-ML how to deal with such cases. Similar things will occur in RepeatedTasks of NonUniformTimecourses.

luciansmith commented 3 years ago

I'm not able to envision what the problem is here. If time course lengths are stochastic, then the 'time' variable will match the output, and there will be nothing to pad. It would still be possible to plot everything on a single graph, if you wanted, as long as you matched each time axis with the corresponding outputs. You could also print everything to a report with the appropriate dimensions for everything.

Can someone explain what the issue might be here?

jonrkarr commented 3 years ago

Results of different shapes can arise in multiple ways:

I think this may be addressed by other edits that essentially say all of the above is either not supported or invalid.

luciansmith commented 3 years ago

Would it be worth adding a validation rule that every dataReference in a Report should have the same dimensionality and size? Or will (say) HDF5 handle this without comment?

jonrkarr commented 3 years ago

This started as a question about what other tools are doing so we could try to align BioSimulators with that. I gather this issue has been overlooked by other tools.

> Would it be worth adding a validation rule that every dataReference in a Report should have the same dimensionality and size?

I think that's overly restrictive. I think there are legitimate reasons to have different shapes, such as simulations whose length is variable (e.g., simulations that end at a stop condition rather than at a predetermined time). Instead, datasets can be padded with NaN to the same shape and metadata can be recorded in HDF5 about the original shapes. We have this working.

Similarly, the result of calculations which involve non-existent elements of variables could be defined to be NaN. For example, if X=[1, 2] and Y=[1, 2, 3] then X_{3} could be interpreted to be NaN and R_{3} = X_{3} + Y_{3} could be defined to be NaN. BioSimulators also does this.

> Or will (say) HDF5 handle this without comment?

In my opinion, this isn't an HDF5 issue. The same issue applies to other formats.

I would say something to the effect that (a) calculations on different shaped values should return NaN for elements beyond the size of the smallest input and (b) different shaped data sets can be encoded into tabular and matrix formats (e.g., CSV, TSV, XLSX, HDF5) by NaN-padding each dataset to the size of the largest dataset.
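Rules (a) and (b) can be sketched for one-dimensional results as follows. This is only an illustration using plain Python lists and the X and Y values from the example above; the function names are hypothetical and this is not the BioSimulators implementation.

```python
import math


def add_ragged(x, y):
    """Rule (a): element-wise sum; elements beyond the smallest input are NaN."""
    n_min = min(len(x), len(y))
    n_max = max(len(x), len(y))
    return [x[i] + y[i] for i in range(n_min)] + [math.nan] * (n_max - n_min)


def pad_to_largest(datasets):
    """Rule (b): NaN-pad every dataset to the length of the largest one."""
    n_max = max(len(d) for d in datasets)
    return [list(d) + [math.nan] * (n_max - len(d)) for d in datasets]


X = [1, 2]
Y = [1, 2, 3]
R = add_ragged(X, Y)               # third element is NaN because X_{3} is missing
table = pad_to_largest([X, Y, R])  # three columns of equal length, ready for CSV
for row in zip(*table):
    print(row)
```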

I think it could be ok to say that the X, Y, and Z data generators of curves/surfaces must have the same shape.

(Ideally, we could statically validate all of the above. However, this is complicated because the shape depends on the algorithm and symbol. E.g., spatial simulations may produce multidimensional results, as could symbols for terms such as Jacobians. Static validation would require compiling this information into KiSAO, the BioSimulators database, or elsewhere. If the number of steps is dropped, the shape will also depend on algorithm parameters and the particularities of specific simulations. Instead, we'll have to raise warnings.)

luciansmith commented 3 years ago

I feel kind of iffy about prescribing NaN's everywhere; it seems like this will give you incorrect results in some cases. Take our non-uniform time course example. Imagine we're outputting every time the values change, meaning that the values are unchanged between outputs. If we repeat this twice, we might have output like this:

r1:
time | S1
0.0     5.0
0.8     5.3
1.1     5.7
1.6     6.1
2.0     6.1

r2:
time | S1
0.0     5.0
0.7     5.1
1.4     5.2
1.5     6.3
1.7     6.8
2.0     6.8

If you have a plot that's 'time' by 'S1', you just have all 11 points plotted, with the first five connected by a line and the next six connected by a line.

If you have a 2D report, I can imagine wanting a couple versions of the output, one with sorted 'time':

time | S1,r1 | S1,r2
0.0     5.0      5.0
0.7              5.1
0.8     5.3
1.1     5.7
1.4              5.2
1.5              6.3
1.6     6.1
1.7              6.8
2.0     6.1      6.8

Or you might sort them by order of output:

time | S1,r1 | S1,r2
0.0     5.0
0.8     5.3
1.1     5.7
1.6     6.1
2.0     6.1 
0.0             5.0
0.7             5.1
1.4             5.2
1.5             6.3
1.7             6.8
2.0             6.8

In either case, 'NaN' is the wrong way to fill in the blanks: you actually know what the values in the blanks are, since you know how the data was collected and why. I would naturally want to leave them blank, since that means 'we don't know' to me, while 'NaN' means 'we know it is impossible to know this'. (I guess it's the difference between agnosticism and atheism ;-)
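To make the point concrete: if values really are unchanged between outputs, the blanks in the time-sorted table can be filled by holding the last recorded value. The sketch below is one reading of that remark, not something the thread agreed on; the function name is hypothetical and each run is assumed to be sorted by time.

```python
r1 = [(0.0, 5.0), (0.8, 5.3), (1.1, 5.7), (1.6, 6.1), (2.0, 6.1)]
r2 = [(0.0, 5.0), (0.7, 5.1), (1.4, 5.2), (1.5, 6.3), (1.7, 6.8), (2.0, 6.8)]


def merge_hold_last(*runs):
    """Merge runs onto the sorted union of their times, holding the last value."""
    times = sorted({t for run in runs for t, _ in run})
    rows = []
    for t in times:
        row = [t]
        for run in runs:
            # Values recorded at or before t; the most recent one still holds.
            seen = [value for time, value in run if time <= t]
            row.append(seen[-1] if seen else None)
        rows.append(row)
    return rows


for time, s1_r1, s1_r2 in merge_hold_last(r1, r2):
    print(time, s1_r1, s1_r2)
```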

I don't have any good ideas for better ways to resolve this, but in any event, I'm happy to not put in a new validation rule if you feel it works fine as-is. I did already put in a rule that plot/surface data has to match x/y and x/y/z.

I don't really like suggesting that everyone fill blanks with NaNs, though I do think it's fine if BioSimulators does it.

jonrkarr commented 3 years ago

We're using NaN differently. If there are two simulations with two different timelines they could get encoded like this.

time-1   S1,r1   time-2   S1,r2
0.0      5.0     0.0      5.0
0.8      5.3     0.7      5.1
1.1      5.7     1.4      5.2
1.6      6.1     2.1      6.3
2.0      6.1     2.4      6.8
2.4      6.1     3.1      6.8
2.8      6.1     3.8      6.8
3.2      6.1     NaN      NaN
3.6      6.1     NaN      NaN
4.0      6.1     NaN      NaN

Under our scheme, NaNs are only added as padding at the end, and we don't interpolate as you outline.

Using this effectively requires multiple datasets for each time. An even clearer way to deal with this is to separate results for different timelines to different reports -- this isn't required, but it could be a good best practice.

You implicitly outline special treatment for the time symbol. I have thought about this, but concluded that (a) SED-ML doesn't currently recognize anything like this, (b) ideally, time would be handled like everything else -- no special case to complicate things, and (c) some languages such as CellML don't use the time symbol.

If time got special interpolative treatment, NaNs in the middle of results wouldn't have to be interpreted as you suggest. But I agree that a scheme with NaNs in the middle could be confusing, because it deviates from the conventions used by tools such as matplotlib and MATLAB.

luciansmith commented 3 years ago

I added the following to the end of the 'DataGenerator' section, trying to

"It is left up to interpreters how to store or output ‘ragged’ matrices, where the data in some dimensions might not have the same lengths as each other. One practice is to leave the data in this uneven state; another option is to fill out the ‘missing’ data with NaNs. The only requirement is that mathematical operations should not be affected by this choice. For example, the ‘mean’ of a vector should be the same whether or not it was extended with NaNs."

jonrkarr commented 3 years ago

Sounds good.
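The requirement in the added text, that the mean of a vector be the same whether or not it was extended with NaNs, can be sketched with a NaN-skipping mean. The function name is hypothetical, and the sketch assumes every NaN is padding; it cannot tell padding apart from a NaN that a simulation genuinely produced.

```python
import math


def mean_ignoring_padding(values):
    """Mean over the non-NaN entries; NaN if there are none."""
    real = [v for v in values if not math.isnan(v)]
    return sum(real) / len(real) if real else math.nan


ragged = [5.0, 5.3, 5.7]
padded = ragged + [math.nan, math.nan]

# Both storage choices give the same mean.
print(mean_ignoring_padding(ragged), mean_ignoring_padding(padded))
# A naive mean over the padded vector would be NaN instead.
print(sum(padded) / len(padded))
```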

Conceptually, BioSimulators does treat things something like a "ragged right matrix". We only insert NaNs when things of different shapes need to be concatenated together. Namely when data generators with different shapes are compiled into a report and saved to CSV or HDF5. When we do this with HDF5, we record the shape of each data generator so their raggedness can be recovered out of HDF5 later. (We don't do this with CSV because there's no standard place to put this information.)
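A simplified sketch of that round trip for one-dimensional data generators. A plain dictionary stands in for the metadata stored in HDF5, and the names `pack`, `unpack`, and `lengths` are hypothetical; this is not the actual BioSimulators code.

```python
import math


def pack(data_generators):
    """NaN-pad each data generator to a common length and record its length."""
    n_max = max(len(v) for v in data_generators.values())
    columns = {name: list(v) + [math.nan] * (n_max - len(v))
               for name, v in data_generators.items()}
    lengths = {name: len(v) for name, v in data_generators.items()}
    return columns, lengths


def unpack(columns, lengths):
    """Recover the ragged data by trimming each column to its recorded length."""
    return {name: column[:lengths[name]] for name, column in columns.items()}


original = {"time_1": [0.0, 0.8, 1.1], "S1_r1": [5.0, 5.3, 5.7],
            "time_2": [0.0, 0.7], "S1_r2": [5.0, 5.1]}
columns, lengths = pack(original)
print(columns["S1_r2"])                        # padded to length 3
print(unpack(columns, lengths) == original)    # raggedness is recoverable
```

Without the recorded lengths, as in the CSV case, a reader can only guess that trailing NaNs are padding.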

One issue that might still be unclear is what should be done when a calculation in a data generator involves variables with different shapes. This is a little different than mean vs nanmean. e.g.,

data generator = A + B,

where A has length x = l and B has length y < x. In such cases, I think the result at index l, A(l) + B(l), should have the value NaN.

luciansmith commented 3 years ago

Would you be OK with that being illegal? The other option is to find a situation where it would be helpful, and decide what to do based on what makes sense in that situation. But I'm coming up blank trying to think of an example where you have this problem.

jonrkarr commented 3 years ago

This can arise by defining two tasks with simulations of different lengths and writing a data generator that adds their results together. I can't come up with a great reason to do this in the context of the relatively simple things that can be expressed with SED-ML, but I don't think this is something that couldn't have a legitimate use. I can envision more complicated scenarios that would produce different-shaped variables, but they'd likely need to be implemented outside of SED-ML, just using SED-ML for individual simulations.

Different-shaped variables could become more of an issue if UniformTimeCourse becomes TimeCourse and the number of timepoints and/or intervals becomes variable. For one, it will be difficult to validate that variables have the same size outside the context of specific simulation algorithms or potentially outside specific simulation executions. For example, if TimeCourse is used to record events rather than fixed steps, then it will not be possible to know the length ahead of time. Averaging over events could be reasonable.

luciansmith commented 3 years ago

OK, fair enough! Added the following to the MathML section (keeping a bit for context):

"...would only be valid if M and N have the same dimensions, and R_{i,j} would be equal to M_{i,j} + N_{i,j}. If the lengths of the dimensions are not equal (i.e. if M_{i,j} exists but N_{i,j} does not), the missing value should be assumed to be NaN (not a number)."

jonrkarr commented 3 years ago

If you prefer the "ragged right" way of thinking of it, data generators which involve variables of different shapes should produce results whose shape is the smallest of the shapes of the associated variables. All computations beyond that (involving elements not defined in one or more variables) don't produce results because they are ill-defined.
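A short sketch of how the "ragged right" view relates to the NaN view for one-dimensional variables; the function names and sample values are hypothetical. Truncating to the smallest input gives the same numbers as the NaN-filled result with its trailing NaNs dropped, assuming the inputs contain no NaNs of their own.

```python
import math


def add_truncated(a, b):
    """'Ragged right' view: the result is as long as the shortest input."""
    return [x + y for x, y in zip(a, b)]  # zip stops at the shorter input


def drop_trailing_nans(values):
    """Remove NaNs at the end of a NaN-filled result."""
    end = len(values)
    while end > 0 and math.isnan(values[end - 1]):
        end -= 1
    return values[:end]


A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0]

truncated = add_truncated(A, B)
nan_filled = truncated + [math.nan] * (len(A) - len(B))
print(truncated)
print(drop_trailing_nans(nan_filled) == truncated)
```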

luciansmith commented 3 years ago

I think this should let downstream users either drop or keep the NaNs as they see fit. If anyone ever comes up with a good use case for it, I'm going to bet that they'll also have a preference about which way to interpret the data that works in that context.