biosimulations / biosimulations

A platform for sharing and reusing biomodeling studies including models, simulations, and visualizations of their results
https://biosimulations.org
MIT License

Handle reports larger than 16 MB #1766

Closed bilalshaikh42 closed 3 years ago

bilalshaikh42 commented 3 years ago

The same limitation in #1762 applies to each report as well.

We will need to find a way to break up the results files into multiple database entries if they are greater than 16 MB per report. The simplest solution would be to split per output variable, but this would be incredibly inefficient for cases with many variables that could have fit into a single document.

One potential pattern that Mongo encourages is binning, which we could perhaps use. The bins would be created based on the length of each variable's array, as a function of the number of outputs. Each report id could be appended with a number, and a flag in the API would indicate the need to pull additional documents to obtain the complete dataset.
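
A minimal sketch of what such a binned layout could look like, just to make the idea concrete (the `ReportBin` shape, its field names, and the `assembleReport` helper below are hypothetical, not an existing schema):

```ts
// Hypothetical shape of a binned report document; each bin holds as many
// variable slices as fit under MongoDB's 16 MB document limit.
interface ReportBin {
  reportId: string;      // SED-ML report id, e.g. "report_1"
  binIndex: number;      // 0-based position of this bin
  totalBins: number;     // flag for the API: > 1 means more bins must be fetched
  values: {
    variableId: string;  // SED-ML variable / data set id
    data: number[];      // slice of the result array stored in this bin
  }[];
}

// Reassemble a complete report from its bins, assuming each variable's
// slices are split in binIndex order.
function assembleReport(bins: ReportBin[]): Map<string, number[]> {
  const results = new Map<string, number[]>();
  for (const bin of [...bins].sort((a, b) => a.binIndex - b.binIndex)) {
    for (const { variableId, data } of bin.values) {
      results.set(variableId, (results.get(variableId) ?? []).concat(data));
    }
  }
  return results;
}
```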

bilalshaikh42 commented 3 years ago

@jonrkarr what are your thoughts on this? I am strongly in favor of making this work within the MongoDB framework because of the tools it would give us for analysis. But I wonder if I am looking at this too narrowly, since I am familiar with MongoDB but not with other, more "big-data" approaches.

@moraru How does VCell store large result sets? Are they just files, or is it possible to do server-side analysis that can be exposed to clients via an API?

jonrkarr commented 3 years ago

First, I think there are two separate questions here:

a. The format that simulators produce data in
b. The format that the database uses to serve slices of data

I think we should stick with HDF5 for (a).

I don't have any preconceptions about (b). Saving each dataset to a separate document could work. That would give a 16 MB limit per dataset, which is still finite, but I think pretty big. Essentially, this amounts to flattening out the data structure.

How to store such data is, I think, still an open topic. Often, the storage of such numerical data is managed ad hoc in a file system. The data could be stored in a columnar format such as Parquet that is well-suited to parallel processing. I think Dask, Feather, and Arrow all fall into this category. These formats are motivated by data processing rather than data management.

Some potential solutions for managing the data other than MongoDB:

bilalshaikh42 commented 3 years ago

Yes, this question assumed HDF5 as the output and was only open regarding part (b). By "dataset" are you referring to each SED-ML report? Would this not be limiting for very long simulations with many timepoints? The report could contain multiple variables for the output, with each variable limiting the total number of timepoints possible. I am not sure what the "typical" and "possible" sizes could be for the files.

jonrkarr commented 3 years ago

In SED-ML terminology, a "dataset" is a row of an HDF5 file. For non-spatial models, 16 MB divided by 4 bytes per time point would support 4 million timepoints -- plenty. However, for a spatial model with a 100 x 100 x 100 grid, this would only support 4 time points. Seems like a problem ...

moraru commented 3 years ago

@bilalshaikh42 VCell has a strongly bimodal distribution of result-set sizes: most non-spatial data is in the MB range, while most spatial data is in the GB range. What we do is: for all non-spatial sims, we push the entire dataset to the client, and all processing/filtering/export for whatever purpose (vis/save/etc.) happens client-side; for all spatial sims, we do processing and data reduction server-side, pushing to the client only what needs to be displayed, with filtered or full download available as a separate option.
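
As a rough illustration of that bimodal split (the threshold constant and function below are purely hypothetical, not VCell code):

```ts
// Hypothetical routing of result requests based on result-set size,
// mirroring the bimodal strategy described above.
const CLIENT_SIDE_LIMIT_BYTES = 50 * 1024 * 1024; // assumed cutoff, e.g. 50 MB

function shouldProcessServerSide(resultSetSizeBytes: number, isSpatial: boolean): boolean {
  // Non-spatial (MB-range) results: ship the whole dataset to the client.
  // Spatial (GB-range) results: reduce/slice server-side and send only
  // what needs to be displayed.
  return isSpatial || resultSetSizeBytes > CLIENT_SIDE_LIMIT_BYTES;
}
```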

moraru commented 3 years ago

This is also done in conjunction with check-pointing while running: non-spatial solvers update a single binary result set through the entire run (currently a file, but it could be a DB entry), while spatial solvers output periodically to sequential output streams (again, currently files, but they could be objects or DB entries).

moraru commented 3 years ago

Using HDF5 would bring not only scalability but also great flexibility in what is done server-side vs. client-side.

moraru commented 3 years ago

> In SED-ML terminology, a "dataset" is a row of an HDF5 file. For non-spatial models, 16 MB divided by 4 bytes per time point would support 4 million timepoints -- plenty. However, for a spatial model with a 100 x 100 x 100 grid, this would only support 4 time points. Seems like a problem ...

@jonrkarr not sure I follow. Leaving the spatial solvers aside for now, the output of an average-size model with 50 variables stored as doubles uses 400 bytes per timepoint. While that still leaves room for 10,000 timepoints within 16 MB, which should be OK for "normal" uniform time course output requests, there are a number of issues with it:

(i) for variable-time-step solvers, the uniform output is usually sampled from the actual output, which is much larger;
(ii) it will usually not be enough anyway if asked to produce the full output (which is generally about 1-2 orders of magnitude larger, because most simulations are initialized far from equilibrium and are quite stiff, at least initially); and
(iii) of course, a silly error can go bonkers easily (see BMDB model 802 from the test suite, which produces a 1.5 GB dataset).
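
Spelling out that back-of-the-envelope arithmetic (assuming MongoDB's 16 MiB document limit and 8-byte doubles; the constants are only for illustration):

```ts
// Rough sizing for a non-spatial report stored as one MongoDB document.
const DOC_LIMIT_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MiB BSON limit
const BYTES_PER_DOUBLE = 8;
const NUM_VARIABLES = 50;                 // the "average size" model above

const bytesPerTimepoint = NUM_VARIABLES * BYTES_PER_DOUBLE; // 400 bytes
const maxTimepointsPerDocument = Math.floor(DOC_LIMIT_BYTES / bytesPerTimepoint);
// ≈ 41,900 timepoints, so a 10,000-timepoint uniform time course fits,
// but a full (non-uniform) output 1-2 orders of magnitude larger does not.
```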

jonrkarr commented 3 years ago

Mongo's 16 MB limit is per document. If we put results in MongoDB, we could use one document per SED-ML dataset (the trajectory of a single species in a single simulation). For a model with 50 variables, we would use 50 MongoDB documents. This would maximize the size of the results from a COMBINE archive that could be put into MongoDB.
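
A sketch of what such a flattened, per-dataset document could look like (the field names below are illustrative, not the actual biosimulations schema):

```ts
// Hypothetical flattened layout: one MongoDB document per SED-ML dataset,
// i.e. one variable's trajectory from one report of one simulation run.
interface DatasetDocument {
  simulationRunId: string; // which COMBINE archive execution this came from
  reportId: string;        // SED-ML report the dataset belongs to
  datasetId: string;       // SED-ML dataSet / variable id
  label: string;           // human-readable label for plotting
  values: number[];        // the trajectory itself, ≤ 16 MB per document
}
```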

Assuming each time point is a float, 16 MB could accommodate 4,000,000 time points for a non-spatial model. A lot!

However, a spatial model could easily use 4,000,000 floats for just a few time points of a single variable. A problem!
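
A quick check of the spatial numbers (assuming the same 16 MiB document limit and raw 4-byte floats, ignoring BSON per-element overhead):

```ts
// Spatial sizing: one document per variable, one grid of values per time point.
const DOC_LIMIT_BYTES = 16 * 1024 * 1024;  // 16 MiB BSON limit
const BYTES_PER_FLOAT = 4;                 // single-precision values
const GRID_VALUES = 100 * 100 * 100;       // 100 x 100 x 100 mesh

const maxSpatialTimePoints = Math.floor(
  DOC_LIMIT_BYTES / (GRID_VALUES * BYTES_PER_FLOAT),
);
// = 4 time points per variable per document -- hence the problem.
```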

bilalshaikh42 commented 3 years ago

Got it. So based on this, we will definitely need to move to another data model for the spatial models. But for the non-spatial ones, a one-dataset-per-document approach could work, though this is really not how Mongo was intended to be used, so we would probably see much better performance, scalability, and fewer implementation bugs with some other database.

@moraru for the server-side processing of results that VCell does, are you using file storage? As in, is the application keeping track of each simulation's result output paths? Or is there some form of object storage available on the HPC?

moraru commented 3 years ago

OK, we can trim them to floats and break them down that way and store them as separate docs. We could even break down spatial datasets so they fit (one timepoint/variable combo per doc using floats would allow mesh sizes up to 4 million, which would cover maybe 99% of "real-life" usage for spatial sims from what I've seen). But the bigger problem is that I don't think Mongo scales as well as advertised; once you have millions of documents that need to be correlated by timepoint, variable, simulation, user, etc., from a PB-size database, it is no fun. I would strongly advocate keeping only metadata in Mongo and the actual data separately in HDF5 (object, file, whatever).

jonrkarr commented 3 years ago

Correct, MongoDB is not intended to have huge numbers of documents. That's the issue that Bilal mentioned.

I think we should consider either a hybrid HDF5/Mongo approach, where MongoDB stores pointers to HDF5 files, or something like a tabular database. The fact that there's no clear technical solution is, I think, one reason why this kind of data is typically stored ad hoc. Unfortunately, I don't have experience with tabular databases. We'd need to explore.
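
A minimal sketch of the hybrid idea, with MongoDB holding only metadata plus a pointer to an HDF5 object stored elsewhere (the collection name, field names, and `findResultsPointer` helper are assumptions, not an existing API):

```ts
import { MongoClient } from 'mongodb';

// Hypothetical metadata record: MongoDB stores where the data lives,
// not the data itself.
interface ResultsPointer {
  simulationRunId: string;
  reportId: string;
  hdf5Uri: string;     // e.g. an object-storage URI of the results .h5 file
  hdf5Path: string;    // path of the dataset inside the HDF5 file
  sizeBytes: number;   // useful for deciding server- vs client-side slicing
}

// Look up the pointer for a report; actually reading/slicing the HDF5 file
// would be delegated to a separate service that understands HDF5.
async function findResultsPointer(
  mongoUri: string,
  simulationRunId: string,
  reportId: string,
): Promise<ResultsPointer | null> {
  const client = new MongoClient(mongoUri);
  try {
    await client.connect();
    return await client
      .db('biosimulations')
      .collection<ResultsPointer>('resultsPointers')
      .findOne({ simulationRunId, reportId });
  } finally {
    await client.close();
  }
}
```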

jonrkarr commented 3 years ago

This is done, except for the debugging outlined in other issues.