copcio / copcio.github.io

Geospatial, compressed, range-readable, LAZ-compatible point cloud format.
https://copc.io
MIT License

Dimension statistics #19

Closed connormanning closed 3 years ago

connormanning commented 3 years ago

Nowadays the Entwine builder adds detailed dimension statistics to its schema (see here for example) including minimum, maximum, mean, stddev, and variance. Currently this is sort of an undocumented extension, intended to eventually be codified as an optional (in order to be backward-compatible) extension to the EPT specification. Would it be worth specifying a statistics VLR to capture this information? I think the "number of points by return" array in the LAS 1.4 header adds precedent for this kind of thing.

An example might be:

struct BucketItem
{
  double value; // Maybe int64_t here - presumably this is only for integral dimensions
  uint64_t count;
};
struct CopcStatistics
{
  double minimum;
  double maximum;
  double mean;
  double stddev;
  double variance;
  uint64_t number_of_buckets; // 0 for most dimensions, but this one is nice for Classification counts
  BucketItem buckets[]; // number_of_buckets entries follow
};

These statistics would then be stored in the order that the dimensions appear in the point data record format header, followed by statistics for extra-bytes dimensions in the order that they appear. Is there enough demand for this type of information to put it in the spec? Enough to require it?
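Note that the buckets make each record variable-length, so a reader would have to walk the VLR sequentially. A minimal parsing sketch (assuming a packed little-endian layout and a reader-side std::vector for the trailing array; none of this is specified yet):

#include <cstdint>
#include <istream>
#include <vector>

struct BucketItem { double value; uint64_t count; };

struct Stats
{
  double minimum, maximum, mean, stddev, variance;
  std::vector<BucketItem> buckets; // reader-side stand-in for the trailing array
};

// Read one variable-length record; call repeatedly, once per dimension,
// in the same order the dimensions appear in the point record format.
Stats readOne(std::istream& in)
{
  Stats s;
  auto rd = [&](auto& v) { in.read(reinterpret_cast<char*>(&v), sizeof v); };
  rd(s.minimum); rd(s.maximum); rd(s.mean); rd(s.stddev); rd(s.variance);
  uint64_t n = 0;
  rd(n);
  s.buckets.resize(n);
  for (auto& b : s.buckets) { rd(b.value); rd(b.count); }
  return s;
}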

One example of their usage I found is wonder-sk/point-cloud-experiments#60: "QGIS implementation is able to read those stats and use them to set up renderers (min/max values are especially useful to correctly set ranges for rendering)".

If anyone has other concrete use-cases where such data would be used, that would be useful.

hobu commented 3 years ago

Require this, or make it optional?

wonder-sk commented 3 years ago

+1 to include statistics in COPC. It is just a few bytes extra in the output file, and the computation overhead for the writer is negligible too, yet it makes things much easier for readers like QGIS - no need to run heuristics to estimate the range of values for dimensions...

evetion commented 3 years ago

+1

I'd propose to make these dimension statistics applicable to a voxel/page as well, so you get virtual points. With a quadtree layout these would be comparable to overviews in COGs. These overviews are quite useful for quick visualisation or to be ignored in certain analyses. Probably of interest to @m-schuetz if used in progressive rendering.

edit: I'd say these statistics could be required on a file level (like we already store the min/max of the coordinates) and optional when applied to the remaining voxels, but if applied, it needs to be present for all voxels in the hierarchy.

hobu commented 3 years ago

@evetion Could you please provide some specification language or a struct definition that implements voxel stats? Could you provide some concrete use scenarios for these statistics? This seems more like a 'feature' than a 'need'.

My counterpoints:

It is a design goal of COPC to be as simple as possible, have only what is needed for clients to be able to operate, and not to lard the spec up with features that most end consumers of COPC will not end up using. I'm not saying that voxel stats is fatback, but the case hasn't been made yet.

evetion commented 3 years ago

Thanks for the feedback. As a developer I can appreciate a lean standard instead of yet another VLR/compatibility/check.

They are 'nice to have', but for what, exactly?

The whole point of these Cloud Optimized formats is to selectively read parts of interest instead of needing to download the whole file. The current draft only allows those selections based on the x, y or z dimension. Point clouds are much more dimensional than that. For example, I'd like to select only points for a certain datetime range (multiple campaigns in a single .laz) or those with high intensity/r/g/b values (specific infrastructure), or those with at least 2 returns (vegetation).

They consume a lot of disk space

Say we store only the bounds (min/max) of each dimension for each voxel; that amounts to the size of two points. Doing the same for parent voxels in an octree adds roughly 33%, so call it 3 points. As long as voxels store on the order of a thousand points or more, which is common, that is less than a percent of the uncompressed data, and about 3% of the compressed data (assuming laszip gets to 10% of the original size). Note that COGs require overviews, taking a 33% size increase for granted.
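To spell that arithmetic out (every input below is just the assumption stated above, not anything from a spec):

#include <cstdio>

int main()
{
  double stats_points = 2.0 * 4.0 / 3.0; // min/max ~ 2 points per leaf, +1/3 octree overhead ~ 3 points
  double vs_raw = stats_points / 1000.0; // ~0.003 of raw data at 1000 points per voxel
  double vs_laz = vs_raw / 0.10;         // ~0.03 of a LAZ that is 10% of raw size
  std::printf("%.1f%% of raw, %.1f%% of laz\n", vs_raw * 100, vs_laz * 100);
}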

Accumulating them while building a really large COPC file might be difficult

These statistics can (or at least should) be aggregated bottom-up from the leaf voxels that hold points (parent_dimension_stat = stat(childa_dimension_stat, childb_dimension_stat, ...)), without keeping any points in memory; only the child statistics are needed. On huge (as in span, not necessarily points) COPC files this could be difficult, but the same goes for keeping the actual octree page structure in memory, and any streaming-like solution for that would work for the statistics as well.
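A sketch of such a pairwise merge for the stats proposed earlier (DimStats is hypothetical and population variance is assumed; the combination formulas themselves are the standard parallel ones):

#include <algorithm>
#include <cstdint>

struct DimStats
{
  double minimum, maximum, mean, variance;
  uint64_t count;
};

// Combine two children into their parent without touching any points.
DimStats merge(const DimStats& a, const DimStats& b)
{
  DimStats out;
  out.count = a.count + b.count;
  out.minimum = std::min(a.minimum, b.minimum);
  out.maximum = std::max(a.maximum, b.maximum);
  const double delta = b.mean - a.mean;
  out.mean = a.mean + delta * b.count / out.count;
  // Sum of squared deviations from each side, plus the between-means term.
  const double m2 = a.variance * a.count + b.variance * b.count
                  + delta * delta * a.count * b.count / out.count;
  out.variance = m2 / out.count;
  return out;
}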

If they are optional, clients will have to support both ways

Let's make it required then.

Could you please provide some specification language or a struct definition that implements voxel stats?

It could follow the example by @connormanning, but with multiple consecutive objects in the same order as the pages and their entries. Since that would require offsets (due to the unknown length of buckets), I'd first propose a simpler version:

struct DimensionBounds
{
  double minimum;
  double maximum;
};

struct EntryStats
{
    DimensionBounds bounds[number_of_dimensions_in_pointformat];
    // VoxelKey key; implicit by using the same order as the entries
};

With the assumption that readers will read the complete octree into memory (or at least until they find the VoxelKey of interest), one can count the number of entries read so far, subtract 1, and multiply by number_of_dimensions_in_pointformat * 16 (the size of DimensionBounds) to get the offset of the corresponding EntryStats in the VLR.
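That arithmetic in code (a restatement, not a specification; the names are illustrative and at least one entry is assumed to have been read):

#include <cstddef>

struct DimensionBounds { double minimum, maximum; };

size_t statsOffset(size_t entriesReadSoFar, size_t dimsInPointFormat)
{
  // Zero-based index of the current entry times the fixed per-entry size.
  return (entriesReadSoFar - 1) * dimsInPointFormat * sizeof(DimensionBounds); // 16 bytes per dimension
}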

evetion commented 3 years ago
struct CopcStatistics
{
 ...
 double mean;
 double stddev;
 double variance;
 uint64_t number_of_buckets; // 0 for most dimensions, but this one is nice for Classification counts
 BucketItem buckets[];
}

@connormanning I absolutely do see the use case for a minimum and a maximum (see above), but do you know of use cases for the mean, stddev, etc.?

m-schuetz commented 3 years ago

I absolutely do see the use case for a minimum and a maximum (see above), but do you know of use cases for the mean, stddev, etc.?

With recent years' baby steps towards support for arbitrary attributes in formats and viewers, statistical data can be handy to provide useful defaults for the visualization of attributes with unknown semantics or ranges. I'm already using min and max in Potree for the gradient ranges of attributes, but that approach frequently fails if there are outliers. For example, using min and max for intensity often breaks if the majority of samples are between 0 and 1,000, but a couple of samples with values over 10,000 screw up the range.
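With the proposed stats a viewer could instead fall back to mean ± k·stddev, clamped to [min, max], which is far less sensitive to a few extreme samples. A hypothetical sketch (the names and the choice of k are mine, not part of any proposal):

#include <algorithm>

struct Stats { double minimum, maximum, mean, stddev; };

struct Range { double lo, hi; };

// Derive a gradient range from the summary stats instead of raw min/max.
Range gradientRange(const Stats& s, double k = 2.0)
{
  return { std::max(s.minimum, s.mean - k * s.stddev),
           std::min(s.maximum, s.mean + k * s.stddev) };
}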

hobu commented 3 years ago

What does CopcStatistics mean for discrete variables like Classification or PointSourceId? Does a client want a discriminator to know that an attribute is discrete?

m-schuetz commented 3 years ago

It would be a very useful optional hint, especially in order to distinguish between scalars and enums/integers/flags. I was planning to provide materials for several standard mappings from attributes to colors such as:

Perhaps there are some more that might be useful.

m-schuetz commented 3 years ago

uint64_t number_of_buckets; // 0 for most dimensions, but this one is nice for Classification counts
BucketItem buckets[];

This might also be very useful for gps-time. The sparseness between flight-lines makes it a bit tough to implement useful gps-time filters via sliders. If some empty gps ranges between consecutive flight-lines can be identified with the counts, then the slider could ignore those. It wouldn't be perfect since it likely misses many gaps, but it may or may not help a bit.
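As a sketch of what a viewer could do with such counts (this assumes buckets are sorted by value and each covers a fixed width, neither of which is specified anywhere):

#include <cstdint>
#include <vector>

struct BucketItem { double value; uint64_t count; };
struct Segment { double lo, hi; };

// Collapse non-empty, adjacent gps-time buckets into slider segments,
// skipping the empty stretches between flight-lines.
std::vector<Segment> sliderSegments(const std::vector<BucketItem>& buckets, double width)
{
  std::vector<Segment> out;
  for (const auto& b : buckets)
  {
    if (b.count == 0) continue;
    if (!out.empty() && out.back().hi >= b.value)
      out.back().hi = b.value + width;            // contiguous: extend the segment
    else
      out.push_back({b.value, b.value + width});  // gap found: start a new segment
  }
  return out;
}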

hobu commented 3 years ago

Point clouds are much more dimensional than that.

A fair point, but are LAS files often queried and segmented by dimensions other than XYZ? Maybe a little bit, but I'm not convinced that the complication of voxel statistics on the writer side is worth the cost to everything else, for a feature that only a few really sophisticated clients can benefit from.

We are not building the "ultimate cloud friendly point cloud format" here. We are building "a LAZ file that can behave reasonably for incremental, spatially segmented remote access".

CCInc commented 3 years ago

Personally, I have not had an issue using EPT without additional statistics w.r.t. Potree and other libs. The one exception was gps-time: when implementing my Potree EPT reader I had to store the gps-time min/max in the file metadata.

hobu commented 3 years ago

It feels like there's support for a VLR describing statistics about the entire domain of points (not per-voxel) that is required. I have some questions about its composition:

Would it be better to store all the CopcStatistics objects in one VLR?

Must all fixed LAS dimensions be described with entries?

Must all extra bytes dimensions be described?

Should the entry denote whether a dimension's kind is enumeration, continuous or discrete?

What is the meaning of "mean" and "stddev" for a discrete variable or enumeration?

Any other questions? Rendering-type folks, please chime in.

gui2dev commented 3 years ago

+1 Stats can be useful to estimate how many points you'll get for a given spatial request, or how they might be distributed. Having more stats and more info might look cool, but I can't really see how you could exploit them for point queries, since you'll still have to fetch all the points in the selected tiles.

abellgithub commented 3 years ago

+1 Stats can be useful to estimate how many points you'll get for a given spatial request, or how they might be distributed.

The point counts for cells are already available, so I'm not sure what you're suggesting here.

gui2dev commented 3 years ago

Maybe I got mixed up by the discussion above about stats per cell, for which I can't clearly see an efficient use.

Having global stats can give hints for rendering, but maybe you'll need a more specific, per-dimension description.

I guess that's what @connormanning intended with the buckets, which might work for classification, but won't help much for the intensity example @m-schuetz pointed out.

IMHO @hobu is right, you'll need to have specific description for the kind of data stored in a given dimension.

Buckets work fine for enumerations.

A histogram could work with discrete/continuous data to get an approximate representation of the data's distribution.

EDIT: a histogram is just a collection of buckets, each with a given center value and range.

CCInc commented 3 years ago

I am happy with Connor's original proposal. One of the things EPT lacked was codified statistics. As I'm implementing Potree support for COPC, the one thing we definitely need is GPS Time stats, and as Markus said, the stats would be helpful for other dimensions as well.

These statistics would then be stored in the order that the dimensions appear in the point data record format header, followed by statistics for extra-bytes dimensions in the order that they appear.

I think this is the best way to go about this within LAS.

Would it be better to store all the CopcStatistics objects in one VLR?

Must all fixed LAS dimensions be described with entries?

I would say yes, so that the consumer doesn't have to compute which statistics are present and which aren't.

Must all extra bytes dimensions be described?

For simplicity's sake, I would again say yes. It's too hard within the LAS spec to pick and choose dimensions.

Should the entry denote whether a dimension's kind is enumeration, continuous or discrete?

Assuming we store statistics for every dimension in order, I think this should be implied by knowledge of the header format.

What is the meaning of "mean" and "stddev" for a discrete variable or enumeration?

Not sure about this one.

@hobu @abellgithub Let me know your thoughts on adding this to the spec so we can move forward with the potree side of things.

connormanning commented 3 years ago

We talked about this a lot over the last day or two - I think we settled on pretty much the original proposal, with stats entries being required for all dimensions in order, including bit fields (e.g. "Classification Flags" is split out into its 4 constituent fields), and, importantly, with the buckets portion of the proposal removed.

While I still see a lot of utility in having these for Classification, it a) complicates the core specification and b) is a half-measure for the functionality you'd really want for binned data counts. For example, for GpsTime or Intensity you probably want to be able to bin histogram data like min: double, max: double, count: int. Maybe that's useful for some discrete dimensions as well. And maybe you'd actually want to be able to do this over multiple bin sizes for the same dimension so your client can choose the best applicable binning resolution, etc. The point is, it gets complicated quickly and I think we'd rather make a more specialized, but optional, VLR that would capture these use-cases, and the others brought up in this thread.

The other part of the discussion was "which dimensions are required to have statistics (and do bit-fields get entries as well)?". We eventually came to "all of them (and yes)". We talked about perhaps requiring only a subset of the ones that "make sense", because as asked above, what exactly does the mean of an enum mean, or the variance of a bit field? However, sometimes the seemingly silly values can be useful.

For example: the mean of the Overlap bit field is the fraction of points classified as overlapping, so it can be considered a metric for how much aerial lidar flight lines overlap. And knowing that min = max = 0 for Withheld tells you that zero points are considered withheld. Who knows what might be useful downstream. So rather than picking some subset of dimensions, this way a client can choose what they care about without worrying that the information isn't available, and ignore stats that don't make sense to them. All extra-bytes dimensions would be required to have statistics as well.

connormanning commented 3 years ago

Oh and also, remove stddev from this struct. I had intended to remove one of stddev or variance when I originally typed this up, but mistakenly left both of them in. So the struct would consist of min, max, mean, and variance.
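For reference, the per-dimension entry as amended (field names illustrative; the exact layout is still to be drafted):

struct CopcStatistics
{
  double minimum;
  double maximum;
  double mean;
  double variance;
};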

wonder-sk commented 3 years ago

Agreed with Connor's suggestions!

As for the use case for visualization in QGIS:

Just for reference, these are the options offered in QGIS for the min/max range when visualizing raster layers:

[screenshot: QGIS min/max settings for raster layers]

It would be nice to have the same choices for point clouds as well - with the suggested options we would only miss the percentile-based min/max range, which helps in situations with outliers like Markus has mentioned. But different people may want to use different percentiles (90? 95? 98? 99?), which would require more granular entries, so it makes sense to skip all that in the initial spec and possibly introduce some optional advanced dimension stats later if really needed.

CCInc commented 3 years ago

@connormanning Sounds good to me! Would we be able to get a draft into the spec, and I can go ahead and start implementing it within copclib/potree?

We'll need access to it within copc.js as well, if you don't mind updating that.

pierotofy commented 3 years ago

+1 on min/max statistics for the entire domain of points.

I don't think stddev, mean and variance are used much.