More on basic input validation

I anticipate that a lot of user errors will come from passing normalized count data to CAS. Unfortunately, we cannot rely on dtype to catch that: I have seen too many AnnData files with float32 that contain integer counts. Heck, I do that myself all the time :D

A quick and dirty validation is to sample x percent (x ~ 5 to 10) of the non-zero counts (easy if sparse, a bit more expensive if dense) and ensure that their fractional part is < 1e-3. Otherwise, raise an exception with an informative error message. We can also add a flag that disables this integer-count validation (False by default, so validation runs unless the user opts out) for those who know what they're doing.
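Something along these lines, as a rough sketch only. The function name `check_counts_are_integral` and its parameters (`sample_fraction`, `tol`, `seed`) are made up for illustration and are not existing CAS client API. One small deviation from the description above: it measures the distance to the nearest integer rather than the raw fractional part, so that a value like 2.9999 from float32 round-off is not rejected.

```python
import numpy as np
import scipy.sparse as sp


def check_counts_are_integral(X, sample_fraction=0.05, tol=1e-3, seed=0):
    """Raise ValueError if a random sample of non-zero entries is not integer-valued.

    Hypothetical helper: the name and all parameters are assumptions.
    """
    if sp.issparse(X):
        # Cheap path: only look at the stored entries of the sparse matrix.
        values = np.asarray(X.tocsr().data)
    else:
        # Dense path: flattening touches every entry, so it costs more.
        values = np.asarray(X).ravel()

    # Drop zeros (including explicitly stored zeros in sparse input).
    values = values[values != 0]
    if values.size == 0:
        return

    # Sample a fraction of the non-zero entries without replacement.
    rng = np.random.default_rng(seed)
    n_sample = max(1, int(np.ceil(sample_fraction * values.size)))
    sample = rng.choice(values, size=n_sample, replace=False)

    # Distance to the nearest integer (assumption: used instead of the raw
    # fractional part so that float round-off such as 2.9999 still passes).
    deviation = np.abs(sample - np.rint(sample))
    n_bad = int(np.count_nonzero(deviation >= tol))
    if n_bad > 0:
        raise ValueError(
            f"{n_bad} of {n_sample} sampled non-zero values are not integers "
            f"(tolerance {tol}). The input looks normalized or transformed; "
            "raw integer counts are expected."
        )
```

The opt-out flag could then simply skip this call at the point where the client receives the data. Because only a sample is checked, a matrix with a handful of non-integer entries can still slip through; raising `sample_fraction` trades speed for a stricter check.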

cellarium-ai/cellarium-cas #73