Specify correspondence between free fields and standard fields at project level, to compute concentrations and biovolumes

jiho commented 3 years ago

The summary and dwca exports can be in terms of abundances, concentrations, or biovolume.

Abundances #615 #626

For abundances, we already have everything we need.

Concentrations #616 #628

For concentrations, we need the total_water_volume and the subsampling_coefficient. Those are standard BODC terms:

sub-sampling coefficient = http://vocab.nerc.ac.uk/collection/P01/current/SSAMPC01/1/ no unit, in [0,1]
volume sampled of the water body = http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/ in m3

This information often exists in the metadata but is within free fields and may not be in the appropriate unit/format.

We could standardise the name at import and store the data in hard-coded fields, as is done for latitude, longitude, etc. But this would cause discrepancies between the data that is already there (without those fields) and newly imported data. It may also introduce redundancy (e.g. storing the fraction rate as 8 and the subsample coef as 1/8 = 0.125).

Instead, we decide to add a feature, at project level, that allows specifying which free fields correspond to the standard ones. In project settings, there should be a new section "Identification of standard fields". It would have two entries, each with:

a label which is the name of the BODC term, shown as a link to the BODC page, and the unit in square brackets
a text field which allows specifying a formula that involves free fields. This formula can be:
- 1/something : to compute a subsampling coef in [0,1] from a subsampling ratio that is a power of 2 (1/32 for a subsampling at the 32nd)
- something/1000 : to compute a volume in m3 from a volume in L
- etc.

The something part is the name of a free field, at sample or subsample (or object) level. Ideally, it should be a sort of badge that one can pick / drag-drop / autocomplete from the list of fields valid for the current project, to avoid typos.

A UI could look like this

By default the subsampling coefficient is set to 1.
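To make the idea of a formula over free fields concrete, here is a minimal Python sketch of how such a formula could be evaluated. It is not how EcoTaxa does it: the function name `evaluate_formula`, the operator whitelist and the field names `sub_part` and `tot_vol_l` are all assumptions made for illustration. It uses Python syntax, so a power would be written `**` rather than `^`.

```python
import ast
import operator

# Arithmetic operators allowed in a formula (hypothetical whitelist).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate_formula(formula: str, fields: dict) -> float:
    """Evaluate an arithmetic formula whose names refer to free fields."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            # A name must be one of the free fields known for this project.
            if node.id not in fields:
                raise KeyError(f"unknown free field: {node.id}")
            return float(fields[node.id])
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        # Anything else (function calls, attributes, ...) is rejected.
        raise ValueError(f"unsupported expression: {formula!r}")

    return _eval(ast.parse(formula, mode="eval"))


# Hypothetical free-field values: a 1/32 subsample of 271000 L of water.
fields = {"sub_part": 32, "tot_vol_l": 271000.0}
subsampling_coef = evaluate_formula("1/sub_part", fields)
volume_m3 = evaluate_formula("tot_vol_l/1000", fields)
default_coef = evaluate_formula("1", fields)  # the default formula
```

Parsing the formula instead of calling `eval` means a typo in a field name is reported as an unknown field rather than executed as arbitrary code.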

It should be possible to import the settings from another project into the current one, to avoid having to re-specify the formulas every time.

Biovolume #617 #629

To compute biovolume, in addition to the fields necessary for concentration, we should have a field for the volume of individual objects.

This field is to be defined as above, in the same section.

The label should be "Individual object volume in mm3" (NB: there is no BODC term for this, for now) and then a formula interface. The classic formulas for this are

esd = 2 * sqrt(area/pi) * px_size
vol_spherical = 4/3 * pi * (esd/2)^3
vol_ellipsoid = 4/3 * pi * (major/2 * px_size) * (minor/2 * px_size)^2

The help text should state: "This should specify a formula to compute the volume of each object in mm3. Classic formulae are: equivalent spherical volume = 4/3 * pi * (sqrt(area/pi) * pixel_size)^3; equivalent ellipsoidal volume = 4/3 * pi * (major_axis/2 * pixel_size) * (minor_axis/2 * pixel_size)^2"
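As a cross-check of the classic formulas, here is a small Python sketch. The function and argument names are hypothetical; area and axes are assumed to be in pixels and the pixel size in mm, so the volumes come out in mm3.

```python
import math


def esd_mm(area_px: float, px_size_mm: float) -> float:
    """Equivalent spherical diameter, in mm, from an area in pixels."""
    return 2 * math.sqrt(area_px / math.pi) * px_size_mm


def vol_spherical_mm3(area_px: float, px_size_mm: float) -> float:
    """Volume of the sphere whose cross-section has the same area as the object."""
    return 4 / 3 * math.pi * (esd_mm(area_px, px_size_mm) / 2) ** 3


def vol_ellipsoid_mm3(major_px: float, minor_px: float, px_size_mm: float) -> float:
    """Volume of a prolate ellipsoid; major and minor are full axis lengths."""
    return 4 / 3 * math.pi * (major_px / 2 * px_size_mm) * (minor_px / 2 * px_size_mm) ** 2
```

For a perfectly circular object of diameter d pixels (area = pi * (d/2)^2, major = minor = d), both formulas give the same volume, which is a convenient sanity check.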

grololo06 commented 3 years ago

Need to add in UI the "absent value" marker(s) e.g. 999999 for tot_vol feature

grololo06 commented 3 years ago

For biovol, as ellipsoid is more accurate than spherical, algo should probably be:

If features 'major' and 'minor' are both present and valid values then use them
Else if feature 'area' is present and valid then use it
Else cannot compute
Which should be generalized as "when there are several options for a computed variable, establish an ordered list of formulae, the first valid computation is the final result"
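One way the "ordered list of formulae" idea could look, as a Python sketch. The feature names `major`, `minor` and `area` come from the comment above; `px_size`, the validity rule and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, Optional, Sequence


def is_valid(value) -> bool:
    """Assumed validity rule: a finite, strictly positive number."""
    return isinstance(value, (int, float)) and math.isfinite(value) and value > 0


def ellipsoid_volume(obj: dict) -> Optional[float]:
    major, minor, px = obj.get("major"), obj.get("minor"), obj.get("px_size")
    if not (is_valid(major) and is_valid(minor) and is_valid(px)):
        return None
    return 4 / 3 * math.pi * (major / 2 * px) * (minor / 2 * px) ** 2


def spherical_volume(obj: dict) -> Optional[float]:
    area, px = obj.get("area"), obj.get("px_size")
    if not (is_valid(area) and is_valid(px)):
        return None
    return 4 / 3 * math.pi * (math.sqrt(area / math.pi) * px) ** 3


def first_valid(
    formulae: Sequence[Callable[[dict], Optional[float]]], obj: dict
) -> Optional[float]:
    """Return the result of the first formula that can be computed, else None."""
    for formula in formulae:
        result = formula(obj)
        if result is not None:
            return result
    return None


# Ordered by preference: ellipsoid first, spherical as a fallback.
BIOVOLUME_FORMULAE = (ellipsoid_volume, spherical_volume)
```

An object with valid `major` and `minor` gets the ellipsoidal volume, one with only `area` gets the spherical one, and one with neither yields `None` ("cannot compute").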

grololo06 commented 3 years ago

I guess it's ]0,1] as 0 is not a valid value for the subsampling coefficient.

grololo06 commented 3 years ago

For biovolume, I assume that pixel_size is 'particle_pixel_size_mm' at processing level.

grololo06 commented 3 years ago

tara_oceans_2010_038_d_bongo_300_b      271.0   Volume sampled of the water body    Cubic metres    http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/    http://vocab.nerc.ac.uk/collection/P06/current/MCUB/
tara_oceans_2010_038_d_bongo_300_b      300.0   Sampling net mesh size  Micrometres (microns)   http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/    http://vocab.nerc.ac.uk/collection/P06/current/UMIC/
tara_oceans_2010_038_d_bongo_300_b      0.283   Sampling device aperture surface area   Square metres   http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/    http://vocab.nerc.ac.uk/collection/P06/current/UMSQ/
tara_oceans_2010_038_d_bongo_300_c  tara_oceans_2010_038_d_bongo_300_c_45074    7.114391    Abundance of biological entity specified elsewhere per unit volume of the water body    Number per cubic metre  http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/    http://vocab.nerc.ac.uk/collection/P06/current/UPMM/
tara_oceans_2010_038_d_bongo_300_c  tara_oceans_2010_038_d_bongo_300_c_45074    11.636373   Wet weight biomass of biological entity specified elsewhere per unit area of the bed    Cubic millimetres per cubic metre   http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/    http://vocab.nerc.ac.uk/collection/P06/current/CMCM/

Would need calculation from someone else for verifying...

jiho commented 3 years ago

For biovol, as ellipsoid is more accurate than spherical, algo should probably be:

If features 'major' and 'minor' are both present and valid values then use them

Else if feature 'area' is present and valid then use it

Else cannot compute

Which should be generalized as "when there are several options for a computed variable, establish an ordered list of formulae, the first valid computation is the final result"

People will come up with unforeseen ways to compute the biovolume so the spec above is actually a formula to compute the biovolume of an individual object. Then people choose what they want: spherical, ellipsoid, whatever-their-solution-is. What EcoTaxa does is just sum those at (sub)sample level.

I guess it's ]0,1] as 0 is not a valid value for the subsampling coefficient.

Yes 😁

For biovolume, I assume that pixel_size is 'particle_pixel_size_mm' at processing level.

Yes

jiho commented 3 years ago

tara_oceans_2010_038_d_bongo_300_b        271.0   Volume sampled of the water body    Cubic metres    http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/    http://vocab.nerc.ac.uk/collection/P06/current/MCUB/
tara_oceans_2010_038_d_bongo_300_b        300.0   Sampling net mesh size  Micrometres (microns)   http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/    http://vocab.nerc.ac.uk/collection/P06/current/UMIC/
tara_oceans_2010_038_d_bongo_300_b        0.283   Sampling device aperture surface area   Square metres   http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/    http://vocab.nerc.ac.uk/collection/P06/current/UMSQ/
tara_oceans_2010_038_d_bongo_300_c    tara_oceans_2010_038_d_bongo_300_c_45074    7.114391    Abundance of biological entity specified elsewhere per unit volume of the water body    Number per cubic metre  http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/    http://vocab.nerc.ac.uk/collection/P06/current/UPMM/
tara_oceans_2010_038_d_bongo_300_c    tara_oceans_2010_038_d_bongo_300_c_45074    11.636373   Wet weight biomass of biological entity specified elsewhere per unit area of the bed    Cubic millimetres per cubic metre   http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/    http://vocab.nerc.ac.uk/collection/P06/current/CMCM/

Would need calculation from someone else for verifying...

Concentration is OK, biovolume is not. Here is how I checked:

Start by exporting from EcoTaxa with "internal ids" option ticked.

Then

library("tidyverse")
d <- read_tsv("ecotaxa_export_397_20210223_1751.tsv")
stats <- d %>% 
  # compute relevant variables the same way they will be specified within the project settings
  mutate(
    # subsampling coef (in ]0,1])
    subsampling_coef = 1 / acq_sub_part,
    # volume, in m3 (here it is already the case)
    volume_sampled_m3 = sample_tot_vol,
    # organism volume in mm3
    organism_volume = 4/3 * pi * (sqrt(object_area/pi) * process_particle_pixel_size_mm)^3
  ) %>%
  # perform the computations EcoTaxa will have to do internally
  # "individual" concentration and biovolume (those are kind of meaningless but they allow to simply sum afterwards)
  mutate(
    individual_concentration = 1 / subsampling_coef / volume_sampled_m3,
    individual_biovolume = organism_volume / subsampling_coef / volume_sampled_m3
  ) %>% 
  # now sum per sample and taxon
  group_by(sample_id, classif_id) %>% 
  summarise(
    concentration=sum(individual_concentration),
    biovolume=sum(individual_biovolume)
  ) %>% 
  ungroup()

# finally check the target sample + taxon
filter(stats, sample_id == "tara_oceans_2010_038_d_bongo_300_c", classif_id==45074)

Result:

# A tibble: 1 x 4
  sample_id                          classif_id concentration biovolume
  <chr>                                   <dbl>         <dbl>     <dbl>
1 tara_oceans_2010_038_d_bongo_300_c      45074      7.114391  2.486203

jiho commented 3 years ago

And biovolume is : http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/ (not the current "wet weight biomass ...")

grololo06 commented 3 years ago

Much better now:

tara_oceans_2010_038_d_bongo_300_b      271.0   Volume sampled of the water body    Cubic metres        http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/    http://vocab.nerc.ac.uk/collection/P06/current/MCUB/
tara_oceans_2010_038_d_bongo_300_b      Bongo net   Sampling instrument name        http://vocab.nerc.ac.uk/collection/L22/current/NETT0176/    http://vocab.nerc.ac.uk/collection/Q01/current/Q0100002/    
tara_oceans_2010_038_d_bongo_300_b      300.0   Sampling net mesh size  Micrometres (microns)       http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/    http://vocab.nerc.ac.uk/collection/P06/current/UMIC/
tara_oceans_2010_038_d_bongo_300_b      0.283   Sampling device aperture surface area   Square metres       http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/    http://vocab.nerc.ac.uk/collection/P06/current/UMSQ/
tara_oceans_2010_038_d_bongo_300_c  tara_oceans_2010_038_d_bongo_300_c_45074    7.114391    Abundance of biological entity specified elsewhere per unit volume of the water body    Number per cubic metre      http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/    http://vocab.nerc.ac.uk/collection/P06/current/UPMM/
tara_oceans_2010_038_d_bongo_300_c  tara_oceans_2010_038_d_bongo_300_c_45074    2.486203    Biovolume of biological entity specified elsewhere per unit volume of the water body    Cubic millimetres per cubic metre       http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/    http://vocab.nerc.ac.uk/collection/P06/current/CMCM/

grololo06 commented 2 years ago

One can find in BO.ProjectVarsDefault.py the hard-coded variables in use today. This issue consists in loading the definitions from the project instead of having a single hard-coded set. This requires some additions to the DB structure, plus management code. Some UI would be relevant as well, and maybe some clever detection from the project data. Considering the possibilities (e.g. wait for the UI to be redone, or add to the present one?), I put this issue in "to clarify". It might also be a good idea to split the work into 2 chunks: back-end (specific API or amend an existing one?) and front-end.

grololo06 commented 2 years ago

While designing the API, care should be taken that more information than just a formula is needed when exposing the variables. @see https://github.com/ecotaxa/ecotaxa/issues/620#issuecomment-1120359073

grololo06 commented 2 years ago

The fact that we can extract values from a sample, its subsamples and even objects has a significant impact on the computing code. And it is unclear how to mix, in a computation, values from entities at different levels.

jiho commented 2 years ago

All the computation is done at object level (i.e. we compute "individual concentrations" or "individual biovolumes") and then can be summed at any aggregation level (acquisition/subsample, sample, project).

So it all starts with a join from object to acq, then from obj+acq to sample. This gives a table with one row per object and the computation proceeds from there.

See the code in https://github.com/ecotaxa/ecotaxa/issues/619#issuecomment-784397190 (comment above) for more precise info.
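The same object-level logic, sketched in plain Python rather than R. The field names `tot_vol` and `sub_part` are those mentioned in this thread; the data, the `volume_mm3` key and the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical data: volume (m3) at sample level, subsampling at acquisition level.
samples = {"s1": {"tot_vol": 100.0}}
acquisitions = {
    "a1": {"sample_id": "s1", "sub_part": 2},
    "a2": {"sample_id": "s1", "sub_part": 4},
}
objects = [
    {"acq_id": "a1", "classif_id": 1, "volume_mm3": 0.5},
    {"acq_id": "a1", "classif_id": 1, "volume_mm3": 1.5},
    {"acq_id": "a2", "classif_id": 1, "volume_mm3": 1.0},
    {"acq_id": "a2", "classif_id": 2, "volume_mm3": 2.0},
]


def object_rows(objects, acquisitions, samples):
    """Join object -> acquisition -> sample, yielding one row per object."""
    for obj in objects:
        acq = acquisitions[obj["acq_id"]]
        smp = samples[acq["sample_id"]]
        subsampling_coef = 1 / acq["sub_part"]
        volume_m3 = smp["tot_vol"]
        yield {
            "sample_id": acq["sample_id"],
            "classif_id": obj["classif_id"],
            # "individual" values: only meaningful once summed
            "individual_concentration": 1 / subsampling_coef / volume_m3,
            "individual_biovolume": obj["volume_mm3"] / subsampling_coef / volume_m3,
        }


def aggregate(rows, level="sample_id"):
    """Sum the individual values per (aggregation level, taxon)."""
    totals = defaultdict(lambda: {"concentration": 0.0, "biovolume": 0.0})
    for row in rows:
        key = (row[level], row["classif_id"])
        totals[key]["concentration"] += row["individual_concentration"]
        totals[key]["biovolume"] += row["individual_biovolume"]
    return dict(totals)


per_sample = aggregate(object_rows(objects, acquisitions, samples))
```

Because every row already carries its own contribution, changing the aggregation level only changes the grouping key, not the arithmetic.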

grololo06 commented 2 years ago

Hi, my problem probably comes from this (in bold below) part of a sentence from https://github.com/ecotaxa/ecotaxa/issues/619#issue-806754032: """ Instead, we decide to add a feature, at project level, that allows to specify which free fields correspond to the standard ones. In project settings, it should be a new section "Identification of standard fields" With two text fields with:

a label which is the name of the BODC term, shown as a link to the BODC page, and the unit in square brackets
a text field which allows to specify a formula that involves free fields. This formula can be:
- 1/something : to compute a subsampling coef in [0,1] from a subsampling ratio that is a power of 2 (1/32 for a subsampling at the 32nd)
- something/1000 : to compute a volume in m3 from a volume in L
- etc. The something part is the name of a free field, at sample or subsample (or object) level. """

For concentrations, so far we have in code, not generic at all and hard-coded:

water_volume at Sample level, in 'tot_vol' free var
subsampling_coef at Acquisition AKA Subsample level, derived from 'sub_part'

If I apply the requirement in bold above, I deduce that potentially, 'tot_vol' could be at subsample level, and even object level. In which case it's not so clear what to do, as I guess that the formulas 'work' only when total sample volume is an input.

My feeling is that BODC terms imply (or specify) a level, and that, e.g:

computing http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/ from sample.tot_vol
differs from
computing http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/ from subsample.subs_vol (if it ever exists...)

I think we want BODC quantities as input of the calculations?

jiho commented 2 years ago

Indeed, part of this spec is unclear.

Abundance

This amounts to summing the number of objects within a grouping level: sample or subsample. This is currently implemented in the summary export (and DwCA export). Most of the time this is useless, however.

Concentration

http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/

The first step is determining what an object in EcoTaxa is representative of in real life. This is determined by the subsampling_coef, which should be a number in ]0,1]. If every object was imaged, then it is 1. If the sample was split in half, it is 0.5, etc. The variables that are used to compute this subsampling coefficient cannot be hardcoded, even per instrument, because they depend on the sample preparation procedure. They may be at sample, subsample or object level, and the formula to compute the coefficient becomes a property of the project. The project manager should get a UI to specify this in project settings (see the first entry in the issue) and it is then stored in the settings with the syntax prefix.field, where prefix is smp, sub (or currently acq or pro), or obj. API-wise, it is just another entry in the project settings model, following the syntax above and with some consistency checks (i.e. that the field exists).
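The consistency check on the prefix.field syntax could be as simple as the following Python sketch. The prefixes are the ones listed above; the function name, the regular expression and the shape of `known_fields` are assumptions, not the actual back-end API.

```python
import re
from typing import Dict, List, Set

# prefix.field references, with the prefixes named in the spec above.
FIELD_REF = re.compile(r"\b(smp|sub|acq|pro|obj)\.([A-Za-z_][A-Za-z0-9_]*)")


def missing_fields(formula: str, known_fields: Dict[str, Set[str]]) -> List[str]:
    """Return the prefix.field references in `formula` unknown to the project."""
    missing = []
    for prefix, field in FIELD_REF.findall(formula):
        if field not in known_fields.get(prefix, set()):
            missing.append(f"{prefix}.{field}")
    return missing


# Hypothetical free fields declared in a project, per level.
known_fields = {"smp": {"tot_vol"}, "acq": {"sub_part"}, "obj": {"area"}}
```

An empty result means every reference resolves; a non-empty one can be shown to the project manager as a list of typos to fix before the settings are saved.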

Then 1/subsampling_coef is summed within a grouping level. As hinted at in #620 and #616, this grouping level is determined by where water_volume is specified: sample or subsample. If volume is specified at sample level, it is impossible to compute concentrations at subsample level. The "formula" to compute concentration would be

GROUP_BY sample, category
SUM(1 / subsampling_coef) / water_volume

When volume is specified at subsample level, it is possible to compute concentration at subsample level using

GROUP_BY subsample, category
SUM(1 / subsampling_coef) / water_volume

but also at sample level by averaging the concentrations at subsample level[1].
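A Python sketch of the subsample-level case, followed by the unweighted average at sample level. Names and data are hypothetical. One assumption is made explicit in the code: a category absent from a subsample counts as a concentration of 0 in the average. Subsamples that contain no object at all would not appear in this object table and would need to be supplied separately.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table with one row per object.
objects = [
    {"sample": "s1", "subsample": "s1_a", "category": "Copepoda", "subsampling_coef": 1.0},
    {"sample": "s1", "subsample": "s1_a", "category": "Copepoda", "subsampling_coef": 1.0},
    {"sample": "s1", "subsample": "s1_a", "category": "Salpida", "subsampling_coef": 1.0},
    {"sample": "s1", "subsample": "s1_b", "category": "Copepoda", "subsampling_coef": 0.5},
]
water_volume = {"s1_a": 0.5, "s1_b": 0.25}  # m3, specified per subsample


def concentration_by_subsample(objects, water_volume):
    """SUM(1 / subsampling_coef) / water_volume, per (subsample, category)."""
    sums = defaultdict(float)
    for obj in objects:
        sums[(obj["subsample"], obj["category"])] += 1 / obj["subsampling_coef"]
    return {key: total / water_volume[key[0]] for key, total in sums.items()}


def concentration_by_sample(objects, water_volume):
    """Unweighted mean of the subsample concentrations, per (sample, category)."""
    per_sub = concentration_by_subsample(objects, water_volume)
    subsamples, categories = defaultdict(set), defaultdict(set)
    for obj in objects:
        subsamples[obj["sample"]].add(obj["subsample"])
        categories[obj["sample"]].add(obj["category"])
    return {
        # assumption: a category missing from a subsample contributes 0
        (smp, cat): mean([per_sub.get((sub, cat), 0.0) for sub in sorted(subs)])
        for smp, subs in subsamples.items()
        for cat in categories[smp]
    }


per_subsample = concentration_by_subsample(objects, water_volume)
per_sample = concentration_by_sample(objects, water_volume)
```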

The specification of volume also needs to be a formula, since it needs to be in m3 and is not always stored in m3.

Then, when the data is exported (in summary export, in DwCA export), the formulas are read from the project settings and the concentration is computed according to those. The grouping level for the export is determined by the level at which volume is specified: sample if volume is specified at sample level; subsample or sample if volume is specified at subsample level, with a default to subsample. The formula is not specified dynamically at export time, so those API endpoints will need little modification.
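The rule for choosing the export grouping level can be written down as a small Python function. This is only a sketch of the rule stated above; the function name, the string values and the error handling are assumptions.

```python
from typing import Optional


def export_grouping_level(volume_level: str, requested: Optional[str] = None) -> str:
    """Pick the grouping level of an export from where water volume is specified."""
    if volume_level == "sample":
        # Volume known per sample only: subsample concentrations are impossible.
        if requested not in (None, "sample"):
            raise ValueError("volume is at sample level: cannot group by " + requested)
        return "sample"
    if volume_level == "subsample":
        # Both levels are possible; subsample is the default.
        if requested is None:
            return "subsample"
        if requested in ("sample", "subsample"):
            return requested
        raise ValueError("unknown grouping level: " + requested)
    raise ValueError("unknown volume level: " + volume_level)
```

Since the level follows from the project settings, an export endpoint would only need an optional parameter for the case where volume is at subsample level and the user prefers per-sample output.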

Biovolume

http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/

The process is almost exactly the same as for concentrations, except that, instead of 1 / subsampling_coef, one should get object_volume / subsampling_coef where object_volume is in mm3 and computed using a formula that is also specified in the project settings, from variables at object, subsample or sample level (typically: area, px_2_mm, etc.).

[1] One could want to compute a weighted average, weighting by the water volume in each subsample, but let us not dive into this for now.

jiho commented 2 years ago

PS: Now that the back and front are separate, this issue could indeed be split between backend + API implementation and front-end UI implementation, but this is true of many others. I'll get to it, at some point (but in practice, when a feature is not exposed in the UI, it is absent for most users, so the two go hand in hand).

jiho commented 2 years ago

As a complement, the volume (hence the aggregation) should be at sample level for ZooScan, FlowCam; it should be at subsample level for UVP, ISIIS. It can be at both for IFCB depending how the sample/subsample are defined.

grololo06 commented 1 year ago

I think that the initial comment describing a UI is maybe now incomplete so I try here to summarize the potential additions:

From https://github.com/ecotaxa/ecotaxa_front/issues/619#issuecomment-779728273 : Should we add this "unknown" value for volume, or add it in another form, or just ignore and assume the values are always valid (present state of the code)
From https://github.com/ecotaxa/ecotaxa_front/issues/619#issuecomment-784377143 : Should we add the possibility to have complex/multiple formulas for a given quantity or stick to a simple/single one ? (present state of the code)

The second point has more impact in terms of code, as presently the formulas are expressions, e.g. https://github.com/ecotaxa/ecotaxa_back/blob/8217ffe4820a8211f92068d0b8e2089e42f1816b/QA/py/tests/test_export_sci.py#L17. Should we need to add conditionals inside them, it would be trickier.

I am opening a shorter issue to summarize the Entity-Relationship aspects of the present topic, which will impact DB, API and UI.

grololo06 commented 1 year ago

There will be a simple transformation from UI to send the values in API-compliant form, pls look for 'bodc_variables' in back-end source.

jiho commented 1 year ago

Add an "equation comment" field to explain the content of the equation.

grololo06 commented 1 year ago

Add an "equation comment" field to explain the content of the equation.

Hello, do you mean a general help about the fields or a specific comment per formula, entered by the user, for the project?

jiho commented 1 year ago

The second: a text field to serve as a comment, entered by the user. Optional. Typically this will be used to explain the terms of the formula and give a bibliographic reference.
