Open jiho opened 3 years ago
Need to add in UI the "absent value" marker(s) e.g. 999999 for tot_vol
feature
For biovol, as ellipsoid is more accurate than spherical, algo should probably be:
- If features 'major' and 'minor' are both present and valid values then use them
- Else if feature 'area' is present and valid then use it
- Else cannot compute
Which should be generalized as "when there are several options for a computed variable, establish an ordered list of formulae, the first valid computation is the final result"
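The "ordered list of formulae" rule could be sketched as below. This is only an illustration: `first_valid`, `is_valid` and the two formula functions are hypothetical names, the validity test (present, finite, strictly positive) is an assumption, and the feature keys `major`, `minor` and `area` are taken from the list above, assumed here to be already scaled to mm.

```python
import math
from typing import Callable, Mapping, Optional, Sequence

Features = Mapping[str, Optional[float]]
Formula = Callable[[Features], Optional[float]]


def is_valid(value: Optional[float]) -> bool:
    """Assumed validity rule: present, finite and strictly positive."""
    return value is not None and math.isfinite(value) and value > 0


def from_major_minor(f: Features) -> Optional[float]:
    # Ellipsoidal volume, written as in the issue's help text.
    major, minor = f.get("major"), f.get("minor")
    if is_valid(major) and is_valid(minor):
        return 4 / 3 * math.pi * major * minor ** 2
    return None


def from_area(f: Features) -> Optional[float]:
    # Equivalent spherical volume: sphere whose cross-section has this area.
    area = f.get("area")
    if is_valid(area):
        return 4 / 3 * math.pi * math.sqrt(area / math.pi) ** 3
    return None


def first_valid(formulae: Sequence[Formula], f: Features) -> Optional[float]:
    """Return the first result that could be computed, or None if none can."""
    for formula in formulae:
        result = formula(f)
        if result is not None:
            return result
    return None


# Order matters: ellipsoid first, sphere as fallback.
BIOVOLUME_FORMULAE = [from_major_minor, from_area]
```

Calling `first_valid(BIOVOLUME_FORMULAE, {"area": 12.0})` falls through to the spherical formula, and an object with no usable feature yields `None` ("cannot compute").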
People will come up with unforeseen ways to compute the biovolume so the spec above is actually a formula to compute the biovolume of an individual object. Then people choose what they want: spherical, ellipsoid, whatever-their-solution-is. What EcoTaxa does is just sum those at (sub)sample level.
I guess it's ]0,1] as 0 is not a valid value for the subsampling coefficient.
Yes 😁
For biovolume, I assume that pixel_size is 'particle_pixel_size_mm' at processing level.
Yes
45074
tara_oceans_2010_038_d_bongo_300_b 271.0 Volume sampled of the water body Cubic metres http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/ http://vocab.nerc.ac.uk/collection/P06/current/MCUB/
tara_oceans_2010_038_d_bongo_300_b 300.0 Sampling net mesh size Micrometres (microns) http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/ http://vocab.nerc.ac.uk/collection/P06/current/UMIC/
tara_oceans_2010_038_d_bongo_300_b 0.283 Sampling device aperture surface area Square metres http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/ http://vocab.nerc.ac.uk/collection/P06/current/UMSQ/
tara_oceans_2010_038_d_bongo_300_c tara_oceans_2010_038_d_bongo_300_c_45074 7.114391 Abundance of biological entity specified elsewhere per unit volume of the water body Number per cubic metre http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/ http://vocab.nerc.ac.uk/collection/P06/current/UPMM/
tara_oceans_2010_038_d_bongo_300_c tara_oceans_2010_038_d_bongo_300_c_45074 11.636373 Wet weight biomass of biological entity specified elsewhere per unit area of the bed Cubic millimetres per cubic metre http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/ http://vocab.nerc.ac.uk/collection/P06/current/CMCM/
Would need calculation from someone else for verifying...
Concentration is OK, biovolume is not. Here is how I checked:
Start by exporting from EcoTaxa with "internal ids" option ticked.
Then
    library("tidyverse")

    d <- read_tsv("ecotaxa_export_397_20210223_1751.tsv")

    stats <- d %>%
      # compute relevant variables the same way they will be specified within the project settings
      mutate(
        # subsampling coef (in ]0,1])
        subsampling_coef = 1 / acq_sub_part,
        # volume, in m3 (here it is already the case)
        volume_sampled_m3 = sample_tot_vol,
        # organism volume in mm3
        organism_volume = 4/3 * pi * (sqrt(object_area/pi) * process_particle_pixel_size_mm)^3
      ) %>%
      # perform the computations EcoTaxa will have to do internally
      # "individual" concentration and biovolume (those are kind of meaningless but they allow to simply sum afterwards)
      mutate(
        individual_concentration = 1 / subsampling_coef / volume_sampled_m3,
        individual_biovolume = organism_volume / subsampling_coef / volume_sampled_m3
      ) %>%
      # now sum per sample and taxon
      group_by(sample_id, classif_id) %>%
      summarise(
        concentration = sum(individual_concentration),
        biovolume = sum(individual_biovolume)
      ) %>%
      ungroup()

    # finally check the target sample + taxon
    filter(stats, sample_id == "tara_oceans_2010_038_d_bongo_300_c", classif_id == 45074)
Result:
    # A tibble: 1 x 4
      sample_id                          classif_id concentration biovolume
      <chr>                                   <dbl>         <dbl>     <dbl>
    1 tara_oceans_2010_038_d_bongo_300_c      45074      7.114391  2.486203
And the term for biovolume is http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/ (not the current "wet weight biomass ...").
Much better now:
tara_oceans_2010_038_d_bongo_300_b 271.0 Volume sampled of the water body Cubic metres http://vocab.nerc.ac.uk/collection/P01/current/VOLWBSMP/ http://vocab.nerc.ac.uk/collection/P06/current/MCUB/
tara_oceans_2010_038_d_bongo_300_b Bongo net Sampling instrument name http://vocab.nerc.ac.uk/collection/L22/current/NETT0176/ http://vocab.nerc.ac.uk/collection/Q01/current/Q0100002/
tara_oceans_2010_038_d_bongo_300_b 300.0 Sampling net mesh size Micrometres (microns) http://vocab.nerc.ac.uk/collection/Q01/current/Q0100015/ http://vocab.nerc.ac.uk/collection/P06/current/UMIC/
tara_oceans_2010_038_d_bongo_300_b 0.283 Sampling device aperture surface area Square metres http://vocab.nerc.ac.uk/collection/Q01/current/Q0100017/ http://vocab.nerc.ac.uk/collection/P06/current/UMSQ/
tara_oceans_2010_038_d_bongo_300_c tara_oceans_2010_038_d_bongo_300_c_45074 7.114391 Abundance of biological entity specified elsewhere per unit volume of the water body Number per cubic metre http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/ http://vocab.nerc.ac.uk/collection/P06/current/UPMM/
tara_oceans_2010_038_d_bongo_300_c tara_oceans_2010_038_d_bongo_300_c_45074 2.486203 Biovolume of biological entity specified elsewhere per unit volume of the water body Cubic millimetres per cubic metre http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/ http://vocab.nerc.ac.uk/collection/P06/current/CMCM/
One can find in BO.ProjectVarsDefault.py the hard-coded variables in use today. This issue consists in loading the definitions from the project instead of having a single hard-coded set. Some additions to the DB structure are needed, plus management code. Some UI would be relevant as well, and maybe some clever detection from the project data. Considering the possibilities (e.g. wait for the UI to be redone, or add to the present one?), I put this issue in "to clarify". It might also be a good idea to split the work into 2 chunks: back-end (specific API or amend an existing one?) and front-end.
While designing the API, care should be taken that more information than just a formula is needed when exposing the variables. @see https://github.com/ecotaxa/ecotaxa/issues/620#issuecomment-1120359073
The fact that we can extract values from a sample, its subsamples and even its objects has a significant impact on the computing code. And it's unclear how to mix, in a computation, values from entities at different levels.
All the computation is done at object level (i.e. we compute "individual concentrations" or "individual biovolumes") and then can be summed at any aggregation level (acquisition/subsample, sample, project).
So it all starts with a join from object to acq, then from obj+acq to sample. This gives a table with one row per object and the computation proceeds from there.
See the code in https://github.com/ecotaxa/ecotaxa/issues/619#issuecomment-784397190 (comment above) for more precise info.
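A minimal sketch of that join-then-sum flow, using plain dictionaries. The data values are invented for illustration, and the field names (`sub_part`, `tot_vol`, `classif_id`) simply mirror the R check above; none of this is the actual back-end code.

```python
from collections import defaultdict

# Hypothetical data: one sample, two acquisitions (subsamples), three objects.
samples = {"s1": {"tot_vol": 100.0}}  # volume already in m3
acquisitions = {
    "a1": {"sample": "s1", "sub_part": 2.0},
    "a2": {"sample": "s1", "sub_part": 4.0},
}
objects = [
    {"acq": "a1", "classif_id": 45074},
    {"acq": "a1", "classif_id": 45074},
    {"acq": "a2", "classif_id": 45074},
]


def joined_rows(objects, acquisitions, samples):
    """Yield one flat row per object: object, then its acquisition, then its sample."""
    for obj in objects:
        acq = acquisitions[obj["acq"]]
        smp = samples[acq["sample"]]
        yield {
            **obj,
            "sample": acq["sample"],
            "sub_part": acq["sub_part"],
            "tot_vol": smp["tot_vol"],
        }


def concentrations(rows):
    """Sum "individual concentrations" per (sample, taxon)."""
    totals = defaultdict(float)
    for row in rows:
        subsampling_coef = 1 / row["sub_part"]
        individual_concentration = 1 / subsampling_coef / row["tot_vol"]
        totals[(row["sample"], row["classif_id"])] += individual_concentration
    return dict(totals)


result = concentrations(joined_rows(objects, acquisitions, samples))
```

With these made-up numbers the single (sample, taxon) pair ends up at about 0.08 individuals per m3: two objects counted twice each and one counted four times, divided by 100 m3.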
Hi,
My problem probably comes from this part of a sentence (in bold below) from https://github.com/ecotaxa/ecotaxa/issues/619#issue-806754032:
"""
Instead, we decide to add a feature, at project level, that allows to specify which free fields correspond to the standard ones. In project settings, it should be a new section "Identification of standard fields" with two text fields with:
- `1/something`: to compute a subsampling coef in [0,1] from a subsampling ratio that is a power of 2 (1/32 for a subsampling at the 32nd)
- `something/1000`: to compute a volume in m3 from a volume in L

The `something` part is the name of a free field, **at sample or subsample (or object) level**.
"""
For concentrations, so far we have in code, not generic at all and hard-coded:
If I apply the requirement in bold above, I deduce that potentially, 'tot_vol' could be at subsample level, and even object level. In which case it's not so clear what to do, as I guess that the formulas 'work' only when total sample volume is an input.
My feeling is that BODC terms imply (or specify) a level, and that, e.g:
I think we want BODC quantities as input of the calculations?
Indeed, part of this spec is unclear.
This amounts to counting the objects within a grouping level: sample or subsample. This is currently implemented in summary export (and DwCA export). Most of the time this is useless however.
http://vocab.nerc.ac.uk/collection/P01/current/SDBIOL01/
The first step is determining what an object in EcoTaxa is representative of in real life. This is determined by the `subsampling_coef`, which should be a number in ]0,1]. If every object was imaged, then it is 1. If the sample was split in half, it is 0.5, etc. The variables that are used to compute this subsampling coefficient cannot be hardcoded, even per instrument, because they depend on the sample preparation procedure. They may be at sample, subsample or object level and the formula to compute it becomes a property of the project. The project manager should get a UI to specify this in project settings (see the first entry in the issue) and it is then stored in the settings with the syntax `prefix.field`, where `prefix` is `smp`, `sub` (or currently `acq` or `pro`), or `obj`. API-wise, it is just another entry in the project settings model, following the syntax above and with some consistency checks (i.e. that the field exists).
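The consistency check on `prefix.field` references might look like the sketch below. `field_references`, `check_formula` and the shape of `known_fields` are hypothetical, and mapping both `acq` and `pro` to the subsample level is an assumption drawn from the sentence above, not from the back-end code.

```python
import re
from typing import List, Mapping, Set, Tuple

# Assumed mapping from prefix to entity level.
LEVEL_BY_PREFIX = {
    "smp": "sample",
    "sub": "subsample",
    "acq": "subsample",
    "pro": "subsample",
    "obj": "object",
}

# A reference is three lowercase letters, a dot, then an identifier.
_REFERENCE = re.compile(r"\b([a-z]{3})\.([A-Za-z_][A-Za-z0-9_]*)")


def field_references(formula: str) -> List[Tuple[str, str]]:
    """Extract every (prefix, field) pair mentioned in a formula."""
    return _REFERENCE.findall(formula)


def check_formula(formula: str, known_fields: Mapping[str, Set[str]]) -> List[str]:
    """Return human-readable problems; an empty list means the formula is consistent."""
    errors = []
    for prefix, field in field_references(formula):
        if prefix not in LEVEL_BY_PREFIX:
            errors.append(f"unknown prefix: {prefix}")
        elif field not in known_fields.get(prefix, set()):
            errors.append(f"unknown field: {prefix}.{field}")
    return errors
```

For a project whose free fields are `{"acq": {"sub_part"}, "smp": {"tot_vol"}}`, the formula `1/acq.sub_part` passes, while `smp.tot_volume/1000` is reported as referencing an unknown field.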
Then `1/subsampling_coef` is summed within a grouping level. As hinted at in #620 and #616, this grouping level is determined by where `water_volume` is specified: sample or subsample. If volume is specified at sample level, it is impossible to compute concentrations at subsample level. The "formula" to compute concentration would be

    GROUP_BY sample, category
    SUM(1 / subsampling_coef) / water_volume

When volume is specified at subsample level, it is possible to compute concentration at subsample level using

    GROUP_BY subsample, category
    SUM(1 / subsampling_coef) / water_volume

but also at sample level by averaging the concentrations at subsample level[1].
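The two grouping rules above can be sketched as follows. Function and key names are hypothetical, rows are assumed to be already joined (one per object), and `water_volume` is assumed constant within a group. Note one limitation of this sketch: a category absent from a subsample is skipped in the average rather than counted as zero.

```python
from collections import defaultdict
from statistics import mean


def concentration_by_group(rows, group_key):
    """SUM(1 / subsampling_coef) / water_volume within each (group, category)."""
    sums = defaultdict(float)
    volumes = {}
    for row in rows:
        key = (row[group_key], row["category"])
        sums[key] += 1 / row["subsampling_coef"]
        volumes[key] = row["water_volume"]  # assumed constant within the group
    return {key: sums[key] / volumes[key] for key in sums}


def sample_level_from_subsamples(rows):
    """Unweighted mean, per sample, of the subsample-level concentrations."""
    per_subsample = concentration_by_group(rows, "subsample")
    sample_of = {row["subsample"]: row["sample"] for row in rows}
    grouped = defaultdict(list)
    for (subsample, category), value in per_subsample.items():
        grouped[(sample_of[subsample], category)].append(value)
    return {key: mean(values) for key, values in grouped.items()}
```

When volume is only known at sample level, `concentration_by_group(rows, "sample")` is the only meaningful call; the averaging helper applies when each subsample carries its own volume.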
The specification of volume also needs to be a formula, since it needs to be in m3 and is not always stored in m3.
Then, when the data is exported (in summary export, in dwca export), the formulas are read from the project settings and the concentration is computed according to those. The grouping level for the export is determined by the level at which volume is specified : sample if specified at sample, subsample or sample when the volume is specified at subsample level, with a default to subsample. The formula is not specified dynamically at export time so those API endpoints will need little modification.
http://vocab.nerc.ac.uk/collection/P01/current/CVOLUKNB/
The process is almost exactly the same as for concentrations, except that, instead of `1 / subsampling_coef`, one should get `object_volume / subsampling_coef`, where `object_volume` is in mm3 and computed using a formula that is also specified in the project settings, from variables at object, subsample or sample level (typically: area, px_2_mm, etc.).
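Since only the numerator changes, both quantities can share one aggregation routine. In this sketch `aggregate` and the row keys are hypothetical, and `object_volume` is assumed to have been computed beforehand in mm3 by whatever formula the project specifies.

```python
from collections import defaultdict


def aggregate(rows, numerator):
    """Sum numerator(row) / subsampling_coef / water_volume per (sample, category)."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["sample"], row["category"])
        totals[key] += numerator(row) / row["subsampling_coef"] / row["water_volume"]
    return dict(totals)


def concentration(rows):
    # Each object counts for 1 before subsampling correction.
    return aggregate(rows, lambda row: 1.0)


def biovolume(rows):
    # Each object counts for its own volume (mm3) instead of 1.
    return aggregate(rows, lambda row: row["object_volume"])
```

The result of `biovolume` is then in mm3 per m3, matching the unit of the CVOLUKNB rows shown earlier in the thread.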
[1] One could want to compute a weighted average, weighting by the water volume in each subsample, but let us not dive into this for now.
PS: Now that the back and front ends are separate, this issue could indeed be split between backend + API implementation and front-end UI implementation, but this is true of many others. I'll get to it, at some point (but in practice, when a feature is not exposed in the UI, it is absent for most users, so the two go hand in hand).
As a complement, the volume (hence the aggregation) should be at sample level for ZooScan, FlowCam; it should be at subsample level for UVP, ISIIS. It can be at both for IFCB depending how the sample/subsample are defined.
I think that the initial comment describing a UI is maybe now incomplete so I try here to summarize the potential additions:
The second point has more impact on the code, as presently the formulas are expressions, e.g. https://github.com/ecotaxa/ecotaxa_back/blob/8217ffe4820a8211f92068d0b8e2089e42f1816b/QA/py/tests/test_export_sci.py#L17. Should we need to add conditionals inside them, it would be trickier.
I am opening a shorter issue to summarize the entity-relationship model of the present topic, which will impact DB, API and UI.
There will be a simple transformation in the UI to send the values in API-compliant form; please look for 'bodc_variables' in the back-end source.
Add an "equation comment" field to explain the content of the equation.
Hello, do you mean a general help about the fields or a specific comment per formula, entered by the user, for the project?
The second: a text field to serve as a comment, entered by the user. Optional. Typically this will be used to explain the terms of the formula and give a bibliographic reference.
The summary and dwca exports can be in terms of abundances, concentrations, or biovolume.
Abundances #615 #626
For abundances, we already have everything we need.
Concentrations #616 #628
For concentrations, we need the `total_water_volume` and the `subsampling_coefficient`. Those are standard BODC terms.

We could standardise the name at import and store the data in hard-coded fields, like it is done for latitude, longitude, etc. But this will cause discrepancies between the data that is already there (without those fields) and the new data that is imported. It may also induce redundancy (store the fraction rate as 8 and the subsample coef as 1/8 = 0.125).
Instead, we decide to add a feature, at project level, that allows to specify which free fields correspond to the standard ones. In project settings, it should be a new section "Identification of standard fields" with two text fields with:
- `1/something`: to compute a subsampling coef in [0,1] from a subsampling ratio that is a power of 2 (1/32 for a subsampling at the 32nd)
- `something/1000`: to compute a volume in m3 from a volume in L

The `something` part is the name of a free field, at sample or subsample (or object) level. Ideally, it should be a sort of badge that one can pick / drag-drop / autocomplete from the list of fields valid for the current project, to avoid typos.

A UI could look like this
By default the subsampling coefficient is set to 1.
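One way to evaluate such user-entered formulas without calling `eval` is to walk the parsed expression and allow only arithmetic. This is a sketch under assumptions: `evaluate` and `subsampling_coef` are hypothetical helpers, field names are taken to be bare identifiers such as `sub_part`, and only `+ - * / **` and unary minus are accepted.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate(formula: str, fields: dict) -> float:
    """Evaluate an arithmetic formula, resolving names from `fields`."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.Name):
            return fields[node.id]  # KeyError if the field does not exist
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        # Calls, attribute access, subscripts, etc. are rejected.
        raise ValueError(f"unsupported syntax in formula: {formula!r}")

    return visit(ast.parse(formula, mode="eval"))


def subsampling_coef(formula, fields) -> float:
    """Apply the default of 1 when the project specifies no formula."""
    return evaluate(formula, fields) if formula else 1.0
```

For example, `evaluate("1/sub_part", {"sub_part": 32})` gives the 1/32 coefficient and `evaluate("tot_vol/1000", {"tot_vol": 271000})` converts litres to m3.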
It should be possible to import the settings from another project into the current one, to avoid having to re-specify the formulas every time.
Biovolume #617 #629
To compute biovolume, in addition to the fields necessary for concentration, we should have a field for the volume of individual objects.
This field is to be defined as above, in the same section.
The label should be "Individual object volume in mm3" (NB: there is no BODC term for this, for now) and then a formula interface. The classic formulas for this are
The help text should state: "This should specify a formula to compute the volume of each object in mm3. Classic formulae are: equivalent spherical volume = 4/3 * pi * (sqrt(area/pi) * pixel_size)^3; equivalent ellipsoidal volume = 4/3 * pi * (major_axis * pixel_size) * (minor_axis * pixel_size)^2"
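For reference, the spherical formula can be derived in two steps, consistent with the R check earlier in the thread. The symbols are introduced here for illustration only: $A$ is the object area in pixels, $p$ the pixel size in mm, and $a$, $b$ stand for `major_axis` and `minor_axis` exactly as they appear in the help text.

```latex
% Radius (in mm) of the disc that has the same area as the object
r = p \sqrt{\frac{A}{\pi}}

% Equivalent spherical volume (mm^3)
V_{\mathrm{sph}} = \frac{4}{3} \pi r^{3}
                 = \frac{4}{3} \pi \left( p \sqrt{\frac{A}{\pi}} \right)^{3}

% Equivalent ellipsoidal volume (mm^3), transcribed from the help text
V_{\mathrm{ell}} = \frac{4}{3} \pi \, (p\,a) \, (p\,b)^{2}
```

The ellipsoidal expression is the volume of a spheroid whose semi-axes are $pa$, $pb$, $pb$. The thread does not say whether the stored features are semi-axes or full axis lengths; if they are full lengths, each would need to be halved first.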