NCEAS / eml

Ecological Metadata Language (EML)
https://eml.ecoinformatics.org/
GNU General Public License v2.0

Consider adding documentation for how to deal with log-units #283

Open amoeba opened 6 years ago

amoeba commented 6 years ago

This came up on Slack today: what do we fill in when an attribute is a log-transform of another? The consensus was to create a custom, dimensionless unit called log{unit}, e.g., meter -> logmeter. I couldn't find any guidance on this in the docs, so I thought we might want to add a note or two so that a Ctrl+F for "log" or "transform" would at least turn something up for the curious user.

mobb commented 6 years ago

thanks, Bryce. maybe someone could mine the EML to see what people are already doing with log units. In general, we need guidelines for many dimensionless units, and attributes that are severely reduced.

amoeba commented 6 years ago

Good idea! This would take a bit of work but is fairly doable. A good stepping off point would be https://cn.dataone.org/cn/v2/query/solr/?q=attribute:*log*&fl=attribute
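
A minimal sketch of how that mining could start. It assumes the endpoint accepts the usual Solr `wt=json` and `rows` parameters, which this thread does not confirm. The regular expression is a hypothetical heuristic for separating log-transform names from words that merely contain "log"; it cannot tell a log-transformed variable from a fallen tree.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the query above; wt and rows are assumed standard Solr parameters.
SOLR_ENDPOINT = "https://cn.dataone.org/cn/v2/query/solr/"

# Hypothetical heuristic: "log", "log10", "log2", "loge" or "ln" standing alone as a token.
LOG_TOKEN = re.compile(r"(^|[^a-z])(log(10|2|e)?|ln)([^a-z]|$)", re.IGNORECASE)


def build_query_url(term="log", rows=100):
    """Build the attribute search URL for a wildcard term."""
    params = {"q": f"attribute:*{term}*", "fl": "attribute", "wt": "json", "rows": rows}
    return SOLR_ENDPOINT + "?" + urlencode(params)


def looks_like_log_transform(attribute_name):
    """Return True when the name contains a standalone log-style token."""
    return bool(LOG_TOKEN.search(attribute_name))


if __name__ == "__main__":
    print(build_query_url())
    for name in ["log10_biomass", "ln_mass", "logging_date", "catalog_id", "log diameter"]:
        print(name, looks_like_log_transform(name))
```

Names such as "log diameter" still pass the filter, so a manual review of the matches would be needed either way.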

mobb commented 6 years ago

Thanks for the query @amoeba . bummer that we don't have the semantics in place to remove the attrs that are about trees.

But to be perfectly correct, it seems to me that values resulting from log transforms are dimensionless: https://math.stackexchange.com/questions/238390/units-of-a-log-of-a-physical-quantity

But I think this recommendation fits what we see in environmental data: https://www.reddit.com/r/askscience/comments/1x09zc/what_happens_to_the_units_of_a_number_after/cf72xlk

So we should state that the log (or ln) is dimensionless, but the attribute description can state the original unit, which no longer has meaning - because you can't subtract or add the numbers as you originally would have.
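
A sketch of what that recommendation could look like for a single attribute, built with Python's standard library. The element names follow the EML attribute structure; the attribute name, the definition text, and the choice of an interval scale are illustrative assumptions, not settled guidance.

```python
import xml.etree.ElementTree as ET


def log_attribute(name, definition):
    """Build an EML-style <attribute> whose unit is the standard unit 'dimensionless'."""
    attribute = ET.Element("attribute")
    ET.SubElement(attribute, "attributeName").text = name
    # The original unit is recorded in the definition text only.
    ET.SubElement(attribute, "attributeDefinition").text = definition
    scale = ET.SubElement(attribute, "measurementScale")
    interval = ET.SubElement(scale, "interval")  # assumed scale for log values
    unit = ET.SubElement(interval, "unit")
    ET.SubElement(unit, "standardUnit").text = "dimensionless"
    domain = ET.SubElement(interval, "numericDomain")
    ET.SubElement(domain, "numberType").text = "real"
    return attribute


if __name__ == "__main__":
    element = log_attribute(
        "log10_wing_length",  # hypothetical attribute name
        "Base-10 logarithm of wing length; original measurements were in meters.",
    )
    print(ET.tostring(element, encoding="unicode"))
```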

amoeba commented 6 years ago

So we should state that the log (or ln) is dimensionless, but the attribute description can state the original unit, which no longer has meaning - because you can't subtract or add the numbers as you originally would have.

👍

mpsaloha commented 6 years ago

Hi,

I think there are some interesting points being discussed here, and I'm trying to straighten this out in my head...

For me:

  1. "Dimension" refers to the type of the (typically, physical) variable of interest-- e.g. Mass, Length, Time, etc

  2. "Measure" or "Measurement" refers to the quantification of a variable of interest, that presumably is_of some "Dimension" (although this can get murky once we depart from basic physical variables)

  3. "Units" are defined to serve as standards for comparability of "Measurements" within some "Dimension"

  4. A "Measurement" becomes comparable with other "Measurements" when its measured "Value" is expressed as a ratio to some fundamental, standard "Unit", e.g. a "Meter" (thus, saying something is "3.14 meters" is really like saying it is "3.14 times the length of the 1 meter standard", leaving aside for now quantum physics and speed of light issues relative to quantifying a meter)

  5. Thus, the Dimension of a Measurement is not changed simply because the scale for expression of its Value has been "Transformed" algorithmically (e.g. logarithmic transform). And in this case that transformation is reversible. The "alteration/transformation" was done on the "Value"-- and does not impact the "Dimension" (e.g. if Meter becomes LogMeter, Dimension remains Length)

  6. The Unit, however, must be restated to provide proper interpretation of the associated Value (hence, a log transform on measurements expressed in meters would have Units of LogMeter). Or maybe we need to have some "transformation Units" for the log/ln, trigonometric, hyperbolic, and other potential transforms, as these transformations can be applied to most any Unit, notwithstanding issues with ZERO or negative numbers.

This is what makes greatest sense to me. The logarithm is of the "Value" of the "Length", not the "Length" Dimension itself.

So the statement that "values are dimensionless" is true, but values are associated with Measurements quantifying some Dimension, that remains unchanged.

The fundamental nature (Dimension) of a variable of interest would not change simply because its associated measured values are transformed, whether linear (e.g. Meters -> Feet); or non-linear (Meters -> LogMeters).

I also wanted to reiterate that we should not confuse "Dimensions" in this context, with the use of "Dimensional Analysis" to cancel out units through some division or multiplication process, as (and we've often discussed this in the past) we often need to preserve the specific identity (type) of the variable of interest in many cases. Time is a special case that often can be cancelled out.

To circle back, I think the Dimension of LogMeter should be "Length" (my point #6 above)

cheers, Mark
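
The model in points 1 to 6 can be sketched as a small data structure. This only illustrates the position argued in this comment (the value is transformed, the unit label is restated, the dimension is kept); it is not an agreed EML convention, and the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str       # e.g. "meter"
    dimension: str  # e.g. "Length"


def log10_transform(measurement):
    """Transform the value, restate the unit label, leave the dimension untouched."""
    return Measurement(
        value=math.log10(measurement.value),
        unit=f"Log[{measurement.unit}]",
        dimension=measurement.dimension,
    )


if __name__ == "__main__":
    wing = Measurement(value=100.0, unit="meter", dimension="Length")
    print(log10_transform(wing))
```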

mobb commented 6 years ago

Tried to summarize the slack discussion. something like this, for the EML documentation (feel free to edit):

If an attribute is a log transform, it can be unitless ("dimensionless" is a standardUnit in EML). If it is useful to include a version of the original unit for labeling, the customUnit can reflect the original dimensions, e.g., "logMeter" or "lnPa". However, the definition for a customUnit for a transformed value (in STMML) should state that its relation to a parent is through an inverse transformation, and describe the transform, e.g., exp(x); STMML assumes simple arithmetic.

mpsaloha commented 6 years ago

This sounds good to me, though we should consider two things:

  1. "dimensionless" should not remain a standardUnit in EML, as a value can be "unit-less" (e.g. Box-Cox), but still represent a "dimension" (e.g. Mass, Length). I recommend we revise the name of "dimensionless" to "unit-less", to preserve the important distinction between Dimension and Unit

  2. We should remember there are some other common data transformations aside from Log/Ln, including (primarily) SqRt, CubRt, Arcsine, Reciprocal, Box-Cox, and Regression. So we might want to develop a general method to accommodate such cases.

mobb commented 6 years ago

Down-voting my own comment, above. Trying to cram all this into a single EML "unit" is a bad idea. Logs are dimensionless by definition, and a unit implies that certain operations can be performed, which is misleading. A better recommendation for describing a log measurement will be to use the annotation field.

mobb commented 6 years ago

comment from @mpsaloha regarding how to handle Units for TRANSFORMED DATA:

Interpretation of Units or Dimensions can be problematic after data are transformed for statistical purposes. Some transformations can be completely reversed to re-derive original values, although caution must be exercised if constants or other adjustments were made to the data beforehand. Expressing both the nature of the transform ("transform_type") and the original unit (if any) associated with a measurement can often provide invaluable information.

EML should recommend a convention for expressing transformed attribute values, e.g. transform_type[original_unit] and provide some standardized abbreviations for popular transformations, and mechanisms for constructing the above format.

Examples: Log[Meter] SqRt[Count] ...etc...

Transforms to consider for providing standardized prefixes in EML include: Log, Ln, SqRt, CuRt, Arcsin, Box-Cox

Construction of an EML customized unit, as proposed above, should not be taken to indicate that the "original unit" is still associated with the transformed value. Rather, it indicates what that original unit was, for improved evaluation of data for re-use, as well as the potential for implementing a reverse transformation to re-derive the original data (although this should be done cautiously).

@mpsaloha will write some text, after @mobb finds the spot.
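
A sketch of how the proposed transform_type[original_unit] labels could be built and read back. The prefix list is the one suggested above; the function names and the exact label grammar are assumptions made for illustration.

```python
import re

# Prefixes proposed in the comment above.
KNOWN_TRANSFORMS = {"Log", "Ln", "SqRt", "CuRt", "Arcsin", "Box-Cox"}

# Assumed grammar: a prefix, then the original unit in square brackets.
LABEL = re.compile(r"^(?P<transform>[A-Za-z-]+)\[(?P<unit>[^\[\]]+)\]$")


def format_label(transform, original_unit):
    """Return a label such as 'Log[Meter]' for a known transform."""
    if transform not in KNOWN_TRANSFORMS:
        raise ValueError(f"unknown transform: {transform}")
    return f"{transform}[{original_unit}]"


def parse_label(label):
    """Split a label into (transform, original_unit)."""
    match = LABEL.match(label)
    if match is None or match["transform"] not in KNOWN_TRANSFORMS:
        raise ValueError(f"not a transform label: {label}")
    return match["transform"], match["unit"]


if __name__ == "__main__":
    print(format_label("SqRt", "Count"))
    print(parse_label("Log[Meter]"))
```

The parsed unit records only what the original unit was; as noted above, it should not be read as a unit of the transformed value.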

mobb commented 6 years ago

@mpsaloha - There are two places where the documentation could be augmented:

  1. At the top-level documentation of eml-attribute: which currently shows up here: https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/index.html and https://knb.ecoinformatics.org/#external//emlparser/docs/eml-2.1.1/./eml-attribute.html

Content is in this file (second paragraph, section starting approx line 70): https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-attribute.xsd

  2. and in the documentation for <customUnit> itself: same file, approx line 900. There is no URL directly to that documentation; all you can do is go to the attribute.html page (above) and search in page for customUnit. It's about half way down.

If you want, put the text here and I'll add it, since I have that file out now.

mbjones commented 6 years ago

While I am fine with clarifying the math behind the use of logs, sin, exp, and other transcendental functions, I would like us to be clear that it is not possible mathematically to take the log of a dimensioned quantity with units. The idea of a "log meter" is nonsensical mathematically. Rather, people often use a shorthand that assumes the arguments to transcendental functions have first been made dimensionless before the function is evaluated. There are numerous explanations of this on the web. Here are a couple of decent ones, the first of which is the most comprehensive, and points out that several popular internet sites like Wikipedia have promulgated mistakes in some of the math, including the use of the Taylor expansion as justification one way or the other:

The math stack exchange site also trots out some of these erroneous explanations. A simple and intuitive way to show that log(10 grams) is nonsensical is to see what happens to it when expanding it. Take the definition of the log function (using base 10 log as an example, but it's true for all bases): y = log(x) if x = 10^y. Then examine the following expansion:

log(10 grams) = log(10 * 1 gram)
              = log(10) + log(gram)
              = 1 + log(gram)

From the paper linked above, then to calculate log(gram) one must ask oneself "what is the exponent y (a number) to which one should raise the base b, that will yield gram(s)?" There is no such number, as gram is not a number.

The way textbooks get away with using dimensioned numbers as arguments to transcendental functions is to (implicitly) divide by a reference constant first (e.g., ln(3 m) is really ln(3 m/1 m) to make the units cancel, which is ln(3)). All arguments to transcendental functions must be dimensionless numbers, even though sometimes people don't make that explicit.
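
A short numerical illustration of the implicit reference constant. The function name is hypothetical; the point is that the same length gives a different logarithm depending on which reference it is divided by, so the reference has to be recorded somewhere.

```python
import math


def log10_ratio(value, reference):
    """Log of value/reference; both must be in the same unit so the ratio is a pure number."""
    return math.log10(value / reference)


if __name__ == "__main__":
    length_m = 100_000.0  # a length of 100 km, expressed in meters
    relative_to_1_m = log10_ratio(length_m, 1.0)      # reference: 1 m
    relative_to_1_km = log10_ratio(length_m, 1000.0)  # reference: 1 km = 1000 m
    print(relative_to_1_m, relative_to_1_km)
    # The two results differ by log10(1000), the log of the ratio of the references.
    print(relative_to_1_m - relative_to_1_km)
```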

So, if people want to make a new STMML definition for logmeter in EML as another name for dimensionless and that has unitType=dimensionless then that is fine. It would clarify that the original unit was meter. But let's not imply that the value of a transcendental function has a unit. It is a pure number, and does not have units.
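
A sketch of such a definition, generated with Python's standard library. The namespace URI is an assumption and should be checked against the STMML schema shipped with the EML release in use; the id and the description wording are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; confirm it against the STMML schema shipped with your EML release.
STMML_NS = "http://www.xml-cml.org/schema/stmml-1.1"


def log_unit_definition(unit_id, original_unit):
    """Build a <stmml:unit> that is dimensionless but records the original unit."""
    ET.register_namespace("stmml", STMML_NS)
    unit = ET.Element(
        f"{{{STMML_NS}}}unit",
        {"id": unit_id, "name": unit_id, "unitType": "dimensionless"},
    )
    description = ET.SubElement(unit, f"{{{STMML_NS}}}description")
    description.text = (
        f"Base-10 logarithm of a value originally expressed in {original_unit}, "
        f"taken after dividing by 1 {original_unit}. A pure number, not a unit of measure."
    )
    return unit


if __name__ == "__main__":
    print(ET.tostring(log_unit_definition("logMeter", "meter"), encoding="unicode"))
```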

mpsaloha commented 6 years ago

Hi Matt,

I agree with your point, and hopefully it is completely clear that I don't think anybody (much less me) in this discussion has been advocating that a log-transformed value "retains its original unit". And I think we also agree that logarithmic values in general can have units and dimensions, e.g. decibels, pH, and astronomical magnitude do...And those all involved logarithmic transformations of some measured physical quantity, that is supposed to make them unit-less according to some folks-- but apparently we can then usefully "invent" Unit names to associate with values of log-transformed data of specific types; aha precedents!

You suggest:

So, if people want to make a new STMML definition for logmeter in EML as another name for dimensionless and that has unitType=dimensionless then that is fine. It would clarify that the original unit was meter. But let's not imply that the value of a transcendental function has a unit. It is a pure number, and does not have units.

I think you are suggesting as a solution:

unitType=logmeter

and

unitType=logmeter === unitType=dimensionless

I guess that will work since the key thing I am concerned about is knowing those original Units, and it seems you are okay with that. I think you are betraying some of the mathematical arguments you cite, however-- e.g. Matta et al. or the stackexchange advocates for "dimensionless-ness" of log-transformed data. Once you lose those Units through the dimensional analysis necessary to "permit" taking logarithms (LOG FUNCTIONS MUST HAVE UNIT-LESS ARGUMENTS!!), aren't they "gone"? :-)

I prefer, however, the syntax of Log[meter] rather than "logmeter", as the latter seems to have stronger connotations that it is, well, referring to a chimerical "log-meter"...

Use of brackets also more clearly separates the name of the transformation from the original unit in which the data were represented.

Finally, I am still concerned about our synonymizing "unitless" with "dimensionless". I don't think these are the same thing. "Dimensions" describe the physical variable measured. Thus, while log-transformed measurements of wing-length might be unit-less, I would argue they retain their dimension of "length". If it is possible to revise EML to accommodate this distinction, I think it would be well advised.

cheers, Mark

mbjones commented 6 years ago

You wrote:

while log-transformed measurements of wing-length might be unit-less, I would argue they retain their dimension of "length"

This is where we might diverge. My reading has led me to think that log-transformed values are indeed dimensionless -- e.g., they no longer represent a 'length' -- and rather now represent a pure numerical value. Here's the relevant quote from the Matta et al. paper:

By definition, transcendental functions such as logarithm (to any base), exponentiation, trigonometric functions, and hyperbolic functions act upon and deliver dimensionless numerical values.

And also you wrote:

logarithmic values in general can have units and dimensions, e.g. decibels, pH, and astronomical magnitude do

I think pH, dB, the Richter scale, and other similar indices are dimensionless and do not have units. The Matta et al. paper also uses pH as an example, where they show that it is the log of a ratio of concentrations for which the dimensions and units cancel. Similarly with dB.
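
The pH case can be written out as a sketch, assuming a standard reference concentration of 1 mol/L as the divisor that cancels the unit. The constant and function names are hypothetical.

```python
import math

# Assumed reference concentration; dividing by it cancels mol/L.
STANDARD_CONCENTRATION_MOL_PER_L = 1.0


def ph_from_concentration(h_concentration_mol_per_l):
    """Negative base-10 log of a concentration ratio, which is a pure number."""
    ratio = h_concentration_mol_per_l / STANDARD_CONCENTRATION_MOL_PER_L
    return -math.log10(ratio)


if __name__ == "__main__":
    print(ph_from_concentration(1e-7))
```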

cboettig commented 6 years ago

I believe that the logarithm of length in meters would technically be considered a level measurement, that is, of type "level" or "level difference" rather than of type "length" or of type "dimensionless".

mpsaloha commented 6 years ago

Hi Matt and Carl,

Carl-- thanks for finding that. But did you notice that in the link you provided they refer to decibels under "Units of Level"?

Matt-- yes, I read the Matta et al. paper and liked their argument for dismissal of the Taylor expansion as "proof" of dimensionlessness in the case of logarithms, but noted that they also never mentioned the issue of inverse transformations in the case of logarithms-- which is a common use case.

So, we don't see eye-to-eye on several things here:

in general, what a "dimension" represents as opposed to a "unit"-- I don't think "dimensionless" is the same as "unitless"-- while a measurement value with its unit allows us to infer its dimension, the reverse is not true (a measured value with its dimension does not allow us to infer its unit-- as we well know from under-specified metadata! "Body weight of 5": dimension of Mass; units of ??)

that if one log-transforms a set of wing-lengths (e.g. measured in cm) they become pure numbers, so the inverse transforms of those pure numbers are also pure numbers (i.e. the dimension (length) and unit (cm) of those measurements are irretrievably lost. Note that analysts routinely re-derive original values and their associated units from statistically transformed variables-- how is this defensible if log-transforms are "pure numbers"?)

that pH, dB and other logarithmic scaled measurements are unitless. For example, I'd assert that 10 is unitless, but that 10dB has a unit of decibel, which is a measurement of the log ratio of amplitude of two "sounds" (air pressure levels) or other energy sources. If you want to call 'dB' (as an example) something other than a unit, maybe we need to invent a new category-- "unitless standard" for these standard names for interpreting and comparing quantitative values along some scale (which coincidentally is the primary function of those thingies we call "units"). So, regardless of what we call these, I think we should retain them somewhere in the metadata, rather than letting them drift away in pure number bliss.

Also, we are promoting different notions of "dimensionlessness"-- yours having more to do with dimensional analysis, and mine more regarding semantics. E.g. if one has 100Kg of antelopes per 5Kg of Lions, I'd say the dimensions are "Mass"; whereas you (I think) would say this ratio is dimensionless.

cboettig commented 6 years ago

@mpsaloha Yeah, decibels are a particularly interesting case. Apparently decibel is technically the log ratio of any measurement, so arguably the 'units' of logarithm of length could be decibels! Wikipedia suggests (https://en.wikipedia.org/wiki/Decibel) that the convention is to put the unit following decibels, so decibels of log voltage would be dBV. (Ironically, dBm apparently refers to a log relative to milliwatts; sorry, meters.) Apparently the SI standard opposes this convention (https://en.wikipedia.org/wiki/Decibel#Suffixes_and_reference_values).

To make this more confusing, decibels are defined differently for power-type units and "field" (now called "root-power") type measurements, where it is typical to square the values before taking the ratio (equivalently, multiplying the log by 2), see: https://en.wikipedia.org/wiki/Field,_power,_and_root-power_quantities
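
The two decibel definitions can be compared in a short sketch. Function names are hypothetical; the factors of 10 and 20 are the standard ones for power and root-power quantities.

```python
import math


def db_power(power, reference_power):
    """Level of a power-type quantity: 10 * log10 of the ratio."""
    return 10.0 * math.log10(power / reference_power)


def db_root_power(amplitude, reference_amplitude):
    """Level of a root-power (field) quantity: 20 * log10 of the ratio."""
    return 20.0 * math.log10(amplitude / reference_amplitude)


if __name__ == "__main__":
    # Power scales with amplitude squared, so doubling the amplitude
    # (four times the power) gives the same level under both definitions.
    print(db_root_power(2.0, 1.0), db_power(4.0, 1.0))
```

In both functions the argument to the logarithm is a ratio of like quantities, so whatever unit the inputs carry cancels before the log is taken.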

so decibel-meters, anyone?

Not sure I'm helping. pH is a little cleaner as technically it's already defined as the log of H+ activity, which is already defined as a dimensionless measure, so the use of logs does not imply the need for a reference scale.

There is some argument that these log-scaled units are quantities we tend to think of in percentage/multiplicative terms anyway, and measure in log-scale units....

mpsaloha commented 6 years ago

Hi Carl,

Yes, these issues aren't trivial, but sometimes I feel like we are dealing more here with Zeno's Achilles and Tortoise paradox rather than anything else. Alternatively, are we flogging a dead something, and simply not reaching consensus on what it should be called? :-)

Do we at least agree that qualifying log-transformed values (must be "pure numbers"!)-- as "decibels", "dB" or "dBV", or "pH" (certainly in common understanding at least a scale if not a reference scale?!)-- with these standard "unit-like suffixes", can enrich our interpretation of log-based, dimensionless, "pure numbers"?

(I'm not going to mention again the thought experiment about how an inverse transform on log-transformed data could enable us to "regain" original units from allegedly "pure numbers", but only if we somehow preserve the information about those original units)

By the way, I subsequently discussed this issue with my brother-in-law who is a math professor, and he suggested, after paragraphs about derivatives and various scenarios involving transcendental functions, that this was more a "kind of a philosophical problem" rather than a call for mathematical purity. Admittedly, he is a low dimensional topologist, so this area is not his expertise-- but he has taught advanced calculus for 30 years. He suggested the view that "units" are useful by convention, and that we might consider adopting some (notational) convention ourselves, and explaining it well. Which is more or less along the lines of what I've been advocating, rather than LOSING invaluable information about those raw values being quantified in {meters, Kilograms, Counts, etc} because one can only take their logarithms after their units are dimensionally cancelled out and "lost"(?) to yield a dimensionless number. This approach does call for care from the "data re-user" determining how those values can be algebraically combined with other measurements. But there are also some common, straightforward use cases (e.g. again, the possibility of reacquiring units from an inverse transform on a log value).

So even if we agree that these buggers might have "log-scale units", I think in many cases it will also be useful to know the original units on the measured value that was log-transformed. I've suggested some ways to do that -- e.g. Log[Kg] seems to be a standard syntax for communicating relevant information in a graphical axis label that 1) the values are log-transformed and 2) the original measurements were taken in Kg. I think this is quite a common and highly interpretable way of representing the nature of the data values for this particular use case.
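
The inverse-transform use case can be sketched as follows, assuming the metadata records both the transform and the original unit. All names are hypothetical, and "Log" is taken to mean base 10. The back-transformed numbers are only meaningful in the original unit if the reference used before taking the log was 1 of that unit.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformedColumn:
    values: tuple
    transform: str      # "Log" (assumed base 10) or "Ln"
    original_unit: str  # unit of the data before transformation, e.g. "Kg"


# Inverse function for each supported transform.
INVERSES = {
    "Log": lambda x: 10.0 ** x,
    "Ln": math.exp,
}


def back_transform(column):
    """Return the back-transformed values together with the recorded original unit."""
    inverse = INVERSES[column.transform]
    return [inverse(v) for v in column.values], column.original_unit


if __name__ == "__main__":
    column = TransformedColumn(values=(0.0, 1.0, 2.0), transform="Log", original_unit="Kg")
    print(back_transform(column))
```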

Well, sometimes it is useful to repeat arguments in different ways to hopefully add clarity or come closer to consensus. At this point, though, I'm not sure what or who will determine our way forward, although Matt, you, and Margaret have certainly done the lion's share of work on the EML revision.

thanks! Mark

mobb commented 6 years ago

Summary from @mpsaloha via email: "BTW-- after getting input from two mathematicians (Profs), both more-or-less agree with me: log is a transform on the value, and not the unit. And the unit should be preserved (somehow) for utility-- such as when do inverse transform."

It's the 'somehow' that we want to explain in the EML documentation. my opinion:

brunj7 commented 6 years ago

Mark mentioned this thread to me --- so here are my 2 cents:

I do not agree that a log transform of a number removes either its associated unit or its dimension. If the number is a number of something, the log of this number is still of something.

100 km = 10^2 km = 100,000 m= 10^5 m = log(10^5) m = 5log10 m = 2log10 km

It is important since you can invert the transformation and get the original number (of something or not) back.

So the unit is still the same after a log transform, but we need to find a way to save the information that the stored values in the data file are in a log scale.

mbjones commented 5 years ago

@brunj7 Your "equation" commits the fundamental mistakes that are outlined in the Matta et al. paper (https://pubs.acs.org/doi/pdf/10.1021/ed1000476) that I linked to in my comment above. I suggest that a deep read and understanding of that paper is required before we can make headway on this issue. I propose that we remove this issue from the EML 2.2 release given that we have not reached consensus in the last year and a half on the issue. I will bump this issue to the 3.0.0 milestone unless others object and can show a mechanism for consensus to be very quickly reached.

mobb commented 5 years ago

related to #323