JuliaEarth / CoDa.jl

Compositional data analysis in Julia
MIT License
59 stars 8 forks source link

Handling zero parts in a composition #9

Closed juliohm closed 3 years ago

juliohm commented 4 years ago

As discussed in a recent PR, it seems that we should think more seriously about the corner case when a part of a composition is zero. Various operations with log fail, and one possible solution is to always add a eps(). Alternatively, we could consider adding this eps() at the composition constructor.

cc: @mralbu

mralbu commented 4 years ago

I have attempted to address this question at constructor-zero-plus-eps. I have also added the closure operation to the Composition constructor. This behaviour mimics the acomp function of the R compositions package. Do you think these are sensible propositions?

Behaviour:

julia> Composition(0, 1, 2)               
   part-1 ┤ 7.401486830834377e-17                    
   part-2 ┤■■■■■■■■■■ 0.3333333333333333             
   part-3 ┤■■■■■■■■■■■■■■■■■■■■ 0.6666666666666666 

Benchmarks: Original (without checking for zero valued components)

julia> @benchmark Composition(Tuple(i for i in 0:99))
BenchmarkTools.Trial: 
  memory estimate:  32.55 KiB
  allocs estimate:  514
  --------------
  minimum time:     35.302 μs (0.00% GC)
  median time:      36.345 μs (0.00% GC)
  mean time:        39.184 μs (3.37% GC)
  maximum time:     2.344 ms (97.40% GC)
  --------------
  samples:          10000
  evals/sample:     1

Proposed:

julia> @benchmark Composition(Tuple(i for i in 0:99))
BenchmarkTools.Trial: 
  memory estimate:  37.48 KiB
  allocs estimate:  525
  --------------
  minimum time:     42.989 μs (0.00% GC)
  median time:      44.125 μs (0.00% GC)
  mean time:        47.095 μs (2.67% GC)
  maximum time:     2.223 ms (96.87% GC)
  --------------
  samples:          10000
  evals/sample:     1
juliohm commented 3 years ago

It would be nice if we could research what other projects have done regarding this corner case with zero parts. Should we take a look at R compositions and similar Python projects before we commit to a specific approach?

On a second thought, I think it would be useful to preserve the zero parts and handle them explicitly when needed inside the operations with algorithm. What do you think?

Thanks for pushing this forward.

On Sat, Sep 19, 2020, 23:16 mralbu notifications@github.com wrote:

I have attempted to address this question at constructor-zero-plus-eps https://github.com/mralbu/CoDa.jl/blob/constructor-zero-plus-eps/src/composition.jl#L45 . I have also added the closure operation to the Composition Closure. This behaviour mimics the acomp function of the R compositions package. Do you think these are sensible propositions?

Behaviour: julia> Composition(0, 1, 2) part-1 ┤ 7.401486830834377e-17 part-2 ┤■■■■■■■■■■ 0.3333333333333333 part-3 ┤■■■■■■■■■■■■■■■■■■■■ 0.6666666666666666

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaEarth/CoDa.jl/issues/9#issuecomment-695511900, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZQW3LKGWMGV5OSLTSFK6DSGVQZVANCNFSM4RLBHCJA .

mralbu commented 3 years ago

The R compositions package use different S3 classes and single dispatch methods for different types of compositional data rcomp, acomp, rplus, aplus, ccomp and rmult. Just as a disclaimer, the experience and applications I have with the R compositions package only deal with the acomp relative compositional class. Missing values are treated differently depending on their root cause, which are classified according to the classes described in missing.compositions. BDL, bellow detection limit, and SZ, structural zero, are some of the important cases. Zero replacing functions are not applied by the constructor acomp.

library(compositions)

x = c(0.0, 0.2, 0.8)

acomp(x)
[1]  BDL  0.2  0.8
attr(,"class")
[1] acomp

Applying inverse transforms on transformed data may result in inconsistent behavior.

ilrInv(ilr(acomp(x)))
[1] 0.2857143 0.1428571 0.5714286
attr(,"class")
[1] acomp

Which may be prevented using the function compositions::zeroreplace

ilrInv(ilr(
  compositions::zeroreplace(acomp(x), .Machine$double.eps)       
))
[1] 1.480297e-16 2.000000e-01 8.000000e-01
attr(,"class")
[1] acomp

scikit-bio also offers a multiplicative_replacement function for zero replacement.

Do you think a zeroreplace! function, manually applied, would be a better design choice? What are your thoughts on the available design options?

juliohm commented 3 years ago

First off, that is a really nice review. Thanks for putting the effort and sharing these references.

The R compositions package use different S3 classes and single dispatch methods for different types of compositional data rcomp, acomp, rplus, aplus, ccomp and rmult.

I find the R design quite "tabular" and not very helpful to compose with other projects. The R view of the world usually starts with a table, but CoDa.jl tries to build compositions in a principled way without tables involved. This allows us to have tables where a single column is a composition and where other columns are completely different things.

Just as a disclaimer, the experience and applications I have with the R compositions package only deal with the acomp relative compositional class.

I feel the same. I don't have enough experience in compositional data analysis to argue against their multiple classes, but it seems to me that the whole point of sticking to Aitchison's geometry goes away when we start to relax the ratio-scale and the total-ness of the compositions. That is why when I created this package I only considered a single composition type with all the benefits of Aitchison's work. I may be proven wrong in the future, and would love to learn the reasons why people need the other classes.

So if we concentrate our attention to the acomp type, which matches our Composition type, it seems that they do two things to handle zeros. First, they make the constructor "detect" the presence of values below a threshold by default (the BDL). Second, they use a theory of proportional missing to handle these missings. From your second link:

Martin-Fernandez, J.A., C. Barcelo-Vidal, and V. Pawlowsky-Glahn (2003) Dealing With Zeros and Missing Values in Compositional Data Sets Using Nonparametric Imputation. Mathematical Geology, 35(3) 253-278

They also provide a mechanism to convert back and forth between zeros that were interpreted as missing by default (BDL) and a small value that is not zero and that wouldn't trigger errors in downstream calculations.

Based on this review, I think we should consider dealing with missing values more explicitly in CoDa.jl. I am still digesting the information before proposing an improvement, but please let me know if you have a proposal already that you would like to share. From what I understood, their different "missing" classes are just an artifact of supporting multiple composition classes. If we stick to the Composition type we have here we could handle a single missing type, which is Julia's missing.

What is not clear to me yet is the conversion between zero "missing" and the small value obtained with zeroreplace. If it is something introduced just to not overflow the calculations with logarithm, then we could handle the machine eps in the calculations directly without changing the constructor and without introducing a function just for that. However, if it is something more fundamental, then we need to brainstorm a bit more before committing to a design.

I will try to handle #8 during the week, and will also try to add support for missing components there. That is, I will change the internal representation of the composition to a named tuple where the parts can be missing. That will be a start before we properly handle missing values.

mralbu commented 3 years ago

One case that may be of particular interest for geostatistical computations is that of the Structural Zero, where a zero component really may have to be represented as zero. For example, to interpolate facies proportions of rock samples acquired in coordinates (x, y), some of the samples may indeed have zero percentage of a given or several facies. For this zero component type, which I think fits the description of Structural Zero, the representation of components as being missing may not represent the real situation that that component is not present.

juliohm commented 3 years ago

Yes, I think we have two cases: an actual zero that we need to carry on as is and apply the eps-strategy inside the operations, and an actual missing informed by the user directly as Composition(a=1,b=2,c=missing) or read from a CSV with readcoda where the user specifies the string of missing values as NA, NaN or any other symbol of interest.

juliohm commented 3 years ago

@mralbu I've invited you to the organization so that we can improve on these details more quickly. Appreciate if you can read the guidelines and make your membership public :+1:

juliohm commented 3 years ago

I've handled zero parts in the master branch following the eps strategy you've introduced with the ilr transformation.

Besides the zeros, we can create compositions with missing parts now:

Composition(a=1.0, b=2.0, c=missing)

We can brainstorm imputation of missing parts in a separate issue. It will probably just be a pre-processing step on a collection of composition objects. Thanks for the help so far and for pushing the use case. Looking forward to seeing it working.