cran-task-views / ctv

CRAN Task View Initiative

CRAN Task View Proposal: CompositionalData #58

Open statlink opened 10 months ago

statlink commented 10 months ago

Hello,

I would like to propose a new CTV named CompositionalData. The CTV is about packages dedicated to compositional data analysis. The relevant GitHub repository is https://github.com/statlink/CompositionalData

Michail Tsagris

dutangc commented 10 months ago

Dear Michail, I think your proposal is excellent, and compositional data analysis deserves a CRAN task view. Christophe

pkR-pkR commented 10 months ago

The list prepared by Michail includes all packages I know on compositional data analysis and even more. It is really a comprehensive list. I support it.

zeileis commented 10 months ago

Michail, I agree with the others, very nice proposal! I will read a few things in more detail but a couple of quick comments:

statlink commented 10 months ago

Achim thanks for your nice comments.

zeileis commented 10 months ago

Patrice is definitely a good addition, welcome on board. Additionally, it would be good to increase diversity a bit and maybe find two more co-maintainers, ideally a woman and/or someone from a different region, field, application area, etc.

For the links: DOIs will be more persistent and always resolve to the journal links (which may change over time). arXiv also added DOIs recently.

statlink commented 10 months ago

Ok, in that case I will add Christophe as well. He is an expert in CTVs and from a different field, but not a woman. I will replace the links with DOIs everywhere later today. What shall I do about the books?

matthias-da commented 10 months ago

Dear all,

You are going forward too fast for me. May I read the proposal first and give some suggestions (also regarding possible maintainers) until tomorrow afternoon?

Thank you

Best Matthias

zeileis commented 10 months ago

Sure, Matthias, that's perfectly fine. The other CRAN Task View Editors haven't reacted yet either.

So we're still in the review process stage and are still collecting feedback.

zeileis commented 10 months ago

Re: Michail.

Christophe is, of course, a great collaborator... but he is already the principal maintainer of two task views. So, if possible, I would ask you to reach out to other people in order to distribute the workload better. Moreover, it would also be good to team up with people who really bring in a different perspective and who might be aware of packages, activities, etc. that you don't know yet. So I would encourage you to think about potential co-maintainers and then reach out to them.

pkR-pkR commented 10 months ago

Achim,

I cannot speak for Christophe, but for my part I have fairly good experience with the R packages related to compositional data, since I work in this field and have tested the top 10 packages. Thanks to my RWsearch package, detecting new packages on CRAN is easy; it allowed the Distribution task view to expand from 150 packages in 2018 to 250+ packages in 2023. RWsearch also detected the new isopleuros package (strange name!), which appeared on 2023-05-16 and is still in version 1.

The CoDa community is rather small and the number of packages will not grow much. If Matthias agrees to co-maintain the task view, we will be 3 people (excluding Christophe). As for women, we would need to contact them one by one. My idea is to make a call at the next CoDa meeting in July 2024 and ask for a (female) volunteer.

https://www.coda-association.org/en/coda-info/news-info/coda-book-applied-compositional-data-analysis/

Patrice

zeileis commented 10 months ago

Patrice, thanks for the input! Two quick comments:

statlink commented 10 months ago

Achim hello.

zeileis commented 10 months ago

Thanks for the DOIs!

Regarding the maintenance: Maintainers should pick up new and interesting packages, tutorials, etc. that are relevant for the task view. So it's good to have people with different backgrounds (scientific field, methodology vs. applications, geographical region, etc.) who follow what is going on in the R world from their perspective. Then they will notice different relevant innovations.

statlink commented 10 months ago

In that case, can I ask someone from the bioinformatics field to join us? Further, how can we add Bioconductor packages, if I find any?

zeileis commented 10 months ago

Bioinformatics sounds good to me. But maybe wait for Matthias in case he has further suggestions.

Adding Bioconductor packages can be done via bioc(...). See: https://github.com/cran-task-views/ctv/blob/main/Documentation.md#main-text

statlink commented 10 months ago

Thanks Achim, I will add more packages from there if I find any.

tuxette commented 10 months ago

Sorry to join the conversation so late! Thanks for the proposal, which is indeed very useful, @statlink. I agree that the "Bioiformatics/ecology related packages" section (be careful: there is a typo) could maybe be improved. The use of methods that explicitly account for the compositional nature of the data is standard in metagenomics, and this could be a subsection of this part (I can help you find some packages of interest in addition to the ones that you already cited). For other types of omics, such as sequencing data in general, it is less standard but sometimes useful for some tasks (for instance, the Bioconductor package coseq is one of them). The maintainer of this package (Andrea) might be someone to contact to get some help (I think that biostatisticians are more likely to have interesting clues than bioinformaticians in this area, but I might be wrong).

tuxette commented 10 months ago

And, in addition, I am far from being an expert, but some omics data obtained from spectrometers (proteomics, metabolomics) are also often compositional (you cite some packages related to this in your current proposal), and in my opinion this should be a different subsection (because the reason for the compositional nature of the data is very different from that in metagenomics and other kinds of sequencing data).

statlink commented 10 months ago

Tuxette hi. I am not an expert either, and I included all of them in one section because this is a different field to mine.

tuxette commented 10 months ago

I can help you sort this section (if you agree of course). I'll try to do that next week if that works for you?

statlink commented 10 months ago

We need Achim to agree with this also. Because in that case you would have to be a co-maintainer.

zeileis commented 10 months ago

Nathalie @tuxette is a CRAN Task View Editor - like myself. And we help to improve task view proposals while they are under review, so that we can eventually approve them. (See also the proposal guidelines.)

statlink commented 10 months ago

Achim I am happy if she joins us alongside Patrice.

matthias-da commented 10 months ago

Dear all

Thanks for this initiative. It is really great to see so many (new) packages in this field, and you did a great job finding and listing them. However, I am afraid that such a CTV needs (a lot of) further discussion.

In short: I think the current version of the task view needs rewriting almost from scratch. And I strongly recommend asking people from the inner core of the field of compositional data analysis to participate. So, in my view, a CTV needs a mixture of enthusiastic guys and guys from the inner circle of the CoDa community.

In long:

  1. Theoretical aspects: I do not agree with the first sentence: "Compositional data are positive multivariate data where the sum of the values of each vector sums to the same constant," since this is a very old and outdated view of the topic from the 90s, and it is not needed. Think of household expenditures or chemical concentrations in ppm of soil samples. Each composition can thus have a different constant, and closing to an arbitrary constant like 1 or 100 is not needed, since any multiple of a composition belongs to the same equivalence class. I thus recommend writing instead: "Compositional data are positive multivariate data where the sum of the values of each vector sums to a whole." or similar.

I also disagree with the second sentence: "The most popular approach is to use the logarithm transformation applied to ratios of the variables, initially suggested by Aitchison (1982). However this approach has drawbacks and for this many alternative transformations have been developed throughout the years.", since you most probably meant the additive log-ratio and centered log-ratio, and by "many" you most probably mean only the isometric log-ratio transformation (plus some "exotic" ones), since most of the other power transformations do not fulfill the principles of compositional data analysis. I would thus recommend: "The most popular approach is to apply a log-ratio analysis, initially suggested by Aitchison (1982)".

From these, you may see why I propose that at least one guy from the inner circle of CoDa should be included in the task view. This could be, e.g., Karel Hron or, in case you need a woman, Kamila Facevicova. They could be the CoDa police ;-)

  2. It is not good practice to list your own package, which I did not even know until now, in first place in the list of packages. I think the packages that are used most, that are well known in the community, and that include the widest variety of methods could be listed first.

  3. I think the general purpose ones are: compositions, robCompositions, Compositional, easyCODA. (zCompositions is special; robCompositions always includes non-robust and robust alternatives and a lot of non-robust methods, e.g. for the analysis of compositional tables.)

  4. The robust ones are (listed in order of methods provided): robCompositions, complmrob, robregcc, rrcov3way.

  5. Some topics are missing, e.g. the problem of rounded zeros and structural zeros, which is one of the most important problems for practitioners in the field. For example, all data from chemometrics, biomics and chemistry come with rounded zeros. The packages zCompositions and robCompositions can be listed on this subject.

  6. I would recommend structuring "Other packages" and "Bioinformatics/ecology related packages" into more specific fields, e.g. Biomics, Chemometrics, Ecology/Biology, and also giving some ideas on how to deal with high-dimensional data.

  7. It would be good to include somebody from among the guys behind the main packages, because they play a central role in the community and have pushed the topic in recent years. This could be Raimon Tolosana-Delgado from the compositions package, it could be myself from the robCompositions package, or it could be Javier Palarea-Albaladejo from the zCompositions package. All these guys are well connected with the compositional community.

  8. A more complicated question is whether one should categorize the methods in packages according to whether or not they fulfill the three principles of a compositional analysis. For example, the Dirichlet distribution and Dirichlet regression are helpful in many situations, but they are not subcompositionally coherent, which can cause trouble: dependencies between parts are not modelled in a compositional sense, and results can contradict those obtained from a subcomposition that does not take all parts into the analysis. I have no answer to this question of distinguishing compositional methods that fulfill the main principles of CoDa from those that relax or violate some of the principles (sometimes for good reasons), but I want to point out that you may think about this matter.

  9. The description of packages is often a 1:1 copy of the package description, and thus much too long. As an example: r pkg("ArArRedux"): Processes noble gas mass spectrometer data to determine the isotopic composition of argon (comprised of Ar36, Ar37, Ar38, Ar39 and Ar40) released from neutron-irradiated potassium-bearing minerals. Then uses these compositions to calculate precise and accurate geochronological ages for multiple samples as well as the covariances between them. Error propagation is done in matrix form, which jointly treats all samples and all isotopes simultaneously at every step of the data reduction process. Includes methods for regression of the time-resolved mass spectrometer signals to t=0 ('time zero') for both single- and multi-collector instruments, blank correction, mass fractionation correction, detector intercalibration, decay corrections, interference corrections, interpolation of the irradiation parameter between neutron fluence monitors, and (weighted mean) age calculation. All operations are performed on the logs of the ratios between the different argon isotopes so as to properly treat them as 'compositional data'. --> 1:1 from CRAN https://cran.r-project.org/web/packages/ArArRedux/index.html In my view, it is not the aim of the CTV to copy and paste package descriptions, but to shorten them and convey only the main message.

  10. Instead of a section on ternary diagrams (which are of limited use), I would recommend either deleting this section or making a section "Visualisation" (and thinking hard about which other visualizations should be listed), where ternary diagrams are only one of the visualizations.

I am sorry to be so critical. Despite the criticism, I really look forward to such a task view, but my impression is that the current version needs a lot of discussion and rewriting, and also needs people from the inner circle (e.g. some of those I mentioned in my points (1) and (7)).
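The equivalence-class remark in point 1 can be illustrated with a small sketch. This is plain Python written for this thread, not code from any of the packages discussed; the function names `closure`, `clr` and `alr` and the sample values are assumptions chosen for illustration. It checks that rescaling a composition (for instance from ppm to percentages) leaves its log-ratio coordinates unchanged, which is why closing to a fixed constant is not required.

```python
import math

def closure(x, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def alr(x):
    """Additive log-ratio, using the last part as the reference."""
    return [math.log(v / x[-1]) for v in x[:-1]]

# Hypothetical soil sample: three concentrations in ppm (not closed).
soil_ppm = [120.0, 30.0, 850.0]
# The same composition expressed as percentages.
as_percent = closure(soil_ppm, 100.0)

# Both representations give the same log-ratio coordinates.
same_clr = all(math.isclose(a, b, abs_tol=1e-12)
               for a, b in zip(clr(soil_ppm), clr(as_percent)))
same_alr = all(math.isclose(a, b, abs_tol=1e-12)
               for a, b in zip(alr(soil_ppm), alr(as_percent)))
print(same_clr, same_alr)
```

The isometric log-ratio transformation mentioned above is left out to keep the sketch short; it shares the same scale-invariance property.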

tuxette commented 10 months ago

> Achim I am happy if she joins us alongside Patrice.

I am afraid that would be too much for me, but I can help in organizing things for the "bioinformatics" part. However, I think that Matthias's comments above have to be addressed first. I agree with most of them (but I am not an expert in CoDa), especially with comment 6 (which is in line with my previous comment) and also with the fact that the descriptions of packages are too long.

zeileis commented 10 months ago

Matthias @matthias-da, thank you for the thorough feedback, this is very much appreciated...and exactly what I would have hoped for. I agree with Nathalie @tuxette that this feedback should be incorporated first.

Michail @statlink, Matthias' feedback reflects why we push for a diverse team of co-maintainers. What feels completely obvious and natural for some readers might feel awkward for others. So rather than pushing for one side or the other, we try to make the task view accessible for all sorts of different readers from different backgrounds. Hence, establishing a mixed team is a good idea.

pkR-pkR commented 10 months ago

I agree.

Patrice

dutangc commented 10 months ago

Thanks @matthias-da .

Regarding your points:

  1. The view would benefit from any contribution, in particular from women, but I don't think that a contributor to the view has to take on the role of policeman.
  2. Do you suggest creating subsections for rounded zeros or true zeros?
  3. For people like me who do not know compositional analysis, we cannot expect readers of the view to know what the three principles are unless they are stated at the beginning. Is that what you want in the introduction?

Regarding your points 3, 5, ..., do you have a proposal for the structure/outline of the view?

matthias-da commented 10 months ago

ad 1. I wrote "the police" with a ;-) The theory of compositional data analysis is not that simple, and I outlined pitfalls in the intro. Thus, in my view, the CTV would benefit from someone who has outstanding theoretical knowledge (and who uses R in daily business and is from the "inner circle" of the compositional data community). In my view, somebody from the "Viennese/Czech group" (Peter Filzmoser, Karel Hron, Matthias Templ) and/or from the "German group" (best suited from this group is Raimon Tolosana-Delgado) and/or from the "Girona group" (best suited from this group is Javier Palarea-Albaladejo) should be part of it; at least this would be natural, looking at their achievements in the field. This doesn't mean that you are not experts, it's just to have somebody on board with the traditional (log-ratio analysis) view.

ad 2. Personally, I would create one section "Rounded zeros, structural zeros, count zeros and missing values" and make paragraphs for all these issues. One might also use the name "Preprocessing of compositional data" as an alternative.
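As background for why zeros get their own section: a log-ratio cannot be computed when a part is zero, so rounded zeros are usually replaced by a small value before the analysis. The sketch below uses the simple multiplicative replacement as a stand-in. It is an assumption chosen for illustration and not the model-based imputation offered by zCompositions or robCompositions; the function name, the sample and the value of `delta` are hypothetical.

```python
import math

def multiplicative_replacement(x, delta):
    """Replace zeros by `delta` and shrink the non-zero parts so the total is kept.

    Shrinking all non-zero parts by the same factor preserves the ratios
    between them, which is what log-ratio methods work with.
    """
    total = sum(x)
    n_zero = sum(1 for v in x if v == 0)
    shrink = 1.0 - n_zero * delta / total
    return [delta if v == 0 else v * shrink for v in x]

# Hypothetical composition whose second part fell below the detection limit.
sample = [0.60, 0.0, 0.40]
delta = 0.005  # hypothetical replacement value, e.g. a fraction of the detection limit

replaced = multiplicative_replacement(sample, delta)
print(replaced)
```

After the replacement all parts are strictly positive, the total is unchanged, and the ratio between the first and third part is the same as before.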

ad 3. As already written in my point (8), unfortunately, I have no answer to this question, but it should be discussed. If the principles are introduced at the beginning, one should probably flag methods that do not fulfill the three key principles of CoDa (scale invariance, subcompositional coherence (including subcompositional dominance and ratio preservation), and permutation invariance). I see how to deal with this as an open question. I tend towards not discussing this matter in the CTV, because it would involve a deep dive into all the methods listed.
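For readers who do not know these principles, a minimal numerical sketch may help. It is plain Python with made-up data and hypothetical helper names, not an excerpt from any package: ratios between parts survive the move to a subcomposition while raw shares do not, the Aitchison distance does not depend on the order of the parts, and it does not increase when a part is dropped.

```python
import math

def closure(x):
    """Rescale positive parts so that they sum to 1."""
    s = sum(x)
    return [v / s for v in x]

def clr(x):
    """Centered log-ratio coordinates."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Two hypothetical four-part compositions.
x = closure([10.0, 20.0, 30.0, 40.0])
y = closure([25.0, 25.0, 10.0, 10.0])

# Subcompositional coherence: drop the last part and re-close.
sub_x = closure(x[:3])
ratio_full, ratio_sub = x[0] / x[1], sub_x[0] / sub_x[1]  # ratio is preserved
share_full, share_sub = x[0], sub_x[0]                    # raw share changes

# Permutation invariance: reorder the parts of both compositions identically.
order = [2, 0, 3, 1]
d_full = aitchison_distance(x, y)
d_perm = aitchison_distance([x[i] for i in order], [y[i] for i in order])

# Subcompositional dominance: the distance cannot grow in a subcomposition.
d_sub = aitchison_distance(x[:3], y[:3])

print(ratio_full, ratio_sub, share_full, share_sub)
print(d_full, d_perm, d_sub)
```

A method that works on the raw shares, such as a correlation between proportions, would give different answers for the full composition and the subcomposition, which is the kind of incoherence described in point (8).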

ad. regarding the outline: I am not sure about the structure. I see several possibilities. One lists packages according to the type of methods (such as regression methods, compositional tables, robust methods, visualization, high-dimensional data, ...), and the other lists packages (also) based on application fields (such as omics science and bioinformatics, chemometrics, ecology, ...). Personally, I think the main categories should be built on the kind of methods, and there could be some extra sections for very specialized fields (or one could even subsume them in the previous sections and have 8) High-dimensional data as the last section).

Maybe something like this?

  1. General purpose packages
  2. Robust methods
  3. Rounded zeros, structural zeros, count zeros, and missing values
  4. Regression modelling
  5. Functional data analysis and probability density functions
  6. Contingency tables and compositional tables
  7. Visualization (?)
  8. Special applications in Omics science and bioinformatics (?) including high-dimensional data (?)
  9. Special applications in ecology (?)

However, there are other methods like cluster analysis, discriminant analysis and classification methods, principal component analysis, and correlation analysis. Why would they be less important than "regression analysis", for example? So should one extend the above list with (at least) another 4 sections on these methods? And why not also have a section on log-ratio (and other) transformations at the beginning? Another possible problem is that the packages compositions and robCompositions, for example, could be listed in almost all sections. I think this all needs further discussion, and I am afraid that it might take time to find a good solution.

Another idea is to have a similar structure on sections like the sections in the books of CoDa:

pkR-pkR commented 10 months ago

Dear Achim, dear all,

I started rewriting the task view this weekend, taking into consideration the useful remarks from Matthias. This is a side activity for me and I plan to complete the new version by the end of the week. Please give me time. I will also wait for Nathalie's suggestions and then add the suggested packages to the task view. Let's wait for the second draft to be completed before we seek new contributors.

I have to leave and will be out of my office all day. I will be able to read your remarks only in the evening. Best regards to all.

Patrice Kiener

zeileis commented 10 months ago

Patrice @pkR-pkR thanks for this! There's no rush.

tuxette commented 10 months ago

@pkR-pkR : No rush indeed. Since Matthias has suggested deep modifications, tell me when you have a first version and I'll make my suggestion on that basis (next week at best probably).

tuxette commented 4 months ago

@pkR-pkR : There has been no activity in this discussion since last October. There is no rush, but I'm checking whether you still plan to submit this proposal?

matthias-da commented 4 months ago

Alternative: I can imagine completely rewriting this CTV from scratch together with Raimon Tolosana-Delgado and Javier Palarea-Albaladejo. Both are experts in compositional data analysis and R, and are well known in the community. What do you think?

zeileis commented 4 months ago

From the viewpoint of the CRAN Task View Editors it would be best if the different approaches to this topic could be resolved unanimously - with contributors from both sides! So maybe - now that some time has passed since the original proposal - you can coordinate a revision that you do jointly and that encompasses ideas from both sides?

That would be much preferred over a decision between two different teams of co-maintainers with different ideas.

matthias-da commented 4 months ago

Agreed. In October I offered my participation and also listed other potential co-authors, and that is surely still a good way to proceed.

Best Matthias

zeileis commented 4 months ago

Thanks, Matthias, very much appreciated!