JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

MIT License

55 stars 13 forks

"Pick one word per concept" is violated in "symbolizing" stuff #140

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

Hm, I am thinking that we violate the "pick one word per concept" principle. We mix things: alphabet, length, symbols, words, letters, state-space properties...

I think we should decide on one word for the concept "symbol" or "letter" or whatever. While "symbol" may sometimes be confused with Base.Symbol, I guess it is the best option. Alphabet terminology can get confusing, because then you'd expect a symbol of the alphabet to be a letter, or whatever. Then again, we could use the word "event", which doesn't conflict with Base and is general enough. Actually, we already use "event" in probabilities_and_events (which, by the way, we haven't actually listed in the documentation).

Okay, so here is what I propose:

- Exclusively use the word "event" when referring to the symbols or letters. Never use the word alphabet, letter, word, or symbol.
- Accordingly, rename missing_symbols to missing_events and alphabet_length to total_events.
- symbolize becomes just events, since, well, that's what it does: it gives you the events.
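
To make the proposed names concrete, here is a rough Python sketch of what the three renamed functions would compute. This is not the package's Julia API: the functions `events`, `total_events`, and `missing_events` below are hypothetical stand-ins, and the choice of ordinal patterns of length `m` as the discretization is an assumption made only for illustration.

```python
from itertools import permutations


def ordinal_pattern(window):
    """Return the permutation that sorts the window (its ordinal pattern)."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))


def events(x, m=3):
    """Hypothetical stand-in for `symbolize`: one event per length-m window."""
    return [ordinal_pattern(x[t:t + m]) for t in range(len(x) - m + 1)]


def total_events(m=3):
    """Hypothetical stand-in for `alphabet_length`: all m! possible patterns."""
    return len(list(permutations(range(m))))


def missing_events(x, m=3):
    """Hypothetical stand-in for `missing_symbols`: patterns never observed in x."""
    return total_events(m) - len(set(events(x, m)))


x = [1, 3, 2, 4, 6, 5, 7]
observed = events(x)       # five windows, three distinct patterns
n_total = total_events()   # 3! = 6 possible patterns
n_missing = missing_events(x)
```

With this toy series, three of the six possible patterns never occur, so `n_missing` is 3.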

What do you think?

p.s.: The pick-one-word-per-concept idea isn't originally mine, see "Clean Code: A Handbook of Agile Software Craftsmanship"

kahaaga commented 1 year ago

In summary, I agree that we should use established terminology, and I like the idea of using "event", because it's already an established term in the context of probability spaces. Since we're pursuing a more meaningful terminology, we should stick with existing probability-related terminology, i.e. from some reference textbook.

I have a quite lengthy comment/suggestion that I started writing here, but I think it is best summarised as a documentation page outlining our choice of terminology. I'll summarize my suggestions as a PR (not making any actual code changes, just including a new documentation page with some terminology rationale), and we can discuss from there.

kahaaga commented 1 year ago

It is also important that we reach an agreement on this before submitting the JOSS paper.

Datseris commented 1 year ago

offtopic but I don't think we should go for JOSS, we should go for Chaos. Inferior entropy software has been published in Chaos so we most definitely can publish this there. Perhaps do some cross evaluation of some methods on some aspect or whatever.

Datseris commented 1 year ago

we can e.g., use this new "missing patterns surrogates" for all methods and see which performs the best. It sounds like new research and it would take us a day or two.

kahaaga commented 1 year ago

> offtopic but I don't think we should go for JOSS, we should go for Chaos. Inferior entropy software has been published in Chaos so we most definitely can publish this there. Perhaps do some cross evaluation of some methods on some aspect or whatever. we can e.g., use this new "missing patterns surrogates" for all methods and see which performs the best. It sounds like new research and it would take us a day or two.

Yeah, I think Chaos is a good choice. With the machinery we're releasing for v2.0, there is potential for a lot of new research. So we could potentially frame the paper as a "tool for new research in probabilities and information theoretic methods" (or something cheesy like that), and just supplement with a few use cases, like the missing patterns.

kahaaga commented 1 year ago

@Datseris Ok, I've implemented some suggested changes in PR #141.

In summary:

- I vote for using "outcomes" instead of "events". This is rooted in the fact that in probability theory, an "outcome" is a single possible result of an experiment, while an "event" can be a set of outcomes (something more complex). In this package, we strictly estimate probabilities of elementary outcomes, not compound events consisting of multiple outcomes. I also added a slightly more formal version of this explanation in the "Probabilities" docpage.
- Use Discretization as a supertype for OrdinalPattern and GaussianSymbolization, because that's what these are doing: they take some data and discretize it in some manner.
- Explained the design philosophy in the docstring for ProbabilitiesEstimator.
- Cross-referenced all probabilities estimators in the docstring for ProbabilitiesEstimator.
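
The outcome/event distinction in the first point can be shown with a small Python sketch. It is only an illustration and not code from the package or the PR: the equal-width binning in `discretize` and the helper names `outcome_probabilities` and `event_probability` are assumptions chosen for the example.

```python
from collections import Counter
from fractions import Fraction


def discretize(x, n_bins, lo, hi):
    """Map each value to the index of its equal-width bin on [lo, hi]."""
    width = (hi - lo) / n_bins
    # Clamp so that a value equal to `hi` lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in x]


def outcome_probabilities(outcomes):
    """Relative frequency of each elementary outcome (here: each bin index)."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {o: Fraction(c, n) for o, c in counts.items()}


def event_probability(probs, event):
    """An event is a set of outcomes; its probability is the sum over its members."""
    return sum((probs.get(o, Fraction(0)) for o in event), Fraction(0))


x = [0.1, 0.2, 0.35, 0.6, 0.8, 0.9, 0.55, 0.05]
outcomes = discretize(x, n_bins=4, lo=0.0, hi=1.0)
probs = outcome_probabilities(outcomes)

# Compound event "the value falls in the upper half", i.e. bins 2 and 3.
p_upper = event_probability(probs, {2, 3})
```

Here `probs` holds one probability per bin (the elementary outcomes), whereas `p_upper` is the probability of an event built from two of them. Only quantities of the first kind are what the comment above says the package estimates.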

With these points, I think I've addressed all potential issues I would point out as a reviewer of the software. But then again, much of this comes down to preference.

What do you think?

kahaaga commented 1 year ago

Closed with #141