esl / MongooseIM

MongooseIM is Erlang Solutions' robust, scalable and efficient XMPP server, aimed at large installations. Specifically designed for enterprise purposes, it is fault-tolerant and can utilise the resources of multiple clustered machines.
Other
1.64k stars 422 forks source link

Accumulator refactoring - setters, getters and specs #1859

Closed arkgil closed 4 years ago

arkgil commented 6 years ago

The intention of this PR is to introduce more "safe" interface to the accumulator module. While accumulators proved to be very useful when values need to be exchanged between multiple modules or for debugging purposes (e.g. registering hook calls and send results), I believe that they're underused due to lack of confidence when working with them - we never know if an accumulator field is there and what is its value.

My goal is to remove completely the "generic" part of the accumulator - no more get, put and update. All fields should have explicit, documented and typespeced getters and setters.

Changes so far

The code in this PR won't pass any tests - I have refactored maybe 1/3 of the whole accumulator usages. Changes so far include:

Concerns so far

I'd really appreciate anyone's input on these topic.

TODOs and questions for the future

Here is a list of things which I've found that are left to do. Please point out anything that you think is missing here:

Properties

Functions

But is it worth it?

As you can see, these modifications would bring huge changes to the code base. IMO it's worth it, I'd definitely fill more confident when using accumulators, but I'd like us all to decide.

bartekgorny commented 6 years ago

getters currently raise an error when there is no property set.

how about getters with a default value?

How from differs from user and server

I think 'from' is a jid while 'user' and 'server' are parts of it. Such distinction, with the same naming, is used in many places in the old code.

All fields should have explicit, documented and typespeced getters and setters.

Personally I'd prefer pattern matching. Having typespecs is good, but pattern matching is way more readable.

Now, to the point:

  1. Part of the bloat stems from the fact that hook handlers return values by adding them to the accumulator, and there is no naming requirement, so it can be 'result', but can also be 'iq_result' or whatever else, provided the caller knows what it is going to be. Some of them may be reused later, some not.

  2. The problem indeed is that acc api does not impose any logic, so you can store anything you want there. Which, quite understandably, prompted developers of global distribution to escape the mess by storing all their stuff under a separate key. In a while, mod_amp and mod_mam will be doing the same.

  3. To summarise what the acc really stores: a) some hard-wired stuff which never changes or disappears, used mostly for debugging and profiling (ref, timestamp) b) the stanza which started the accumulator c) some values derived from the initial stanza, for easier access d) other attributes denoting the accumulator's origin (from and its derivatives) e) results of hook calls, usually as 'result', volatile f) cached results of some operations, e.g. privacy check g) cached info derived from initial stanza (e.g. iq_query_info) h) additional debugging info, collected along the way (hook_runs, handler_runs, send_result) i) arbitrary values which are stored in one place to be used somewhere else (global_distrib, and mod_amp will certainly do it)

Plus, some of those need to be passed to the other process (not stripped), those are 'persistent'. This is indeed quite a lot of use cases for a single entity, and designing an api to make it clean and consistent is a non-trivial task.

  1. IMO if global_distrib is to be persistent it should be set as a persistent property, not added separately. This is the kind of bloat that current architecture seems to be causing.

  2. return_type is a somewhat hackish way to stop infinite iq error loop. It is tested in a sense that there is a test which triggers the loop, and without the return_type safeguard mongoose would run out of memory and crash. Fixing acc caching logic will solve the deeper cause of the problem, then the return_type will not be needed anymore.

bartekgorny commented 6 years ago

This one may be handy: picture

arkgil commented 6 years ago

getters currently raise an error when there is no property set.

What I don't like about default values is that they don't have semantics. What does the default mean? Also I believe it would allow more relaxed typespecs, because dialyzer would see that any value can be returned by the getter.

I think 'from' is a jid while 'user' and 'server' are parts of it. Such distinction, with the same naming, is used in many places in the old code.

I don't think that's the case, user and server are sometimes set in places where from is not. There is a difference here, but I'm not sure what it is.

Personally I'd prefer pattern matching. Having typespecs is good, but pattern matching is way more readable.

Current implementation of accumulators can be used to pattern match, but I haven't found a single piece of code which uses that in Mongoose. Additionally, allowing to pattern-match defeats the point of opaqueness of the type, in which case, I suppose, we will be using it as a map.

Re: 1. This sentence

it can be 'result', but can also be 'iq_result' or whatever else, provided the caller knows what it is going to be

and this one

Some of them may be reused later, some not.

are exactly the reasons why I'd like to refactor the accumulator. result, or iq_result have no meaning at all or meaning depending hugely on the context they're used in. In addition, I may call some hook, handler puts its result in the result field, and then other handler may override it. Or other hook which is ran after the first one. I, as a caller, have no idea what might be inside.

I think that if hook handlers return a meaningful result, let's make it an explicit accumulator field. Or, let's scope results to hook names, so that different hooks don't override their results.

And as a final note, there are only 4 places in the whole codebase where the result is retrieved from the accumulator.

Re: 2. And that's exactly the point of this PR, to escape the mess 😄

Re: 3

a) But has it ever been used for those purposes?

That point is a really nice summarization 🙂

Re: 4

I'd remove the special persistent_properties at all. In fact, only origin_jid is put there, so why not add it to the list of not-stripped attributes.

Re: 5

Could you point me to the place where this error occures? Has it ever manifested itself? I guess simpler solution is to check if the element's name passed to jlib:make_error_reply is "iq" and that it has "error" type. If I wouldn't know that function, I could pass fresh accumulator there every time and the said infinite loop of errors would occur anyway.

arkgil commented 6 years ago

@bartekgorny that's a dead link.

bartekgorny commented 6 years ago

Link fixed. About pattern matching, I tried to say that I prefer get(attr, Container) over get_attr(Container).

Has it ever manifested itself?

Oh yes, it killed some production instances of MongooseIM. But this is not the scope of this PR, let's discuss it somewhere else.

arkgil commented 6 years ago

I just want to recap what is the higher-level purpose of these changes.

I think, that accumulators in their current shape give too much freedom and carry the level of uncertainty which makes some developers (myself included) very unconfident when working with them. Currently, any value can be put under any key, and that value can be overriden at any time.

With the new shape of the accumulator, every developer when adding new field would need to figure out 4 things:

  1. what is a purpose of that field
  2. what is a type of that field
  3. is that field immutable
  4. is that field always set, or maybe it appears only in some special cases

Those 4 concerns would need to be expressed in code in the following ways:

  1. Documentation for this fields' accessors
  2. Typespecs for this fields' accessors
  3. Implementation of the setter - if the field is immutable ideally there wouldn't be setter for it and it would be added to accumulator only via constructor. That's not always possible (e.g. MAM receives already created accumulator), but setter for immutable field should raise an error when it's set for the second time.
  4. Again, documentation and lack of the setter

Without generic get/2 and set/3 developers would need to at least define accessors. Hopefully, without documentation and typespecs the changes wouldn't pass PR review.

Thanks to this imposed discipline other developers (or the same developer a month later) would have confidence in what is inside the accumulator and whether they can assume that a property is set.

All of this means that the new API can't be generic. All functions need to have specific purpose.

bartekgorny commented 6 years ago

Sounds reasonable. One minor addition:

  1. what is a purpose of that field
  2. what is a type of that field
  3. is that field immutable
  4. is that field always set, or maybe it appears only in some special cases
  1. Should this field be carried on to the other c2s
arkgil commented 6 years ago

@bartekgorny we can't assume that accumulator is in the "c2s" context in the first place. It can be HTTP handler process, or a CLI process.

I prefer get(attr, Container) over get_attr(Container).

Why is that the case? It's the same number of characters 😄

bartekgorny commented 5 years ago

I think we should allow for some non-predefined attrs, for the following reason: one of the nice things about MIM is it's hook/handler based architecture - you can add a custom module with its own hook handlers and register it in config, without modifying core code of MIM. Your handler may need to store something in the acc to be read by another handler. If we impose a strict requirement that every acc variable be predefined in mongoose_acc, we'd throw away this flexibility. Still, I'm all for setting up a dedicated place and maybe api for such custom storage.

erszcz commented 5 years ago

What do you think of accessor naming used by the QT API? The idea is that the getters and setters are called the same and only differ by arity (in case of QT which is in C++ by signature, but close enough). I mean that instead of:

get_some_field(Struct) -> ...

set_some_field(Struct, NewFieldValue) -> ...

we could have

some_field(Struct) -> ...

some_field(Struct, NewFieldValue) -> ...

My impression is that the get_/set_ prefixes carry little to no value, while having to type them is tedious of itself and also makes using autocompletion less convenient (having to type the prefix which is identical for a lot of functions before hitting the magic give me completions button). What do you think?

arkgil commented 5 years ago

@erszcz Yes, I was thinking about it but I went with get/set route so that it's clearly visible what different functions do. I like the idea, though, and I'd go for it.

michalwski commented 5 years ago

Few cents from me:

  1. There is indeed a need for immutable fields, not all of them are known when the acc is created, though. A special setter is needed for that.
  2. Accessing fields through accessors seems like a good idea to me, it for sure simplifies reasoning about the code.
  3. It's good idea to have a hook name tagged handler results, and not use result or iq_result from different hooks.
  4. Some flexibility in acc fields is needed, mainly to make it easier to add a custom hook in not open source code without changing core implementation. This could be done as a misc field which is also a map. In open source code such a field should be empty in my opinion.
  5. The idea with same_field/1 for getting the value and some_field/2 for setting the value seems good to me.
arkgil commented 5 years ago

@michalwski the problem I see with misc field is that everyone will use it instead of adding a dedicated field. This may sound brutal, but in my opinion the discipline here needs to be imposed from the very beginning.

michalwski commented 5 years ago

@arkgil that's why I said that in open source Mongoose this should be empty and probably not used. I just want to make it as easy as possible for anyone using MongooseIM to add their own hook or handler when needed. If they put sth custom in the misc field then it will be easier for them to update their code to MongooseIM's master. If they have to modify core Mongoose code around accumulators then staying up-to-date with open source MongooseIM may be harder for them. That's my main concern.

arkgil commented 5 years ago

@michalwski that makes sense, you're right. Although I still think that usage of this field will be abused :P

kzemek commented 5 years ago

I have one bar that I thing a good accumulator implementation must pass: it works with global distrib and there is no mention of global distrib anywhere in mongoose_acc.erl.

Related: module-specific fields like amp_check_result, privacy_check - should IMO also not be present in good accumulator implementation.

E.g. add a macro so that each mod can put/read things to/from accumulator inside its own namespace (?MODULE?), and other code can explicitly access other namespaces if needed. But I'm of a really strong opinion that the accumulator design cannot both be good and tightly coupled to mods -- choose one.

bartekgorny commented 5 years ago

I still think that usage of this field will be abused

That's what code reviewers are for - to prevent such abuse.

arcusfelis commented 5 years ago

set_user_jid is easier to grep (less noise).

erszcz commented 5 years ago

set_user_jid is easier to grep (less noise)

@arcusfelis Unless you import the function, you'd grep for mongooseim_acc:user_jid, wouldn't you? If you import, you can still grep for import.mongooseim_acc and building a regex alternative for the two is also an option.

bartekgorny commented 5 years ago

One more point for discussion: we want some attrs to be immutable, others append-only, some read-write etc., there is going to be quite a few options. Who is going to decide how a given attr should behave, and how - is it going to be hardcoded in mongoose_acc, or do those attrs have separate apis, or maybe it is an option for the setter?

mongoose_acc:thisattr(Value, Acc)  % thisattr is defined as immutable
or
mongoose_acc:thisattr(Value,  Acc [immutable])
or something else?
fenek commented 5 years ago

My humble suggestion. Personally I'm for flexible API but a one that makes it harder to clutter acc. It means lack of flat namespace. :)

#{
    %% No "global" ref for acc!
    location => #{
        timestamp => Timestamp,
        module => Module,
        function => Function,
        line => Line
    },

    origin => #{
        ref => Ref,
        el => El, %% XML element that triggered the processing
        iq_info => Info,
        jid => Jid
        % ... and some other cached values
    },

    stanza => #{
        ref = Ref2 %% Updated every time when 'el' key is updated
        %% Exactly (or almost) the same schema as origin
    },

    %% Has to be load tested
    hooks => #{
        history => [ {last_hook, result}, {previous_hook, result}, {two_hooks_ago, result} ]
    },

    %% Now time for per-module groups
    %% Maintained by convention, not enforcing API
    mod_global_distrib => #{
        key => val,
        strippable => false
    },
    mod_offline => #{
        stored => true,
        strippable => true
    }
}

%% MACROS:

-define(new_acc(), mongoose_acc:new(#{module => ?MODULE, function => ?FUNCTION_NAME, line => ?LINE})).
-define(new_acc(Element), mongoose_acc:new(#{module => ?MODULE, function => ?FUNCTION_NAME, line => ?LINE, el => Element})).

%% API:

-spec update(Acc :: acc(),
             Group :: term(), % atom?
             KV :: map() | [{K :: term(), V :: term()}]) -> acc().
update(_Acc, location, KV) ->
    error(immutable);
update(Acc, origin, KV) ->
    error_if_any_key_exists_in_this_group;
update(Acc, Group, KV) ->
    update.

update(Group, Key, Val) -> update(Group, [{Key, Val}]).

'get'(Acc, Group, Key)

'get'(Acc, Group, Key, Default)

%% Retain API for counters?

strip(Acc)

%% Further functions...
fenek commented 5 years ago

Discussion conclusions:

  1. Strict setters and getters for predefined information (small set)
  2. to_send element will be accessible with separate functions too (while being optional), to make tracing easier
  3. All mandatory values must be set in constructor.
  4. All remaining values set by constructor are overridable.
  5. Shared properties are immutable
  6. Flexible setters and getters for per-module namespaces [Flexible = get(Acc, NS, Key) instead of get_Key(Acc)]
  7. Module namespaces may be defined as non-strippable
  8. Macros: don’t use them for now; we’ll introduce them if the code looks unreadable

Action item:

Find out which fields should be mandatory.

arcusfelis commented 5 years ago

Often in code we have proplists:get_value(key, List, default). It's useful to have something similar for accs.

fenek commented 4 years ago

Superseded by acc 2.0