arkgil commented 6 years ago

The intention of this PR is to introduce more "safe" interface to the accumulator module. While accumulators proved to be very useful when values need to be exchanged between multiple modules or for debugging purposes (e.g. registering hook calls and send results), I believe that they're underused due to lack of confidence when working with them - we never know if an accumulator field is there and what is its value.

My goal is to remove completely the "generic" part of the accumulator - no more get, put and update. All fields should have explicit, documented and typespeced getters and setters.

Changes so far

The code in this PR won't pass any tests - I have refactored maybe 1/3 of the whole accumulator usages. Changes so far include:

When accumulator is created, module, function and line number need to be passed to new/3,4,5,6. This help immensely when debugging, when we're not sure if accumulator comes from HTTP interface, c2s, external component etc.
The element attribute can be only set as a whole, i.e. the whole record, its name, type and attributes are always in sync. There is a set_element/2 function, but it should be removed and the stanza should be an immutable field, set only when creating an accumulator.
from and to properties can be set only when creating an accumulator, they are immutable due to the lack of setters.
iq_query_info, as previously, is computed lazily and stored in the accumulator and is always in sync with corresponding exml:element() (although that wouldn't be an issue if element was immutable)
server is a regular, mutable property

Concerns so far

I'd really appreciate anyone's input on these topic.

getters currently raise an error when there is no property set. However, sometimes we'd like to perform some logic based on the fact that the property is not there. We can do it in two ways:
- make all getters return {ok, ...} | error. The downside is that we lose nice error message and instead will end up with badmatches in cases when we're willing to crash when property is not there.
- add additional has_ function for each property, but that would make the interface really bloated. The upside is that we retain a nice error message when the property is not there.
element should be completely immutable, it should be set to the stanza entering the node and initiating the "routing chain"
server and user attributes should be completely immutable as well. They should represent the user and server of the user who initiated the current routing chain (in other words, user and server of the current c2s process or HTTP handler or CLI process). Maybe we could name those properties differently?
How from differs from user and server?
to should be completely immutable as well.

TODOs and questions for the future

Here is a list of things which I've found that are left to do. Please point out anything that you think is missing here:

Properties

[ ] result property -
[ ] subscription_lists property - mutable, used by roster_get_subscription_lists hook handlers.
[ ] roster property - filled by modules building roster. It should be mutable, getter should return empty list if the value hasn't been set yet.
[ ] amp_check_result property - also mutable
[ ] privacy_check property - immutable per privacy check cache key, see https://github.com/esl/MongooseIM/blob/master/src/mongoose_privacy.erl#L65
[ ] send_result property - append-only list of records of sending the stanza. I'm not sure if anyone is using that, maybe we should remove it? It's used in https://github.com/esl/MongooseIM/blob/master/src/ejabberd_c2s.erl#L1769, but from what I see its value doesn't matter anyway. It's probably only useful for debugging.
[ ] ref property - immutable, set on creation. Again, I'm not sure if we need it, but AFAIK @fenek has used it in the past
[ ] timestamp property - immutable, set on creation. I don't know anyone who's used it, I guess we could remove it.
[ ] global_distrib property - regular, mutable property
[ ] return_type - used by jlib in https://github.com/esl/MongooseIM/blob/master/src/jlib.erl#L150. To me it could be removed - based on that jlib decides whether IQ passed to the linked function is already an error - but one could pass unrelated accumulator and error IQ there, and the function won't return an error. I saw that these lines are covered by tests, but I believe it comes only from unit tests which call it directly.
[ ] iq_result property. I'm not sure if it has any meaningful usage. I see that it's set in a couple of places in mod_muc_light but it's only retrieved in c2s. Maybe someone know what it's created for.
[ ] offline_messages - append only list, should be removable because c2s clears them
[ ] hook_runs and handler_runs - record fired hooks and handlers, probably useful for debugging. They're not retrieved anywhere.

Functions

[ ] - strip - we should figure out which elements are worth to keep persistent. Currently persistent properties are: ref, timestamp, from, to, global_distrib and persistent_properties (of which at this point only origin_jid is used). Do we want to make this list smaller?

But is it worth it?

As you can see, these modifications would bring huge changes to the code base. IMO it's worth it, I'd definitely fill more confident when using accumulators, but I'd like us all to decide.

bartekgorny commented 6 years ago

getters currently raise an error when there is no property set.

how about getters with a default value?

How from differs from user and server

I think 'from' is a jid while 'user' and 'server' are parts of it. Such distinction, with the same naming, is used in many places in the old code.

All fields should have explicit, documented and typespeced getters and setters.

Personally I'd prefer pattern matching. Having typespecs is good, but pattern matching is way more readable.

Now, to the point:

Part of the bloat stems from the fact that hook handlers return values by adding them to the accumulator, and there is no naming requirement, so it can be 'result', but can also be 'iq_result' or whatever else, provided the caller knows what it is going to be. Some of them may be reused later, some not.
The problem indeed is that acc api does not impose any logic, so you can store anything you want there. Which, quite understandably, prompted developers of global distribution to escape the mess by storing all their stuff under a separate key. In a while, mod_amp and mod_mam will be doing the same.
To summarise what the acc really stores: a) some hard-wired stuff which never changes or disappears, used mostly for debugging and profiling (ref, timestamp) b) the stanza which started the accumulator c) some values derived from the initial stanza, for easier access d) other attributes denoting the accumulator's origin (from and its derivatives) e) results of hook calls, usually as 'result', volatile f) cached results of some operations, e.g. privacy check g) cached info derived from initial stanza (e.g. iq_query_info) h) additional debugging info, collected along the way (hook_runs, handler_runs, send_result) i) arbitrary values which are stored in one place to be used somewhere else (global_distrib, and mod_amp will certainly do it)

Plus, some of those need to be passed to the other process (not stripped), those are 'persistent'. This is indeed quite a lot of use cases for a single entity, and designing an api to make it clean and consistent is a non-trivial task.

IMO if global_distrib is to be persistent it should be set as a persistent property, not added separately. This is the kind of bloat that current architecture seems to be causing.
return_type is a somewhat hackish way to stop infinite iq error loop. It is tested in a sense that there is a test which triggers the loop, and without the return_type safeguard mongoose would run out of memory and crash. Fixing acc caching logic will solve the deeper cause of the problem, then the return_type will not be needed anymore.

bartekgorny commented 6 years ago

This one may be handy: picture

arkgil commented 6 years ago

getters currently raise an error when there is no property set.

What I don't like about default values is that they don't have semantics. What does the default mean? Also I believe it would allow more relaxed typespecs, because dialyzer would see that any value can be returned by the getter.

I think 'from' is a jid while 'user' and 'server' are parts of it. Such distinction, with the same naming, is used in many places in the old code.

I don't think that's the case, user and server are sometimes set in places where from is not. There is a difference here, but I'm not sure what it is.

Personally I'd prefer pattern matching. Having typespecs is good, but pattern matching is way more readable.

Current implementation of accumulators can be used to pattern match, but I haven't found a single piece of code which uses that in Mongoose. Additionally, allowing to pattern-match defeats the point of opaqueness of the type, in which case, I suppose, we will be using it as a map.

Re: 1. This sentence

it can be 'result', but can also be 'iq_result' or whatever else, provided the caller knows what it is going to be

and this one

Some of them may be reused later, some not.

are exactly the reasons why I'd like to refactor the accumulator. result, or iq_result have no meaning at all or meaning depending hugely on the context they're used in. In addition, I may call some hook, handler puts its result in the result field, and then other handler may override it. Or other hook which is ran after the first one. I, as a caller, have no idea what might be inside.

I think that if hook handlers return a meaningful result, let's make it an explicit accumulator field. Or, let's scope results to hook names, so that different hooks don't override their results.

And as a final note, there are only 4 places in the whole codebase where the result is retrieved from the accumulator.

Re: 2. And that's exactly the point of this PR, to escape the mess 😄

Re: 3

a) But has it ever been used for those purposes?

That point is a really nice summarization 🙂

Re: 4

I'd remove the special persistent_properties at all. In fact, only origin_jid is put there, so why not add it to the list of not-stripped attributes.

Re: 5

Could you point me to the place where this error occures? Has it ever manifested itself? I guess simpler solution is to check if the element's name passed to jlib:make_error_reply is "iq" and that it has "error" type. If I wouldn't know that function, I could pass fresh accumulator there every time and the said infinite loop of errors would occur anyway.

arkgil commented 6 years ago

@bartekgorny that's a dead link.

bartekgorny commented 6 years ago

Link fixed. About pattern matching, I tried to say that I prefer get(attr, Container) over get_attr(Container).

Has it ever manifested itself?

Oh yes, it killed some production instances of MongooseIM. But this is not the scope of this PR, let's discuss it somewhere else.

arkgil commented 6 years ago

I just want to recap what is the higher-level purpose of these changes.

I think, that accumulators in their current shape give too much freedom and carry the level of uncertainty which makes some developers (myself included) very unconfident when working with them. Currently, any value can be put under any key, and that value can be overriden at any time.

With the new shape of the accumulator, every developer when adding new field would need to figure out 4 things:

what is a purpose of that field
what is a type of that field
is that field immutable
is that field always set, or maybe it appears only in some special cases

Those 4 concerns would need to be expressed in code in the following ways:

Documentation for this fields' accessors
Typespecs for this fields' accessors
Implementation of the setter - if the field is immutable ideally there wouldn't be setter for it and it would be added to accumulator only via constructor. That's not always possible (e.g. MAM receives already created accumulator), but setter for immutable field should raise an error when it's set for the second time.
Again, documentation and lack of the setter

Without generic get/2 and set/3 developers would need to at least define accessors. Hopefully, without documentation and typespecs the changes wouldn't pass PR review.

Thanks to this imposed discipline other developers (or the same developer a month later) would have confidence in what is inside the accumulator and whether they can assume that a property is set.

All of this means that the new API can't be generic. All functions need to have specific purpose.

bartekgorny commented 6 years ago

Sounds reasonable. One minor addition:

what is a purpose of that field

what is a type of that field

is that field immutable

is that field always set, or maybe it appears only in some special cases

Should this field be carried on to the other c2s

arkgil commented 6 years ago

@bartekgorny we can't assume that accumulator is in the "c2s" context in the first place. It can be HTTP handler process, or a CLI process.

I prefer get(attr, Container) over get_attr(Container).

Why is that the case? It's the same number of characters 😄

bartekgorny commented 5 years ago

I think we should allow for some non-predefined attrs, for the following reason: one of the nice things about MIM is it's hook/handler based architecture - you can add a custom module with its own hook handlers and register it in config, without modifying core code of MIM. Your handler may need to store something in the acc to be read by another handler. If we impose a strict requirement that every acc variable be predefined in mongoose_acc, we'd throw away this flexibility. Still, I'm all for setting up a dedicated place and maybe api for such custom storage.

erszcz commented 5 years ago

What do you think of accessor naming used by the QT API? The idea is that the getters and setters are called the same and only differ by arity (in case of QT which is in C++ by signature, but close enough). I mean that instead of:

get_some_field(Struct) -> ...

set_some_field(Struct, NewFieldValue) -> ...

we could have

some_field(Struct) -> ...

some_field(Struct, NewFieldValue) -> ...

My impression is that the get_/set_ prefixes carry little to no value, while having to type them is tedious of itself and also makes using autocompletion less convenient (having to type the prefix which is identical for a lot of functions before hitting the magic give me completions button). What do you think?

arkgil commented 5 years ago

@erszcz Yes, I was thinking about it but I went with get/set route so that it's clearly visible what different functions do. I like the idea, though, and I'd go for it.

michalwski commented 5 years ago

Few cents from me:

There is indeed a need for immutable fields, not all of them are known when the acc is created, though. A special setter is needed for that.
Accessing fields through accessors seems like a good idea to me, it for sure simplifies reasoning about the code.
It's good idea to have a hook name tagged handler results, and not use result or iq_result from different hooks.
Some flexibility in acc fields is needed, mainly to make it easier to add a custom hook in not open source code without changing core implementation. This could be done as a misc field which is also a map. In open source code such a field should be empty in my opinion.
The idea with same_field/1 for getting the value and some_field/2 for setting the value seems good to me.

arkgil commented 5 years ago

@michalwski the problem I see with misc field is that everyone will use it instead of adding a dedicated field. This may sound brutal, but in my opinion the discipline here needs to be imposed from the very beginning.

michalwski commented 5 years ago

@arkgil that's why I said that in open source Mongoose this should be empty and probably not used. I just want to make it as easy as possible for anyone using MongooseIM to add their own hook or handler when needed. If they put sth custom in the misc field then it will be easier for them to update their code to MongooseIM's master. If they have to modify core Mongoose code around accumulators then staying up-to-date with open source MongooseIM may be harder for them. That's my main concern.

arkgil commented 5 years ago

@michalwski that makes sense, you're right. Although I still think that usage of this field will be abused :P

kzemek commented 5 years ago

I have one bar that I thing a good accumulator implementation must pass: it works with global distrib and there is no mention of global distrib anywhere in mongoose_acc.erl.

Related: module-specific fields like amp_check_result, privacy_check - should IMO also not be present in good accumulator implementation.

E.g. add a macro so that each mod can put/read things to/from accumulator inside its own namespace (?MODULE?), and other code can explicitly access other namespaces if needed. But I'm of a really strong opinion that the accumulator design cannot both be good and tightly coupled to mods -- choose one.

bartekgorny commented 5 years ago

I still think that usage of this field will be abused

That's what code reviewers are for - to prevent such abuse.

arcusfelis commented 5 years ago

set_user_jid is easier to grep (less noise).

erszcz commented 5 years ago

set_user_jid is easier to grep (less noise)

@arcusfelis Unless you import the function, you'd grep for mongooseim_acc:user_jid, wouldn't you? If you import, you can still grep for import.mongooseim_acc and building a regex alternative for the two is also an option.

bartekgorny commented 5 years ago

One more point for discussion: we want some attrs to be immutable, others append-only, some read-write etc., there is going to be quite a few options. Who is going to decide how a given attr should behave, and how - is it going to be hardcoded in mongoose_acc, or do those attrs have separate apis, or maybe it is an option for the setter?

mongoose_acc:thisattr(Value, Acc)  % thisattr is defined as immutable
or
mongoose_acc:thisattr(Value,  Acc [immutable])
or something else?

fenek commented 5 years ago

My humble suggestion. Personally I'm for flexible API but a one that makes it harder to clutter acc. It means lack of flat namespace. :)

#{
    %% No "global" ref for acc!
    location => #{
        timestamp => Timestamp,
        module => Module,
        function => Function,
        line => Line
    },

    origin => #{
        ref => Ref,
        el => El, %% XML element that triggered the processing
        iq_info => Info,
        jid => Jid
        % ... and some other cached values
    },

    stanza => #{
        ref = Ref2 %% Updated every time when 'el' key is updated
        %% Exactly (or almost) the same schema as origin
    },

    %% Has to be load tested
    hooks => #{
        history => [ {last_hook, result}, {previous_hook, result}, {two_hooks_ago, result} ]
    },

    %% Now time for per-module groups
    %% Maintained by convention, not enforcing API
    mod_global_distrib => #{
        key => val,
        strippable => false
    },
    mod_offline => #{
        stored => true,
        strippable => true
    }
}

%% MACROS:

-define(new_acc(), mongoose_acc:new(#{module => ?MODULE, function => ?FUNCTION_NAME, line => ?LINE})).
-define(new_acc(Element), mongoose_acc:new(#{module => ?MODULE, function => ?FUNCTION_NAME, line => ?LINE, el => Element})).

%% API:

-spec update(Acc :: acc(),
             Group :: term(), % atom?
             KV :: map() | [{K :: term(), V :: term()}]) -> acc().
update(_Acc, location, KV) ->
    error(immutable);
update(Acc, origin, KV) ->
    error_if_any_key_exists_in_this_group;
update(Acc, Group, KV) ->
    update.

update(Group, Key, Val) -> update(Group, [{Key, Val}]).

'get'(Acc, Group, Key)

'get'(Acc, Group, Key, Default)

%% Retain API for counters?

strip(Acc)

%% Further functions...

fenek commented 5 years ago

Discussion conclusions:

Strict setters and getters for predefined information (small set)
to_send element will be accessible with separate functions too (while being optional), to make tracing easier
All mandatory values must be set in constructor.
All remaining values set by constructor are overridable.
Shared properties are immutable
Flexible setters and getters for per-module namespaces [Flexible = get(Acc, NS, Key) instead of get_Key(Acc)]
Module namespaces may be defined as non-strippable
Macros: don’t use them for now; we’ll introduce them if the code looks unreadable

Action item:

Find out which fields should be mandatory.

arcusfelis commented 5 years ago

Often in code we have proplists:get_value(key, List, default). It's useful to have something similar for accs.

fenek commented 4 years ago

Superseded by acc 2.0

esl / MongooseIM

Accumulator refactoring - setters, getters and specs #1859

Changes so far

Concerns so far

TODOs and questions for the future

Properties

Functions

But is it worth it?