civicrm / org.civicrm.api4

CiviCRM api version 4

Design Principles: add "Massively scalable" #18

Closed ErichBSchulz closed 7 years ago

ErichBSchulz commented 7 years ago

So I think this means caching, asynchronous operation, and support for parallelisation - with an eye to tools like RabbitMQ and Redis to break beyond the constraints of a LAMP stack.

But I really have no detailed feel for what this looks like... but the need to add this principle is one of the biggest concerns I have with rushing to an MVP (#13), because I am worried that it may impose limitations, or need patterns that require breaking earlier patterns.

(edit: the vision is to support DBs with a million+ contacts and busy nation-wide teams - these organisations are currently turning to commercial alternatives and taking with them the resources that could fund dev work on civi!)

eileenmcnaughton commented 7 years ago

Like I mentioned on the other ticket - with the test contract in place we have quite a lot of scope to keep working on improving the inner workings.

I think there has been some discussion of offering services for caching etc - great ideas! I would still leave it for phase 2

ErichBSchulz commented 7 years ago

my concern is that we may discover things in the API that don't scale well and that are hard to wrap without breaking the contract.

my reading of massive scalability suggests that supporting asynchronous operations is vital, as is the ability to access caches. It is possible I'm jumping at shadows here out of ignorance, but this (API4) just feels like a good time to pause, think, and make sure we don't "enter into contracts" that are going to make progress harder. I guess it is the need to support asynchronous operation that has me most apprehensive... ignorance breeds fear :-)

eileenmcnaughton commented 7 years ago

The contract deals with inputs & outputs & signature. So caching can change without the contract changing. Paging is also something where the inputs & outputs should be able to be static & predictable. Currently you pass in a combination of 'limit' & 'offset' for paging - can you think of situations that cannot be described by that?
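For illustration, that limit/offset contract maps onto page-style navigation with a one-line helper (the helper name here is made up, not part of any API signature):

```javascript
// Hypothetical helper: map a 1-based page number and page size onto the
// 'limit' + 'offset' options the current API signature already accepts.
function pagingOptions(page, pageSize) {
  return { limit: pageSize, offset: (page - 1) * pageSize };
}

// e.g. the third page of 25-row pages:
console.log(pagingOptions(3, 25)); // → { limit: 25, offset: 50 }
```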

I can see queuing as api actions - both to set up queues & to process them - but I think those are extra functions around other actions, no?

async calls - not sure - we currently do that by having an ajax interface with the api. It feels like something you build on top of an api.

ErichBSchulz commented 7 years ago

tbh all this stuff feels very woolly in my head - I feel a bit blind and all I can sense is shadows... and a memory of stubbed toes in other lives

it may be bigger/more subtle than breaking the contract

if we build an API with no sense of queuing, async & caching (QAC ??) (heh) then extension (and core) devs won't be thinking about how to write code that fits into these patterns - which will mean a bunch of stuff will just not scale

this reminds me of "writing code that is testable" - it has taken me 3 years to really come to grips with what "writing code that is testable" means. I spent forever writing code that was Not Unit Testable - I had no concept of how to do it. I think I do now (and after a few years it feels very natural), but retrofitting tests onto my early code still wasn't possible without some hairy refactoring...

I'm thinking that creating patterns that handle QAC (heh) properly is going to be similar (or maybe not??)

I've discovered a few anti-QAC patterns but I suspect there are more:

  1. when doing a get for example - do we need an up to the second data set or can we use the one from 5 minutes ago? or the one from the overnight cron, or last weeks set? or last months?
  2. when getting a report can we accept that this is always async and specify a priority?
  3. when specifying gets I discovered that in order to make the results cacheable they have to be specified in absolute terms, not relative - ie "in last 3 weeks" doesn't cache nicely but "between 20160108 and 20160201" does... "my contacts" doesn't cache, but "contacts allocated to #78240" does cache
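To make the third point concrete, here is a hedged sketch (every name invented for illustration) of normalising a relative date spec into absolute terms before it becomes part of a cache key:

```javascript
// Hypothetical sketch: convert "last N weeks" into an absolute yyyymmdd
// range so the query spec - and hence the cache key - is stable.
function absoluteRange(weeksBack, now) {
  const DAY = 24 * 60 * 60 * 1000;
  const start = new Date(now.getTime() - weeksBack * 7 * DAY);
  const fmt = d => d.toISOString().slice(0, 10).replace(/-/g, '');
  return { from: fmt(start), to: fmt(now) };
}

// Two callers asking for "last 3 weeks" on the same day now produce the
// same absolute spec, so they can share one cache entry:
// absoluteRange(3, new Date(Date.UTC(2016, 1, 1)))
//   → { from: '20160111', to: '20160201' }
```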

I know QAC patterns exist already in CiviCRM (mailings, smart-groups, ACL cache) - I'm wondering if we can pause, abstract, consider best practice ('cause I have no clue), then make it the foundation for V4 (so that even if not fully implemented, the API signature will allow scaling)

if we specify it as a goal we can help flush out the expertise, and people can help point out the gotchas??

eileenmcnaughton commented 7 years ago

Those are good examples.

In terms of the api signature they feel like 'add-ons' - we can add new parameters onto the api signature at any time.

In terms of how to actually implement them - I assume point 3 refers to mysql caching? I'm not quite sure though - can you explain that point further.

Re point 2 you mean a priority for mysql - like setting the priority on a query. That feels like it might be quite easy - like for the api there would always be an option for 'mysql_priority' - it would default to normal but if set it would be appended to the mysql statements.

Point 1 - once again at the api level we would support an option that describes the caching and it would be appended to the get query code - obviously the trickier part here is building the caching mechanism - but adding it to the api signature doesn't seem so hard
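A minimal sketch of what honouring such an option might look like inside the kernel - the option name `max_cache_age`, the Map-backed store, and everything else here are invented for illustration, not a proposed implementation:

```javascript
// Illustrative in-memory cache keyed by entity + params.
const cache = new Map(); // key → { at: timestamp(ms), value }

// 'max_cache_age' is in seconds; 0 means "always fetch fresh".
// 'fetcher' stands in for the real get-query code.
function cachedGet(entity, params, maxCacheAge, fetcher, now = Date.now()) {
  const key = entity + ':' + JSON.stringify(params);
  const hit = cache.get(key);
  if (hit && maxCacheAge > 0 && now - hit.at <= maxCacheAge * 1000) {
    return hit.value; // stale-but-acceptable result
  }
  const value = fetcher(entity, params); // fall through to the real query
  cache.set(key, { at: now, value });
  return value;
}
```

The point is that the caller only adds one option; the choice of cache backend stays internal to the kernel.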

ErichBSchulz commented 7 years ago

My concern with having it as an add-on is that 95% of extension coders and maybe 50% of core devs will use the simplest options - they won't specify a "max cache age", and, probably more importantly, they'll avoid async calls, since async calls need a completely different pattern and are, of course, a real bending PITA until you hit the pattern. (Not that I'm suggesting for a second that async calls need to be standard, but bringing in the option will enable a long journey to begin.)

Caching could use any cache engine eventually - and in a docker-cluster microservice constellation, deploying queue and cache operations starts to look quite feasible - using MySQL (because it is available, not because anyone recommends using an RDBMS as a cache), Redis, the file system or any other cache storage.

But coming back to API4 - the point with this case was to consider "do we want to put some effort into making an API that will scale to millions of records and 100s/1000s of concurrent users"? - ie a one-liner patch in the Design Principles section.

If this is something we think is worth exploring then the next step would be to consider what this means for the API signature - and how an async interface could work (hopefully this work could draw on non-core team members' expertise)

eileenmcnaughton commented 7 years ago

"would be to consider what this means for the API signature - and how an async interface could work"

Yes - that is the key - does the signature need to change? I can't personally see how it would, which is why I haven't worried so much about that side of it. I agree that making the new caching mechanism opt-in means it would not be so used at first - but since I expect it to go through a more experimental phase I think that is OK.

Can you explain how you see the signature looking different for these things?

ErichBSchulz commented 7 years ago

so for adding queuing with priority, and accessing a cache, this would just be a case of some extra parameters

and adding (priority, max_cache_age) to the signature is harmless - but should these be part of the API from the outset? - even if they are ignored by the kernel in the first iteration?

but in order to queue, that means async calls - so the calling function needs a callback or to return a promise
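A hedged sketch of that shape - the call returns a promise immediately, and a worker settles it later. All names here are illustrative, not a proposed signature:

```javascript
// Illustrative in-process queue; in practice this could be RabbitMQ, etc.
const queue = [];

// The api call enqueues the job and hands back a promise straight away.
function apiCallAsync(entity, action, params) {
  return new Promise((resolve, reject) => {
    queue.push({ entity, action, params, resolve, reject });
  });
}

// A worker (cron job, queue consumer, …) drains the queue later and
// settles each caller's promise with the handler's result.
function drainQueue(handler) {
  while (queue.length) {
    const job = queue.shift();
    try {
      job.resolve(handler(job));
    } catch (e) {
      job.reject(e);
    }
  }
}
```

The caller's code looks the same whether the work runs now or minutes later - that is the pattern shift the signature would need to allow for.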

googling... this, and this and this

I know nada about RabbitMQ but it smells good and has PHP SDKs - I need to read more...

but key question for this issue is "is this something we feel we need to think about more?"

if we decide yes we can update the readme then splinter off issues for "prioritisation", "caching" and "queuing/asynchronous operations" - and maybe others?

eileenmcnaughton commented 7 years ago

"but should these be part of the API from the outset? - even if they are ignored by the kernel in the first iteration?"

I guess the reason you would consider adding them from the start is if they were going to be implemented in such a way that they would not be backwards compatible if implemented later. Probably by the time it is in beta that will become clearer.

"but in order to queue then that mean async call - so that mean the calling function needs a call back or to return a promise"

I think maybe you need to spend some time looking at the js api wrapper we have. That does include callbacks & promises and works well for async calls.
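The bridge pattern such a wrapper uses looks roughly like this - `legacyApi` below is a made-up stand-in for the real transport, not an actual CiviCRM function:

```javascript
// Made-up stand-in for a callback-style api transport.
function legacyApi(entity, action, params, done) {
  done(null, { entity, action, values: [] }); // pretend success
}

// Wrap the callback style in a promise so callers can chain .then()/.catch().
function apiPromise(entity, action, params) {
  return new Promise((resolve, reject) => {
    legacyApi(entity, action, params, (err, result) =>
      err ? reject(err) : resolve(result));
  });
}

// Usage: apiPromise('Contact', 'get', {}).then(result => { /* … */ });
```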

JoeMurray commented 7 years ago

Microservices architectures are designed to massively scale; currently CiviCRM is not. Certain assumptions get baked in when you want a system to scale massively: you'll likely need to be able to separate your reporting into a different database than the one used for writes. You might also need to use a data pump pattern from the db system used for writes to one used for reads, leading to a lag of seconds to minutes. You might want to use a db system that scales out rather than up, leading to a choice that accepts only eventual consistency in the database copies. There is much more here that I don't know.

@ErichBSchulz I am generally supportive of the direction of your thinking, but I would prefer to iterate more frequently rather than trying to overplan in the absence of strong expertise or a clear sense of requirements. Put the nice-to-haves and the things needing more research into a future-version feature list / roadmap. A potentially massive amount of work for vaguely articulated principles that may not be relevant to our use cases (e.g. we aren't designing for billion-contact databases) isn't going to get buy-in. On the other hand, crisp explanations with a viable path to implementation and a clear benefit for devs or users will likely be accepted warmly.

ErichBSchulz commented 7 years ago

so to clarify the intent here, at this stage I'm just suggesting a one-line patch to the README so we put scalability on the radar... maybe the design principle needs to be "more scalable than V3" but that is horribly "meh".

Surely if we stick a "we would like to be scalable" sign on a piece of cardboard and sit around looking forlorn someone will flip us some pointers?

Maybe my experience with CiviCRM across two different organisations of different sizes is unusual and scalability isn't a problem for other installs...

@eileenmcnaughton yes I'm aware the RESTful API will be async - but my sense (?fear) is that simply popping a standard async wrapper around a vanilla API call may fail to achieve full scalability (ie the async patterns need to be part of the architecture and available at the PHP layer?)

again I don't know what this looks like practically - just wanting to pop a single line in the README...

eileenmcnaughton commented 7 years ago

OK - so if you want to put these things in a readme then perhaps create the issue as a PR - that would make it clearer what you want out of it.

Time to alter the readme & close this, I think

michaelmcandrew commented 7 years ago

I emailed Xavier about this as I know he has experience overloading civicrm's api via async js requests. His response: "in a nutshell: does not work. mostly the problem of the BAO/DAO".

He might be up for providing more insights / suggestions for ways around the problem...

tttp commented 7 years ago

Hi, so to test and reproduce the problems, you can use the nodejs module https://www.npmjs.com/package/civicrm. If you run with a concurrency level > 1 and create several contacts (eg. with an employer) you will end up with errors and deadlocks: https://civicrm.org/blog/xavier/civicrm-nodejs-helps-you-make-friends. I'm not sure how to prevent that issue with api v4, as the issue is at the BAO level if I recall, but it's been a while since I experienced the problem, so the details need to be double-checked.
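Without relying on that module's exact API, the shape of the reproduction is simply many concurrent Contact.create calls against the v3 REST endpoint. Everything below - the URL, keys, and params - is a placeholder sketch, not working credentials:

```javascript
// Build one v3 REST Contact.create request URL; all values are placeholders.
function buildCreateUrl(base, i) {
  const params = new URLSearchParams({
    entity: 'Contact', action: 'create', json: '1',
    key: 'SITE_KEY', api_key: 'USER_KEY',
    contact_type: 'Individual', first_name: 'Load' + i,
    employer_id: '1', // the employer write is where the contention showed up
  });
  return base + '?' + params.toString();
}

// Firing these with concurrency > 1 is what surfaced the deadlocks:
// Promise.all(Array.from({ length: 10 }, (_, i) =>
//   fetch(buildCreateUrl('https://example.org/civicrm/extern/rest.php', i),
//         { method: 'POST' })));
```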

X+

eileenmcnaughton commented 7 years ago

That's to do with the cache update issue, i.e. every time you edit a contact it tries to flush the group contact cache & acl_contact_cache (the former can be disabled, the latter not yet)

xurizaemon commented 7 years ago

This comment has stuck with me for a few days.

I feel uncomfortable about the API exposing cache flags / switches. Caching feels like something which should be internal to an application, not exposed to clients.

Perhaps using the API both internally and externally places conflicting demands on the API design here?

IMO external clients shouldn't ever have access to outdated data, so why permit them to dictate what caches are applied to results they retrieve? API calls from within CiviCRM might have a need to set cache flags, but that still seems weird to me. API calls from outside CiviCRM should not be permitted to specify cache details - they should be supplied correct data at request time, surely?

The ways in which CiviCRM caches data are a bigger conversation (seems there are different places we cache things and different reasons), and I think those are factors in why we might feel a need to expose cache flags via API. But that seems the wrong motivation.

Are there any examples of massively scalable APIs which expose the ability to set cache flags to clients? That might help me understand the rationale for exposing same in CiviCRM.

ErichBSchulz commented 7 years ago

hi @xurizaemon - just to acknowledge I've read and heard this. I think caching needs a bit of unpicking still which is why I've opened another case ( #48 ). I think we may have moved on a bit since you posted this but I hear your concerns!