erasmus-without-paper / ewp-specs-api-omobilities

Specifications of EWP's Outgoing Mobilities API.
MIT License

error recovery in update endpoints #31

Closed BavoNootaert closed 3 years ago

BavoNootaert commented 6 years ago

Using the UPDATE endpoints, the receiving institution can POST data to the sending institution (status and components studied).

However, the UPDATE endpoints don't allow the receiver of the POST request to refetch the data that was posted. This makes it hard to recover from errors. Imagine the POST request was processed, the server answered that everything is OK, but due to a bug, the data was not saved or was saved incorrectly. The receiver would have to ask each partner to resend the data. Moreover, if anything goes wrong, there's a high probability that it will have gone wrong for many of the partners that have POSTed updates. Each of these will have to be contacted, and as they will respond at different times, it will be a lot of work to keep track of who has responded and who hasn't.

Conversely, if you POST to many different partners, the probability that some partner will ask you to resend the data increases.

The CNR-based APIs don't have this problem. In such an API, the receiving institution would send a notification that something has changed, and the sending institution would GET the actual data. If there was an error, the sending institution can simply GET the data again from each partner. No e-mails need to be sent, and the problem is easily managed.
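The difference can be illustrated with a minimal in-memory sketch. It assumes nothing about the real EWP endpoints: the class and method names (`Receiver`, `Sender`, `on_update`, `on_notification`, `refetch`) are hypothetical, and plain method calls stand in for HTTP requests.

```python
class Receiver:
    """Receiving institution: stores the suggestions it makes."""

    def __init__(self):
        self.suggestions = {}  # mobility_id -> suggestion text

    def get_suggestion(self, mobility_id):
        # Stand-in for a GET endpoint exposed by the receiving institution.
        return self.suggestions[mobility_id]


class Sender:
    """Sending institution: keeps a local copy of what it has received."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.local_copy = {}
        self.buggy = False  # simulates a release that saves data incorrectly

    def _store(self, mobility_id, text):
        self.local_copy[mobility_id] = "<corrupted>" if self.buggy else text

    def on_update(self, mobility_id, text):
        # UPDATE style: the payload arrives in the request itself.
        self._store(mobility_id, text)
        return "OK"

    def on_notification(self, mobility_id):
        # CNR style: the request only names the changed object; the data is fetched.
        self._store(mobility_id, self.receiver.get_suggestion(mobility_id))
        return "OK"

    def refetch(self, mobility_id):
        # Recovery path that exists only when a GET endpoint is available.
        self._store(mobility_id, self.receiver.get_suggestion(mobility_id))


receiver = Receiver()
receiver.suggestions["m1"] = "drop course X"
sender = Sender(receiver)

sender.buggy = True
sender.on_notification("m1")  # answered OK, but stored incorrectly
sender.buggy = False          # bug fixed in a later release
sender.refetch("m1")          # no partner needs to be contacted
print(sender.local_copy["m1"])
```

In the `on_update` variant there is no counterpart to `refetch`: once the payload has been stored incorrectly, the only copy of the correct data is at the partner, who has to be asked to send it again.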

The actual arrival dates have already been moved to the Incoming Mobility API, which is CNR-based. I think we should also move the remaining UPDATE endpoints.

This issue evolved from #25 and #27.

These issues were started because of problems with sending requests to the UPDATE endpoints asynchronously. As a result of the discussions, it was realized that there is also a problem with error recovery, even if UPDATE requests are sent synchronously. So it affects @erasmus-without-paper/all-members.

Comment https://github.com/erasmus-without-paper/ewp-specs-api-omobilities/issues/27#issuecomment-327413278 contains an argument against using a CNR. It is about possible conflicts when data is changed at both sides. As is explained further in that discussion, we believe this argument is invalid, since the sending institution remains the master of the data.

mkurzydlowski commented 6 years ago

> Imagine the POST request was processed, the server answered that everything is OK, but due to a bug, the data was not saved or incorrectly saved.

We don't see this as a significant argument, as such bugs may occur in many other places and there is no way to defend against all of them. It doesn't seem justified to make changes only for this reason.

Is there a more compelling reason to change the update endpoints to CNR + GET?

If we find one then we might try to come up with a solution, but we feel that we should separate the LA part from the Outgoing Mobilities API first. We will mention it in a different issue if we decide to suggest such a solution.

BavoNootaert commented 6 years ago

The problem is not the bug itself. Indeed, bugs can occur everywhere. It is about how one recovers from it. In all other APIs, the EWP protocol itself can be used, either by GET'ing the data again, or by sending a CN so the client can GET the corrected data (depending on who made the error). That can probably be handled by the IT department or company that developed or hosts the software. End user interaction, if necessary, will happen through the EWP clients.

With the update endpoints, there is no EWP API to recover from the bug, and the partner institutions have to be contacted through some other channel. This requires more time and effort from the end users, who were not responsible for the bug in the first place.

mkurzydlowski commented 6 years ago

Switching to CNR + GET for change suggestions to outgoing mobilities would force clients to store such suggestions in their system. It doesn't seem justified for changes to LA.

As for the nominations part of an outgoing mobility (the status of a mobility), this could be resolved naturally if we defined two status fields for a mobility (see: https://github.com/erasmus-without-paper/ewp-specs-api-omobilities/issues/27#issuecomment-327414513). That way both partners would be the "masters" of one of them.
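A rough sketch of what two status fields with separate masters could look like. The field names and status values here are hypothetical and are not taken from the specification; the point is only that each side may write exactly one field.

```python
from dataclasses import dataclass


@dataclass
class MobilityStatus:
    sending_status: str = "nominated"  # written only by the sending institution
    receiving_status: str = "pending"  # written only by the receiving institution

    def set_status(self, actor, value):
        """Allow each side to change only the field it is master of."""
        if actor == "sending":
            self.sending_status = value
        elif actor == "receiving":
            self.receiving_status = value
        else:
            raise ValueError(f"unknown actor: {actor}")


status = MobilityStatus()
status.set_status("receiving", "accepted")
print(status)
```

Because no field has two writers, each field could be published by its master through CNR + GET without conflicting edits.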

janinamincer-daszkiewicz commented 6 years ago

By the way, why are PUSHes so popular in various network services if, as you claim, they do not support error recovery?

georgschermann commented 6 years ago

We don't have a strong opinion for or against this.

But I think the error recovery would be the same in both scenarios.

When a POST update encounters an error, the sender should try to resend the POST after some time, or notice that the POST was undeliverable (the same as with CNRs). When the receiving host sends a 200 HTTP code for the POST but fails to store it in the local system, then the case is unrecoverable in both scenarios. When the host receives a CNR request and says it got it but is not able to store it locally, the host wouldn't know of any changes and wouldn't try to GET them.

So I think with a correct error handling on both sides, both approaches are recoverable in the same ways.

huntering commented 6 years ago

I've made a powerpoint trying to explain why this is so important and why recovery is not the same in both solutions. See the attached document. There might be other possible solutions but since we already have the concept of CNR+GET, I focused on that solution. 20180518 EWP CNR.pptx

janinamincer-daszkiewicz commented 6 years ago

In my opinion it is crucial that in every data exchange the master (owner) of the data is known and we avoid ambiguity resulting from the existence of many masters.

In all APIs where we support CNR/get, the situation is clear - one of the partners is the owner of the data (master), the other is the consumer of the data (slave).

In LA scenarios it is also the case. The sending institution is the master of the LA. The receiving institution is the consumer, and because of that is not allowed to operate in exactly the same (symmetric) way as the sending institution. The receiving institution can only make suggestions concerning changes in the LA, by doing updates which may be taken into account by the sending institution, but may also be ignored. We do not have to worry so much if once in a while suggestions get lost. We would have to worry much more about unclear ownership of the data.

In particular, not being the master of the data, the receiving institution does not have to store the preliminary versions of the LA nor the suggestions made. The receiving institution just sends an update with the information 'drop this course from your LA since it is given in Polish'. If the incoming student still wants to take this course, good for him. In our implementation we will most probably store only the final version of the LA. We should not be forced to store it only to let the partner get those suggestions from our server.

The APIs reflect this asymmetric relation: the master sends a CNR, the consumer uses get, then the consumer may do updates to send its suggestions.

If we changed updates to CNR+get, the sending and the receiving would behave in exactly the same way, and that would hardly reflect their asymmetric relation. It would be difficult to decide what should be done in case both partners behave in a synchronous way, using CNR/get in parallel.

The error may occur during update but the probability of it is very low. It can also occur during CNR, get and any other API invocation. In all reasonable circumstances we will be able to recover from it.

Update is a kind of PUSH, a very popular way of data exchange supported by REST servers. Why is it unacceptable in the EWP network? Why do applications all over the world not bother so much about the error recovery problem?

Ghent started the discussion with the argument of synchronization in case of multiple servers, then they switched to the error recovery argument, and in the last presentation we go back to the issue of many servers and synchronization again. If multiple servers are the problem, don't use them or solve the problem on your site. Why should all the others adapt to the situation of multiple servers in Ghent?

Summary:

  1. LA needs an asymmetric master-slave relationship between the sending and the receiving institution. The receiving institution should not be forced to send suggestions by CNR/get because it means the necessity to store the data in the local system.

  2. Nomination status is different. We might separate nominations from LA, use two different statuses, for one of them sending would be the master, for the other the receiving would be the master. Thus update for nomination would be replaced with CNR+get.

  3. The deadline passed and it seems that none of the partners supports the change request of Ghent.

Anthony, what is your opinion?

huntering commented 6 years ago

If we provided a service for our users that allows them to manage suggestions concerning the LA, we would not accept that suggestions could get lost. We would not rely on any other institution to support this functionality and would therefore store the data on our side. Managing these suggestions at the receiving should not depend on the API or the service being available at the sending. We might just fall back to sending a notification via e-mail with a link to information about the suggestion. While it is clear that the sending is the master of the LA, the receiving is the master of the suggestions.

I don't understand how suggestions that are delivered via update or get could be easier or harder to process.

If there is still doubt about the fact that error recovery is easier with the get solution, then this is not the right medium and we should set up a conference call. I had hoped that my powerpoint would have tackled that problem, but clearly not.

We use update in our REST APIs all the time, even in EAI like this project. But firstly those APIs are designed to handle the concurrency problem in a different way, and secondly if something goes wrong we are in control of the data and can easily trigger the update again. In the last presentation I still address the recovery issue (look at the "force refresh" on page 10); I just tried to give the full picture.

I'm glad to see that you've agreed to exchange the statuses through CNR/get. I am convinced that we should do the same for suggestions and I wonder if anybody is still in doubt about the fact that recovery is not the same in both scenarios.

Maybe I can give another example with this post. If GitHub sent it to you via e-mail (update), and you lost the e-mail, then you would have to ask me to send it to you again. Luckily the e-mail is just a notification (CN) and you can always go back to this issue to see my message (get). I'm also happy that I, the sender, can reread it and that I don't have to rely on anybody else I sent the message to.

Do we need more examples? Can anybody give an example explaining that the recovery is the same?

georgschermann commented 6 years ago

I still think that it is the same situation. If you lose the CNR and don't know that you should ask for a GET, then you have the exact same situation. If you lose the POST you could still ask for a re-POST, which would usually be possible the same way a re-GET would be possible, since both scenarios need to store the data. A re-POST is also easier to implement in my opinion than a GET on the other side. If you have to write them an email, it doesn't matter if you tell them to GET the information or if you tell them that you re-POST the information for student X.

The issue with multiple servers sending/processing requests is in our opinion also not a problem. In our case, for sending the POSTs a message queue is used the same way as for emails, with error diagnosis, automatic re-sending, notification if still not possible, etc. For request processing, the same procedure as for all other regular requests applies. When multiple servers of our cluster get requests at the same time, they handle this synchronously at the database layer. If they get two POSTs at the same time, they store two POSTs; if they get a hundred, they store a hundred. If they can't store it, they deliver an error to the system trying to POST, where the same queue mechanics as for all other pushes should apply. Since these are suggestions and are not meant to be directly put into productive data, it matters even less. If our user sees multiple change suggestions with different data at the same timestamp, he would have to call/email the sender to resolve this, and ask them why they sent different data at the same time.
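A minimal sketch of the kind of outgoing queue with automatic re-sending described above. It is an illustration only, not the actual implementation: `deliver_with_retries`, `flaky_send` and the retry limit are hypothetical, and a callable stands in for the HTTP POST.

```python
from collections import deque


def deliver_with_retries(messages, send, max_attempts=3):
    """Try to send each message; re-queue failures until max_attempts is reached.

    `send` is any callable returning True on success.
    Returns (delivered, undeliverable).
    """
    queue = deque((msg, 1) for msg in messages)
    delivered, undeliverable = [], []
    while queue:
        msg, attempt = queue.popleft()
        if send(msg):
            delivered.append(msg)
        elif attempt < max_attempts:
            queue.append((msg, attempt + 1))  # retry later, behind the other messages
        else:
            undeliverable.append(msg)  # a real system would notify an operator here
    return delivered, undeliverable


# Simulated partner endpoint: "m2" fails twice before succeeding, "m3" always fails.
failures_left = {"m2": 2, "m3": 99}


def flaky_send(msg):
    if failures_left.get(msg, 0) > 0:
        failures_left[msg] -= 1
        return False
    return True


delivered, undeliverable = deliver_with_retries(["m1", "m2", "m3"], flaky_send)
print(delivered, undeliverable)
```

Such a queue covers delivery failures that the sender can observe. It does not cover the case raised earlier in the thread, where the receiver answers with success but stores the data incorrectly.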

huntering commented 6 years ago

Thanks for your reply. I'm trying to get to the bottom of this so please bear with me.

How is asking for a re-POST the same as a re-GET? Doing a re-POST would require you to pick up the phone, contact the other institution and ask to do a manual action to trigger an update. Doing a re-GET doesn't require you to contact anybody? That you can do without help of the other institution, right?

Please explain?

georgschermann commented 6 years ago

How would your users decide to do a (re-)GET when the original CNR failed the same way a POST could/would fail? Would they just randomly/periodically issue GET requests in the hope that something has changed? Asking for a re-POST would be just telling the other HEI to repeat their button click, which is implemented in both scenarios (it sends either a POST or a CNR). But in a scenario where either the POST or the CNR gets completely lost, none of the participants would know to ask for a re-POST or to issue a GET.

So in my opinion the error scenarios are identical. Either the initial request is completely lost and no party would know to do anything, or the initial request fails but is identified as a failure on one side, and this side either issues a GET or asks for a re-POST (sending HEI), or issues a re-POST or asks for a re-GET (receiving HEI), which would make no difference in the outcome.

huntering commented 6 years ago

Thanks for your reply. Your answer shows we are still not on the same page here.

In our case end users are not bothered with error recovery. The system handles the communication with the partner institution completely transparently. If there is an outage for a certain period of time at the partner institution, our end users will never notice.

I'm talking about error recovery done by the technical team. (Maybe for you the technical team are also users?) Suppose at a certain moment in time somebody detects that the latest release of the software contains an error. Since that latest release 1000 mobilities have been processed and the suggestions stored are wrong. These 1000 mobilities are spread across 200 partner institutions. In the CNR/Get scenario, the technical team could write a script that simply iterates over the 1000 mobilities, gets the correct data and fixes the issue. In the update scenario it would require 200 institutions to be contacted with the request to replay the update?
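The kind of repair script meant here could look roughly like the following sketch. It assumes a GET endpoint for suggestions exists at each partner, which is exactly what is being proposed and is not part of the current specification; `repair`, `fake_get` and the in-memory data are hypothetical stand-ins.

```python
def repair(mobilities, fetch, store):
    """Re-fetch every affected mobility from its partner and overwrite the local copy.

    mobilities: iterable of (partner_id, mobility_id) pairs affected by the faulty release.
    fetch: callable (partner_id, mobility_id) -> data; stands in for a GET request.
    store: dict acting as the local database.
    Returns the pairs that could not be fetched, so they can be retried later.
    """
    failed = []
    for partner_id, mobility_id in mobilities:
        try:
            store[(partner_id, mobility_id)] = fetch(partner_id, mobility_id)
        except LookupError:
            failed.append((partner_id, mobility_id))
    return failed


# Simulated partner data: 200 partners with 5 mobilities each (1000 in total).
partner_data = {
    (f"partner{p}", f"mob{p}-{m}"): f"suggestion {p}-{m}"
    for p in range(200)
    for m in range(5)
}
local_db = {key: "<wrong>" for key in partner_data}  # state after the faulty release


def fake_get(partner_id, mobility_id):
    return partner_data[(partner_id, mobility_id)]


failed = repair(list(local_db), fake_get, local_db)
print(len(local_db), failed)
```

The loop needs no action from any of the 200 partners; anything returned in `failed` can simply be retried once the partner is reachable again.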

How would you propose to handle this situation?

georgschermann commented 6 years ago

I see your point.

I think the end users have to be bothered with error recovery to some extent. Besides your scenario there could be hosts which don't implement the Update APIs, or implement them but have them (permanently) unavailable. At some point the user would have to be notified that updates cannot be sent to an HEI or would have to be communicated on a different channel than EWP.

In the scenario described by you, we would most probably be able to recover from the generated data/logs. An error so severe that all incoming POSTs are discarded to the extent where they are not recoverable is very unlikely in my opinion. In our case the POSTs are stored as a separate data source (since they are only suggestions) and information which is displayed to the users is generated from the raw data of the POSTs.
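A small sketch of this approach, where the raw POST payloads are kept as a separate source and the displayed data is derived from them. The payload fields (`mobility_id`, `comment`) and function names are hypothetical, chosen only for illustration.

```python
raw_posts = []  # append-only log of every received payload, stored exactly as sent


def receive_post(payload):
    raw_posts.append(dict(payload))
    return 200


def build_view(interpret):
    """Derive what users see from the raw log; later payloads override earlier ones."""
    view = {}
    for payload in raw_posts:
        view[payload["mobility_id"]] = interpret(payload)
    return view


def buggy_interpret(payload):
    return payload["comment"][:4]  # bug: truncates the suggestion


def fixed_interpret(payload):
    return payload["comment"]


receive_post({"mobility_id": "m1", "comment": "drop course X"})
receive_post({"mobility_id": "m2", "comment": "add course Y"})

print(build_view(buggy_interpret))  # view produced by the faulty release
print(build_view(fixed_interpret))  # view rebuilt from the same log after the fix
```

This recovers from bugs in interpreting the payloads, as long as the raw log itself was written correctly; it does not help if the fault is in storing the raw POST.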

Implementing new APIs to mitigate implementation faults is not feasible, I think. This would affect other APIs the same way. What if the oMobilities API is implemented wrong for a few weeks or so and delivers wrong IDs?

huntering commented 6 years ago

Suppose the oMobilities API delivered wrong IDs; then the solution should be to fix the issue and generate the necessary notifications to let the other institution know about the corrected data.

We know that the current protocol does not solve all the issues that can arise and we are convinced that there are other modifications possible to improve it. However, in this case we are just trying to convince everybody that the CNR/get scenario is better than the update scenario.

h3xw0rm commented 6 years ago

After analysing this issue and thinking about it, here at UPORTO we think that:

What Gent says is true: in the current implementation, if the server has some kind of fault and loses the updates, there is no possibility of recovering without contacting every institution directly by email or phone. It is impossible for us to implement a script that automatically recovers data in the network. What georgschermann and others have been saying is also true: this information is neither sensitive nor a priority, and the server should handle these situations, for example by storing the POST data for some time for future recovery. We think everyone agrees that the CNR/Get scenario is better and we also think that is why we changed some APIs to it. However, we believe that it is not feasible to change the scheme at this point, given all the work and modifications that still need to be made and the discussions that we've all been following, and bearing in mind that partners need stable solutions to implement and the current versions are already defined and working well.

janinamincer-daszkiewicz commented 6 years ago

> We think everyone agrees that the CNR/Get scenario is better

I do not agree that it is better in the case of LA, or that this is why we changed some APIs to it. We changed only those in which it was possible to indicate the master of the piece of data exchanged. Otherwise I agree with the given arguments.