NBISweden / LocalEGA

Please go to https://github.com/EGA-archive/LocalEGA instead
Apache License 2.0

Check with ES team what is trigger message from central EGA for file ingestion #31

Closed jhagberg closed 7 years ago

silverdaz commented 7 years ago

Answer from Jordi:

Once a file is referred, it is assigned status 0 or "New".
A periodic process fetches "New" files in batches, using the database as the channel.

For EGA 2.0 (microservices) we are sending messages to a queue
so that all interested microservices get notified that
a new process is starting/requested.

[...]
I'm not happy with a mechanism that relies on a specific endpoint being up
and listening for a process that *should* happen and
that does not require synchronicity to proceed.
However, we can use the Endpoint mechanism temporarily.

In other words, Jordi prefers asynchronous communication: a message is dropped somewhere (e.g. in a message broker like RabbitMQ), and whoever is interested picks it up.

A solution for us (as a trigger mechanism) is to connect to their message broker directly, using a certificate for authentication, and pick up a message if it is flagged with "Sweden" or something similar.
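A minimal sketch of the two ingredients mentioned here, using only the Python standard library. It assumes the broker is reached over TLS and that the "Sweden" flag appears as one dot-separated word of a routing key; the thread does not specify how messages are flagged, so `is_for_us` and its `flag` parameter are purely illustrative, as are the function names and file arguments. The resulting context could then be handed to whichever AMQP client library is chosen.

```python
import ssl
from typing import Optional


def make_tls_context(cafile: Optional[str] = None,
                     certfile: Optional[str] = None,
                     keyfile: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the broker and,
    if a certificate is given, presents it for authentication."""
    # SERVER_AUTH means "we are a client verifying a server";
    # certificate verification and hostname checking are enabled by default.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    if certfile is not None:
        # Our own certificate (and private key), used to authenticate to the broker.
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


def is_for_us(routing_key: str, flag: str = "sweden") -> bool:
    """Hypothetical filter: accept a message when the flag is one of the
    dot-separated words of its routing key (assumed convention)."""
    return flag in routing_key.lower().split(".")
```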

silverdaz commented 7 years ago

If we connect to their message broker and pick up the messages of interest, we should consider keeping several connections open for fault tolerance, which covers the case where one connection is lost. We could pick up each message and put it in our own internal message broker (even modifying it if necessary). The different connections must not duplicate the message.
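The "no duplicates across connections" requirement could be sketched as follows. This is only an illustration under assumptions: each message is taken to carry a unique identifier (here the `messageId` header field from the draft message later in this thread, assuming the sender populates it), and a `queue.Queue` stands in for our internal broker. The class name `DedupForwarder` and the `max_remembered` bound are made up for the example.

```python
import queue
import threading
from collections import OrderedDict


class DedupForwarder:
    """Forward each message to an internal queue at most once."""

    def __init__(self, internal_queue, max_remembered=10000):
        self._queue = internal_queue
        self._seen = OrderedDict()     # ids already forwarded, oldest first
        self._max = max_remembered     # bound on how many ids we remember
        self._lock = threading.Lock()  # several connections may call forward()

    def forward(self, message):
        """Return True if the message was forwarded, False if it was a duplicate."""
        message_id = message["header"]["messageId"]
        with self._lock:
            if message_id in self._seen:
                return False
            self._seen[message_id] = True
            if len(self._seen) > self._max:
                self._seen.popitem(last=False)  # forget the oldest id
            self._queue.put(message)
        return True
```

Every connection handler would call `forward()` on the same `DedupForwarder` instance, so a message delivered over two connections ends up in the internal queue once.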

Note: I think RabbitMQ has functionality to handle that, such as routing to the right place combined with queue filtering. Worth investigating.

silverdaz commented 7 years ago

Note from Jordi: they could send a message to our message broker! If we don't have one, because we go for a simple multi-threaded implementation, then we could create one with a single purpose: a mailbox for incoming requests. (RabbitMQ might be overkill, and ZeroMQ might suffice.)

If we have one, then we need to "secure" it. I prefer to let them do that, out of simple laziness, and also so that they get the blame if the broker is compromised. Nasty but nonetheless true.

In short: I prefer to pick a message from their queue, using a certificate for authentication.

silverdaz commented 7 years ago

Update: Jordi suggests hooking up both message brokers, one in Spain and one in Sweden, in a federated manner. That way, they could route messages to the Swedish queue when necessary.

Juha is concerned about separating the components, so that if one is upgraded, we won't have issues with dependencies or incompatibilities.

It's an important point, and the solution lies in AMQP (the message broker protocol). We're fine as long as both components, upgraded or not, speak AMQP.

Implementing code that handles message exchange between message brokers is, to me, redundant. That code would sit and wait for messages to drop into the queues and forward them to other queues. It does, however, introduce an extra layer, which is good for security.

We are still very much discussing this in the LocalEGA Slack.

silverdaz commented 7 years ago

I am in favor of hooking our MQ to their MQ (including SSL communication).

That would be step 1. If this doesn't work, we then reassess and consider a backup plan with REST API calls.

Note: I already have some code for the REST API, so there is no extra effort to account for here. It's just that if we go for the linked-MQ solution, that code then lies dormant.

silverdaz commented 7 years ago

Ok, after discussing with Oscar, I have bad and good news.

Bad news: CRG does not handle submissions, so there is no message containing submission data, like file and checksum paths.

Good news: We designed one.

The message, as a first try, will look like this. It is JSON-formatted:

{
 "@class" : "eu.crg.ega.microservice.dto.message.WorkFlowCommandMessage",
 "header" : {
   "format" : null,
   "producer" : {
     "host" : null,
     "ip" : null,
     "application" : "workflow",
     "processId" : null,
     "userId" : null
   },
   "messageId" : null,
   "conversationId" : null,
   "idInSequence" : null,
   "timestamp" : null,
   "millisecToExpiration" : null,
   "replyTo" : null
 },
 "messageType" : "COMMAND",
 "command" : {
   "commandType" : "WORKFLOW",
   "version" : "v1"
 },
 "parameters" : null,
 "submissionId":"asdasd-asdasd-asdasd",
 "encryptedHash":"fdfssdfsdf",
 "unencryptedHash":"adasdas",
 "hashAlgorithm": "md5",
 "file": "/path/to/file",
 "filesize":1231313,
 "fileUpdatedTimestamp":"",
 "fileCreatedTimestamp":""
}

It contains some extra things I frankly don't care about, but Oscar said they will matter later. So, fair enough, we keep them.

All I care about is:

| Parameter | Comment |
|---|---|
| submissionId | Gotten from Central EGA |
| encryptedHash | he wants to send as string, not a path to a file containing the hash string |
| unencryptedHash | idem |
| hashAlgorithm | which I made him add, in case we want to handle other/better algorithms |
| file | a path relative to the inbox |
| filesize, fileUpdatedTimestamp, fileCreatedTimestamp, etc... | for logging, I guess |
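As an illustration of how the fields above might be consumed, here is a small sketch that parses the JSON message, keeps only the parameters of interest, and checks a hash using the algorithm named in `hashAlgorithm`. It assumes the hash strings are hex-encoded digests (the sample values in the message are placeholders, so the encoding is not actually specified), and the helper names `extract_fields` and `checksum_matches` are invented for the example.

```python
import hashlib
import json

# The parameters of interest; the rest of the message is ignored here.
WANTED = ("submissionId", "encryptedHash", "unencryptedHash", "hashAlgorithm", "file")


def extract_fields(raw_message: str) -> dict:
    """Parse the JSON message and keep only the fields we care about."""
    message = json.loads(raw_message)
    missing = [name for name in WANTED if name not in message]
    if missing:
        raise ValueError(f"message is missing fields: {missing}")
    return {name: message[name] for name in WANTED}


def checksum_matches(data: bytes, expected_hex: str, algorithm: str) -> bool:
    """Compare the digest of data with the expected hex string (assumed encoding)."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.lower()
```

For real files one would feed the hash object in chunks rather than load the whole file into memory; the sketch takes `bytes` only to stay short.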

I raised a concern about the file path being relative to the inbox, in case we set up several inboxes (like a plain FTP one, and another one for Bianca, for example).

The latter is part of issue #5 related to the inbox setup. Not the concern here.
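Although the inbox layout belongs to issue #5, the concern can be made concrete with a short sketch: with several inboxes, a relative path only makes sense once an inbox is chosen. Everything here is hypothetical, including the inbox names, the root directories, and the idea that the message (or its routing) tells us which inbox applies. The leading slash is stripped because the sample message uses `/path/to/file` while describing the path as relative.

```python
from pathlib import Path

# Hypothetical inbox roots; names and locations are placeholders.
INBOXES = {
    "ftp": Path("/srv/inbox/ftp"),
    "bianca": Path("/srv/inbox/bianca"),
}


def resolve_in_inbox(inbox_name: str, relative_path: str) -> Path:
    """Join a message's file path to the chosen inbox root, refusing escapes."""
    root = INBOXES[inbox_name].resolve()
    candidate = (root / relative_path.lstrip("/")).resolve()
    # Reject paths such as "../bianca/x" that would leave the inbox.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"{relative_path!r} escapes inbox {inbox_name!r}")
    return candidate
```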

viklund commented 7 years ago

The fileUpdatedTimestamp and fileCreatedTimestamp fields are useful if they resubmit a few files with the same names, so that we (or the user) can distinguish them (though I don't understand why we need both).
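One possible way to use the timestamps for that purpose is sketched below: for each file path, keep the message with the newest `fileUpdatedTimestamp`. The thread leaves the timestamp format unspecified (the sample message has empty strings), so ISO 8601 strings such as `2017-05-02T10:00:00` are an assumption, and "newest wins" is only one of several policies that could be chosen.

```python
from datetime import datetime


def latest_submissions(messages):
    """Keep, for each file path, the message with the newest fileUpdatedTimestamp."""
    latest = {}  # file path -> (parsed timestamp, message)
    for message in messages:
        # Assumes ISO 8601 timestamps without a trailing "Z".
        updated = datetime.fromisoformat(message["fileUpdatedTimestamp"])
        current = latest.get(message["file"])
        if current is None or updated > current[0]:
            latest[message["file"]] = (updated, message)
    return {path: message for path, (_, message) in latest.items()}
```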

silverdaz commented 7 years ago

We still need a discussion to settle all those bits, in particular what we need in the message. I'll leave it for the moment and work on some JSON that I concoct myself.