elasticio / salesforce-component

elastic.io component that connects to Salesforce API (node.js)
Apache License 2.0

Salesforce BULK API Support #19

Closed: zubairov closed this 4 years ago

zubairov commented 6 years ago

Salesforce has a Bulk API that we could potentially use in the connector. Here is the enhancement request to discuss the advantages, drawbacks and implementation strategies of its support in elastic.io.

jhorbulyk commented 6 years ago

I believe the best way to handle this is to:

A similar problem is addressed in the CSV Write action of the CSV component, and I believe this is a generic problem that should be solved at the platform level rather than the component level.

zubairov commented 6 years ago

@jhorbulyk good idea with the batching component; however, the goal of the Salesforce Bulk API is to work with GBs of data, so a batch of that size is highly impractical (think about the passthrough content too) and would blow past the standard memory allocation of all components.

I believe we should use the same tactic as in the CSV component: accumulate the data elsewhere via streaming (e.g. to steward) and then, based on a time or "number-of-records" trigger, de-stream the data from that location into the Salesforce Bulk API.
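A minimal sketch of that tactic in Node.js, assuming a hypothetical steward client with a `createAppendStream` API (none of these names are real platform APIs):

```javascript
// Hypothetical sketch: stream each incoming record straight to a file on
// steward instead of buffering the whole batch in component memory.
async function accumulate(msg, batchState, steward) {
  if (!batchState.stream) {
    // Lazily open one append stream per batch (hypothetical API)
    batchState.stream = await steward.createAppendStream(batchState.batchId);
    batchState.count = 0;
  }
  // Write one newline-delimited JSON record; memory usage stays flat
  batchState.stream.write(JSON.stringify(msg.body) + '\n');
  batchState.count += 1;
}
```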

jhorbulyk commented 6 years ago

I also believe that putting an entire batch in an AMQP message passed through RabbitMQ would be infeasible and that the engineering solution would be different. However, I am suggesting something like the following:

An action can be configured (in component.json) to either receive a single message or a batch of messages. If a batch of messages is selected then the following will happen:

Consider the following flow: Component A -> Component B where Component B wants to write a batch.

A developer and/or integrator would be able to configure (in either the component.json or in the UI), for Component B, the max batch size and/or the max time between batches.
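For illustration only, such a configuration might look like the sketch below; the `consumes` and `batching` keys are hypothetical, since no such platform feature exists yet:

```json
{
  "actions": {
    "upsertObjectsBatch": {
      "title": "Upsert Objects (Batch)",
      "main": "./lib/actions/upsertObjectsBatch.js",
      "consumes": "batch",
      "batching": {
        "maxBatchSize": 10000,
        "maxTimeBetweenBatchesSec": 300
      }
    }
  }
}
```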

I suggest that the above flow would implicitly be constructed as: Component A -> Mapper A->B -> Batcher -> Component B Batch Action

A message in the above flow would have the following lifecycle:

  1. It would be emitted from Component A
  2. The mapping from A->B would be applied.
  3. The message would arrive at the Batcher.
    1. If no open batch exists, the Batcher will open a batch by creating a new file on steward that it can append to.
    2. The Batcher will write the message to the batch
    3. The Batcher will determine if the batch should be closed. The batch should be closed if
      • The number of messages in the batch = the max batch size
      • The time since the last batch published > max time between batches
    4. If the batch is to be closed, the Batcher will publish a message referencing the batch to AMQP
  4. The Batch action will receive the message. The sailor logic will convert the batch URL into a Generator or Iterator of messages, which would then be passed to user code instead of a message parameter. As the generator/iterator is traversed, messages would be read from Steward (see the sketch below).
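A sketch of how the sailor might expose such a generator, assuming a hypothetical `steward.readStream` API and newline-delimited JSON storage:

```javascript
const readline = require('readline');

// Hypothetical sketch: turn a batch URL into an async generator of messages,
// reading lazily from steward so the whole batch never sits in memory.
async function* messagesFromBatch(steward, batchUrl) {
  const stream = await steward.readStream(batchUrl); // hypothetical API
  const lines = readline.createInterface({ input: stream });
  for await (const line of lines) {
    yield { body: JSON.parse(line) };
  }
}
```

User code would then iterate with `for await (const msg of messages) { ... }` instead of handling a single message parameter.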

I suppose it would also make sense to have some background process that monitored open batches and closed them when the time since the last batch published exceeds the max time between batches.
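Step 3.3 and this background monitor could share one close predicate; a rough sketch, with all thresholds and state fields hypothetical:

```javascript
// Decide whether an open batch should be closed and published to AMQP.
function shouldCloseBatch(batch, cfg, now = Date.now()) {
  return (
    batch.count >= cfg.maxBatchSize ||                        // size limit reached
    now - batch.lastPublishedAt > cfg.maxTimeBetweenBatchesMs // batch too old
  );
}

// Background monitor: periodically close batches that aged out before
// reaching the size limit. openBatches is assumed to be a Map.
function monitorOpenBatches(openBatches, cfg, closeAndPublish) {
  setInterval(() => {
    for (const batch of openBatches.values()) {
      if (shouldCloseBatch(batch, cfg)) {
        closeAndPublish(batch); // publishes the batch URL to AMQP
      }
    }
  }, 1000);
}
```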

zubairov commented 6 years ago

Interesting idea @jhorbulyk, especially the way you suggest handling batches in the sailor; however, there are the following drawbacks:

Mapping propagation drawback

As you have noted, in your sample the Batcher has no metadata of its own but would have to pass through metadata from the next component to the mapper. We don't have this concept right now and can't easily implement it (in user space).

Batch semantics are lost

The idea of creating reusable pieces/components is at the very core of e.io's value; however, with a batcher component as suggested above, the actual batch semantics would be lost. For example, if the underlying component does not know that it is working on a batch, it may not benefit from a third-party bulk API (e.g. the Salesforce Bulk API); think about retries in case of failures, or the ack/nack semantics when consuming incoming messages.

Based on the discussion above, I don't believe we can encapsulate batching functionality in a dedicated component (at least at the moment); therefore we should build reusable batching functionality at a different level (e.g. the library level) and reuse it in batch-oriented actions accordingly.

jhorbulyk commented 6 years ago

Based on the discussion above, I don't believe we can encapsulate batching functionality in a dedicated component (at least at the moment); therefore we should build reusable batching functionality at a different level (e.g. the library level) and reuse it in batch-oriented actions accordingly.

@zubairov The more I think about it, the more I agree.

jhorbulyk commented 5 years ago

Is this blocked by https://github.com/elasticio/projects/issues/140 ?

A3a3e1 commented 5 years ago

As the task is highly complex, it should be investigated.

kirill-levitskiy commented 4 years ago

The feature has been implemented in the PR https://github.com/elasticio/salesforce-component/pull/85