sidharthramesh opened this issue 1 year ago
@sidharthramesh
Replace `validator.validate(flatComposition, template);` with `validator.validate(flatComposition, this.webTemplate);`
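The speedup comes from reusing the already-parsed `WebTemplate` instead of having it re-derived from the operational template on every call. A minimal sketch of the caching pattern, assuming the SDK's `OPTParser` and `CompositionValidator` (package and class names are from the SDK of that era and may have moved since); `operationalTemplate` and `compositions` are assumed to be in scope:

```java
import org.ehrbase.validation.CompositionValidator;
import org.ehrbase.webtemplate.model.WebTemplate;
import org.ehrbase.webtemplate.parser.OPTParser;

// Parse the operational template once and cache the WebTemplate;
// re-deriving it for every composition is what made validation slow.
WebTemplate webTemplate = new OPTParser(operationalTemplate).parse();
CompositionValidator validator = new CompositionValidator();

for (var composition : compositions) {
    // Validate against the cached WebTemplate; returns constraint violations.
    var violations = validator.validate(composition, webTemplate);
    if (!violations.isEmpty()) {
        // handle or collect the validation errors
    }
}
```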
@sidharthramesh Also you might want to take a look at https://github.com/ehrbase/performance-test/blob/main/src/main/java/org/ehrbase/webtester/service/LoaderServiceImp.java
cc: @vidi42
@stefanspiska thank you for the quick reply. With just that 1 change, it's already much faster:
- 1274 compositions - 87.4 s
- 1108 compositions - 56.7 s
- 631 compositions - 37.1 s

Total: 181.2 s (3.6x faster), averaging around 60 ms per composition - so much better. But still technically the bottleneck xD.
I'll look at the Performance Test Loader and see if we can get it down to around 30 ms.
For context, we're building a NiFi processor that can ingest compositions in bulk after multiple other ETL pipelines.
@sidharthramesh
I do not think what you are trying to do is a good idea. It's one thing to directly insert data to generate a test set or an initial import, but EHRbase is built on the assumption that it has exclusive write access to the data. The DB structure can and will change. You might also hit concurrency and integrity issues. And in the end, in the best case, you just replicate our service layer.
If you need a batch that runs in one transaction, you can do that via the contribution endpoint (now supported in the SDK). If you want throughput, your pipeline needs to send requests in parallel. (Ideally the contribution service would use parallel processing, but this is not yet implemented.)
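Purely as an illustration of the parallel-request approach, a sketch using the JDK's built-in HTTP client against the openEHR REST contribution endpoint; the base URL is a placeholder, and `ehrId` and `contributionJsons` (one JSON contribution payload per batch) are assumed to exist:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;

HttpClient client = HttpClient.newHttpClient();
String base = "http://localhost:8080/ehrbase/rest/openehr/v1"; // placeholder

// Each contribution commits several versioned objects in one transaction;
// sending contributions concurrently buys throughput on the client side.
List<CompletableFuture<HttpResponse<String>>> futures = contributionJsons.stream()
        .map(json -> HttpRequest.newBuilder()
                .uri(URI.create(base + "/ehr/" + ehrId + "/contribution"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build())
        .map(req -> client.sendAsync(req, HttpResponse.BodyHandlers.ofString()))
        .toList();

futures.forEach(CompletableFuture::join); // wait for all batches to finish
```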
And finally, if you do not want the REST overhead, some other protocol could be added via a plugin, but plugins are a beta feature right now.
cc @birgerhaarbrandt , @vidi42 , @HolgerReiseVSys
Hey @stefanspiska,
I understand my solution is hacky, and yes, I totally expect the database schema to change over time. The points you made about concurrency and integrity are also bothering me now, and it's probably best to seek a proper solution to this - it will come in handy for many clients.
There are 2 key requirements for doing ETL well - idempotency and batching.
We tried using the EHRbase REST API first, but it didn't meet these requirements:
Configuration information
Steps to reproduce
I'm trying to directly load compositions into a Postgres database using the SDK. The data is in the Simplified Flat format, and this needs to be validated and converted into the Database Native format.
The input data is a JSON array of multiple compositions (batches of 1000) that look like this:
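(The original example is not reproduced above. As a rough, hypothetical illustration only: a simplified flat (simSDT) composition is a flat map of path/value pairs, where the `ctx/*` keys carry context and the remaining paths are template-specific - the `vital_signs/...` paths below are made up:)

```json
[
  {
    "ctx/language": "en",
    "ctx/territory": "US",
    "ctx/composer_name": "ETL Pipeline",
    "vital_signs/body_temperature/any_event/temperature|magnitude": 37.2,
    "vital_signs/body_temperature/any_event/temperature|unit": "°C"
  }
]
```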
A snippet from the script that does the conversion looks like:
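(The snippet itself did not survive here; what follows is a hedged reconstruction of the flow, assuming the SDK's `FlatJasonProvider`/`FlatJson` API and the cached `validator`/`webTemplate` from the fix above. `templateProvider` is an implementation of the SDK's `TemplateProvider`, and `flatCompositions` holds the raw flat JSON strings:)

```java
import com.nedap.archie.rm.composition.Composition;
import org.ehrbase.serialisation.flatencoding.FlatFormat;
import org.ehrbase.serialisation.flatencoding.FlatJasonProvider;
import org.ehrbase.serialisation.flatencoding.FlatJson;

// Build an unmarshaller for the simplified flat (simSDT) format once per template.
FlatJson flatJson = new FlatJasonProvider(templateProvider)
        .buildFlatJson(FlatFormat.SIM_SDT, templateId);

for (String flatString : flatCompositions) {
    // Flat JSON -> RM Composition
    Composition composition = flatJson.unmarshal(flatString);
    // Validate against the cached WebTemplate
    var violations = validator.validate(composition, webTemplate);
    if (violations.isEmpty()) {
        // convert to the database-native representation and hand it to
        // the put_composition stored procedure (see below)
    }
}
```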
The `put_composition` is a stored procedure on Postgres that will do what's necessary to create a composition, contribution, party, and entries in the database. This takes about <30 ms per composition.
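For reference, a minimal sketch of invoking such a procedure from Java over plain JDBC. The actual signature of `put_composition` is not shown in this issue, so the single jsonb parameter is an assumption:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Hypothetical call shape - the real put_composition signature isn't shown here.
try (Connection conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword);
     PreparedStatement stmt =
             conn.prepareStatement("call put_composition(?::jsonb)")) {
    stmt.setString(1, dbNativeCompositionJson); // database-native composition JSON
    stmt.execute();
}
```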
Actual result
Validating and transforming 3013 compositions took a total of 661.1 seconds, running on an M1 MacBook Air (running Java without Rosetta emulation).
The batches of x compositions each took:

Averaging at 219 ms per composition.

Expected result (Acceptance Criteria)
Running the validation and transformation operations should be at least as fast as the database insert operation (~30 ms), so that validation and transformation do not become the choke point in ETL pipelines. Any other suggestions or workarounds to speed up the process would also be much appreciated!
Definition of Done