Closed: guybartal closed this issue 8 months ago
Can you please try a version from the master branch? There were some fixes related to the Databricks dialect submitted there. Maybe it will fix your problem.
I think @konstjar is correct, but I will call out something that we saw with certain JDBC drivers for Redshift (there's an open issue about it somewhere): the issue was related to writing to cohort_cache
(which is our caching mechanism for cohort generation). I would definitely go with @konstjar's suggestion first, but there may also be a 'leaking transaction' or some sort of cache-insert-conflict behavior happening, where the cohort is being generated twice and the cache inserts are leading to this conflict.
@chrisknoll and @konstjar , we're seeing the same issue when we try to generate multiple different cohorts concurrently (on the latest release: WebAPI 2.13.0 and Atlas 2.13.0).
The failing queries look like:
INSERT
OVERWRITE TABLE omop_results.cohort_inclusion_stats_cache
SELECT
*
FROM
omop_results.cohort_inclusion_stats_cache
WHERE
NOT (
design_hash = -1114425973
and mode_id = 1
)
and:
WITH insertion_temp AS (
(
SELECT
662387645 as design_hash,
person_id,
start_date,
end_date
FROM
tmp_v0907.pel5lbbbfinal_cohort CO
)
UNION ALL
(
SELECT
design_hash,
subject_id,
cohort_start_date,
cohort_end_date
FROM
omop_results.cohort_cache
)
)
INSERT
OVERWRITE TABLE omop_results.cohort_cache (
design_hash,
subject_id,
cohort_start_date,
cohort_end_date
)
SELECT
*
FROM
insertion_temp
In both cases, the issue appears to be caused by the use of INSERT OVERWRITE TABLE, rather than a DELETE statement followed by an INSERT.
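For comparison, here is a sketch (using the table and predicate from the first query above) of what the same cleanup would look like as a direct DELETE, which only touches the conflicting rows instead of rewriting the whole table. This assumes the target is a Databricks Delta table, where DELETE FROM is supported:

```sql
-- Current behavior (sketch): the whole table is rewritten, so two
-- concurrent generations each overwrite the other's rows.
INSERT OVERWRITE TABLE omop_results.cohort_inclusion_stats_cache
SELECT * FROM omop_results.cohort_inclusion_stats_cache
WHERE NOT (design_hash = -1114425973 AND mode_id = 1);

-- Alternative (sketch): only the rows for this design_hash/mode_id are
-- removed, so generations of different cohorts do not conflict.
DELETE FROM omop_results.cohort_inclusion_stats_cache
WHERE design_hash = -1114425973 AND mode_id = 1;
```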
Ok, this behavior goes back to a very, very old discussion about DBMS support... (and I did object to this approach at the time):
There were some DBMS platforms that did not support the delete operator on a table, instead you would re-create the table by inserting everything EXCEPT the rows you meant to delete. My original objection is exactly what you raised here: what if there are 2 sessions doing the same thing? Which delete wins?
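To make that race concrete, here is an illustrative interleaving (hypothetical session names and timing, sketched as comments) of how the delete-by-rewrite emulation can lose one session's work:

```sql
-- Illustrative only: two sessions generating different cohorts.
-- Session A (cohort X):           Session B (cohort Y):
-- 1. reads cache, filtering out
--    X's old rows
--                                 2. reads cache, filtering out
--                                    Y's old rows (still sees X's
--                                    old rows from before step 1)
-- 3. OVERWRITEs the cache with
--    its filtered snapshot
--                                 4. OVERWRITEs the cache, clobbering
--                                    A's write (or failing with a
--                                    conflict if the engine detects
--                                    the concurrent table rewrite)
-- A true DELETE removes only each session's own rows, so the two
-- sessions do not contend for the whole table.
```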
So, this is buried in the bowels of SqlRender, but this was done at least 5 years ago if I'm not mistaken, and since then maybe those platforms have gained support for DELETE FROM operations. @TomWhite-MedStar, are you saying that Databricks does allow the DELETE FROM {table} WHERE ...
syntax? If so, we can resolve this in SqlRender.
Edit: here's my post! I bet that sand milkshake sounds pretty good about now.
@chrisknoll , if you can give me examples of DELETE FROM {table} syntax that needs to be supported, I'm happy to test them.
I just did the following and it worked fine on Databricks:
create table tmp.cohort_copy as select * from results.cohort where cohort_id in (305, 309, 581, 582);
delete from tmp.cohort_copy where cohort_id in (581, 582);
That would be it. Usually it's just a simple 'delete from table where col = value' to drop prior results.
If you can do DELETE FROM, then we need to figure out the SqlRender modification that will not substitute the DELETE FROM with INSERT OVERWRITE.
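For reference, SqlRender drives its dialect translations from a CSV of source/target patterns (replacementPatterns.csv in the SqlRender repo). The rows below are an illustrative sketch of the kind of change involved, not the actual file contents:

```
spark,"DELETE FROM @table WHERE @conditions;","INSERT OVERWRITE TABLE @table SELECT * FROM @table WHERE NOT (@conditions);"
spark,"DELETE FROM @table WHERE @conditions;","DELETE FROM @table WHERE @conditions;"
```

The first row sketches the assumed current rewrite behavior; removing it (the second, passthrough form is equivalent to having no rule at all) would let the DELETE reach Databricks unchanged.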
Edit: I think there is a case of a dual-column delete in the first example you gave:
WHERE
NOT (
design_hash = -1114425973
and mode_id = 1
)
This was originally 'DELETE FROM .... WHERE design_hash = xxx and mode_id = 1'.
I confirmed that DELETE FROM .... WHERE design_hash = xxx and mode_id = 1
works on Databricks.
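Assuming the SqlRender change lands, the cache maintenance from the examples above would then render as plain delete-then-insert statements (a sketch using the names from the posted queries, not actual SqlRender output):

```sql
-- Inclusion-stats cache: drop only this design's prior rows.
DELETE FROM omop_results.cohort_inclusion_stats_cache
WHERE design_hash = -1114425973 AND mode_id = 1;

-- Cohort cache: drop prior rows for this design, then insert the new ones.
DELETE FROM omop_results.cohort_cache
WHERE design_hash = 662387645;

INSERT INTO omop_results.cohort_cache
  (design_hash, subject_id, cohort_start_date, cohort_end_date)
SELECT 662387645, person_id, start_date, end_date
FROM tmp_v0907.pel5lbbbfinal_cohort;
```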
Ok, I'd suggest opening an issue on SqlRender to make the change there; then the fix for this issue will be to update the WebAPI dependency on SqlRender to the new version that fixes it.
Expected behavior
Generating a Cohort Pathway should work on a Databricks JDBC endpoint (tested on MS SQL and Azure Synapse, where it works).
Actual behavior
Fails with the following error:
Steps to reproduce behavior