OHDSI / ETL-CDMBuilder

ETL-CDMBuilder is a repo containing a .NET Core application to perform ETL to OMOP CDM for multiple databases
Apache License 2.0
50 stars 32 forks source link

Builder stuck at Chunks creation #32

Closed erwanlc closed 7 years ago

erwanlc commented 7 years ago

I am running ETL-CDMBuilder on Windows Server 2016 to convert Truven MDCR data to CDM format. I managed to create the builder database and the first step 'CDM database creation' but I am stuck at the Chunks creation. I don't have any timelines indication and Errors is also empty so I can't figure out what's going on.

image

When I look in the table 'Building' from the builder database I only have this: image

Does anyone have an idea to solve this issue ?

bradanton commented 7 years ago

Which setting did you used in following?: 1)Number of batches 2)MaxDegreeOfParallelism 3)BatchSize

erwanlc commented 7 years ago

I am using the default settings to test the tool: 1)Number of batches = 10 2)MaxDegreeOfParallelism = 1 3)BatchSize = 1000

bradanton commented 7 years ago

Did you try to stop the builder then close and start again?

erwanlc commented 7 years ago

Yes, I did it several times. I also tried to restart from scratch by deleting the builder database without any success. I have the feeling that when I click on running, it doesn't do anything because even for the CDM database creation I need to do it by right click > Create tables. If just I launch the building process, the tables are not created. Is there a way to launch each step manually ?

bradanton commented 7 years ago

Okay, I have found the reason. The script for creation builder database is outdated. You need to recreate [Building] table and add [SubChunk] table, scripts below:

DROP TABLE [dbo].[Building];

CREATE TABLE [dbo].[Building]( [Id] [int] NOT NULL, [CreateDestinationDbStart] [datetime] NULL, [CreateDestinationDbEnd] [datetime] NULL, [CreateChunksStart] [datetime] NULL, [CreateChunksEnd] [datetime] NULL, [CreateLookupStart] [datetime] NULL, [CreateLookupEnd] [datetime] NULL, [BuildingStart] [datetime] NULL, [BuildingEnd] [datetime] NULL, [CopyVocabularyStart] [datetime] NULL, [CopyVocabularyEnd] [datetime] NULL, [CreateIndexesStart] [datetime] NULL, [CreateIndexesEnd] [datetime] NULL, [AchillesStart] [datetime] NULL, [AchillesEnd] [datetime] NULL, CONSTRAINT [PK_Building] PRIMARY KEY CLUSTERED ( [Id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]

CREATE TABLE [dbo].[SubChunk]( [ChunkId] [int] NOT NULL, [Index] [int] NOT NULL, [MinPersonId] [bigint] NOT NULL, [MaxPersonId] [bigint] NOT NULL, [PersonCount] [int] NOT NULL, [Saved] [tinyint] NOT NULL CONSTRAINT [DF_SubChunk_Saved] DEFAULT ((0)), CONSTRAINT [PK_SubChunk] PRIMARY KEY CLUSTERED ( [ChunkId] ASC, [Index] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]

GO

ALTER TABLE [dbo].[SubChunk] WITH CHECK ADD CONSTRAINT [FK_SubChunk_Chunk] FOREIGN KEY([ChunkId]) REFERENCES [dbo].[Chunk] ([Id]) ON DELETE CASCADE GO

ALTER TABLE [dbo].[SubChunk] CHECK CONSTRAINT [FK_SubChunk_Chunk] GO

erwanlc commented 7 years ago

Thanks for your help. I am now able to execute all the steps but nothing is stored in the destination database except the vocabulary. All the CDM tables are empty (there is nothing in the table PERSON for example). Do you think it could be because of the way I stored my data ? Does the DLLs to create the tables for Truven datasets are availables ?

bradanton commented 7 years ago

Most likely the the issue with your data and not with the schema. You can easily check, all queries are located here: {Builder folder}\TruvenV5\

1) You need to check {Builder folder}\TruvenV5\Batch.sql - this query used on Chunk creation step inside query your just need to replace {sc} to your schema name dbo or etc. and replace {0} with Number of batches * BatchSize

if query returned no data, so problem is here, take a look to query WHERE clause

2) All queries that are used for getting data for generating CDM tables are here: {Builder folder}\TruvenV5\Definitions\

erwanlc commented 7 years ago

The query doesn't return any data. Thanks for providing me the path. I will take a look on these files.

bradanton commented 7 years ago

No problem. Feel free to ask. With Batch.sql first of all you need to check RX field (must be '1') and DOBYR field (must be < 1900 or > the current year) the both fields are in ENROLLMENT_DETAIL table.

erwanlc commented 7 years ago

I am testing the first year (2011) of mdcd for now.

I put the last clause where of the BatchMDCD.sql in comments as I had no data because of it:

WITH CTE_SUMMARY AS (
SELECT ENROLID, MIN(DOBYR) MIN_DOB, MAX(DOBYR) AS MAX_DOB, MIN(ed.SEX) AS MIN_SEX, MAX(ed.SEX) AS MAX_SEX
FROM dbo.ENROLLMENT_DETAIL ed
GROUP BY ENROLID
)
SELECT DISTINCT {0} z.ENROLID, z.ENROLID
FROM dbo.ENROLLMENT_DETAIL z
LEFT JOIN (
SELECT ENROLID, DATEPART(year, MIN(DTSTART)) AS MIN_ENROLL_YEAR
FROM dbo.ENROLLMENT_DETAIL
GROUP BY ENROLID ) m ON m.ENROLID = z.ENROLID
WHERE z.ENROLID IS NOT NULL AND (DRUGCOVG = '1' OR MEDICARE = '1') AND z.DOBYR BETWEEN 1900 AND DATEPART(year, GETDATE())
AND z.DOBYR <= (m.MIN_ENROLL_YEAR+1) -- and z.ENROLID NOT IN ( -- SELECT ENROLID
-- FROM CTE_SUMMARY
-- WHERE MIN_DOB != 0 AND MAX_DOB > DATEPART(year, GETDATE()) - 90 AND ABS(MIN_DOB-MAX_DOB) > 2 -- ) ORDER BY z.ENROLID

But now I have this error during the chunks creation:

Builder: GDCPC900208
Time: Sep 12 2017  5:04PM
Error: Exception Found:
Type: System.FormatException
Message: Input string was not in a correct format.
Source: mscorlib
Stacktrace:    at System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
   at System.Number.ParseInt64(String value, NumberStyles options, NumberFormatInfo numfmt)
   at System.Convert.ToInt64(String value)
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.<>c.<AddSubChunks>b__8_0(KeyValuePair`2 c) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 51
   at System.Linq.EnumerableSorter`2.ComputeKeys(TElement[] elements, Int32 count)
   at System.Linq.EnumerableSorter`1.Sort(TElement[] elements, Int32 count)
   at System.Linq.OrderedEnumerable`1.<GetEnumerator>d__1.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.AddSubChunks(Int32 chunkId, IEnumerable`1 chunk) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 51
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.CreateChunks() in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 97
   at org.ohdsi.cdm.framework.core.Controllers.BuilderController.PerformAction(Action act) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\BuilderController.cs:line 80
bradanton commented 7 years ago

Which data type you uses for ENROLID? (ENROLLMENT_DETAIL table)

erwanlc commented 7 years ago

It is REAL, I didn't change what SQL Server suggests. I guess I should have.

I also tried the convertion on the year 2011 of MDCR but I have an error saying it doesn't find the GEOLOC table. Unfortunately I have no clue about what this table about.

bradanton commented 7 years ago

EGEOLOC - field from MDCR and not exists in MDCD, MDCD and CCAE\MDCR have bit different schema. You need to use appropriate Vendor for each. capture

erwanlc commented 7 years ago

Yes, I noticed, but when using the right vendor for the MDCR I still have this error:

Message: ERROR [42S02] [Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid object name 'dbo.GEOLOC'.

I saw the variable EGEOLOC in the tables of MDCR but here, it seems the tool is looking for a table named GEOLOC and I don't know what it is.

ericaVoss commented 7 years ago

I believe Truven provided the GEOLOC table that defines EGEOLOC (Geographic Location of Employee) that is used to map state. It basically relates a EGEOLOC to a state.

erwanlc commented 7 years ago

Thanks for this answer. I will ask internally if we have it.

Do you have any idea for the trouble above trouble ? I get it when I try to convert the mdcd dataset:

Builder: GDCPC900208
Time: Sep 12 2017  5:04PM
Error: Exception Found:
Type: System.FormatException
Message: Input string was not in a correct format.
Source: mscorlib
Stacktrace:    at System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
   at System.Number.ParseInt64(String value, NumberStyles options, NumberFormatInfo numfmt)
   at System.Convert.ToInt64(String value)
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.<>c.<AddSubChunks>b__8_0(KeyValuePair`2 c) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 51
   at System.Linq.EnumerableSorter`2.ComputeKeys(TElement[] elements, Int32 count)
   at System.Linq.EnumerableSorter`1.Sort(TElement[] elements, Int32 count)
   at System.Linq.OrderedEnumerable`1.<GetEnumerator>d__1.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.AddSubChunks(Int32 chunkId, IEnumerable`1 chunk) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 51
   at org.ohdsi.cdm.framework.core.Controllers.ChunkController.CreateChunks() in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\ChunkController.cs:line 97
   at org.ohdsi.cdm.framework.core.Controllers.BuilderController.PerformAction(Action act) in C:\Users\E613015a\Documents\ETL-CDMBuilder-master\ETL-CDMBuilder-master\source\Framework\org.ohdsi.cdm.framework.core\Controllers\BuilderController.cs:line 80
bradanton commented 7 years ago

This issue related with ENROLID, check ENROLLMENT_DETAIL, ENROLID must be INT or BIGINT

erwanlc commented 7 years ago

Ok, I think I am getting closer. I checked the queries in the folder: {Builder folder}\TruvenV5\Definitions\ and they return some data. The tool seems also to work better as I don't get errors and the process time is longer than before.

However, I still don't have anything in the destination tables after the step 'Conversion to CDM'. For example, my destination table PERSON is empty but the query in ENROLLMENT_DETAIL_MDCD.xml returns some data.

Any idea ?

bradanton commented 7 years ago

Did you try the query from ENROLLMENT_DETAIL_MDCD.xml with your real chunkId? I mean did you replaced {0} with existing chunkId? JOIN {sc}._chunks ch ON ch.ChunkId = {0} AND ENROLID = ch.PERSON_ID

erwanlc commented 7 years ago

Yes I did, I took a chunckId from the table _chunks. And it returns me what I believe should be inserted in the table PERSON. When I compare to what the Truven_MDCD_ETL_CDM_V5.0.1.docx describe, it seems good.

bradanton commented 7 years ago

Okay then, person records can be filtered out by three reason:

  1. person has unknown gender
  2. gender changed over different enrollment period
  3. person has an OBSERVATION_PERIOD that is >= 2 years prior to the YEAR_OF_BIRTH
erwanlc commented 7 years ago

I just checked and these conditions are not filled. I really don't understand why no data are inserted when the 'Conversion to CDM' step takes more than 30min to execute for 10 batch of 1000.

bradanton commented 7 years ago

Can you share 10 latest records from Builder Log table, query below:

SELECT TOP 10 [Type] ,[Time] ,[Message] FROM [dbo].[Log] order by Id desc

erwanlc commented 7 years ago

Thanks a lot for your time and patience. Here the 10 latest records of the Log table: image

The corresponding csv file for complete messages: log.zip

bradanton commented 7 years ago

hmm.. I see only OUTPATIENT_SERVICES query in log, can you also check that ENROLLMENT_DETAIL queries were executed, query below.

SELECT count(*) FROM [dbo].[Log] where [Message] like '%ENROLLMENT_DETAIL%'

erwanlc commented 7 years ago

As you can see, queries were executed: image

The last 10 ENROLLMENT_DETAIL queries: log_enrollment_detail.zip

I tested several queries from the log and they all return me data.

bradanton commented 7 years ago

Can you try another test, use the same source database and new destination with following setting:

1)Number of batches = 10 2)MaxDegreeOfParallelism = 1 3)BatchSize = 10

erwanlc commented 7 years ago

Still no results, the logs are here if you want to take a look:

log_nb10_mdop1_bs10.zip

bradanton commented 7 years ago

okay, really odd, the same version of cdm builder works well on my side.

lets try following: 1)Number of batches = 1 2)MaxDegreeOfParallelism = 1 3)BatchSize = 1

erwanlc commented 7 years ago

With these settings, the tool get stucks at 'Chunks creation', it seems to load the step indefinitely.

Are you sure I may not be caused by the data type I choose for my source tables ? I was think that a filter on wrong data format could return an empty result and thus not inserting the converted data in the database.

ericaVoss commented 7 years ago

@bradanton would it make sense for us to release an updated builder on GitHub for @erwanlc to try. We haven't released an update in 4 month.

bradanton commented 7 years ago

@erwanlc we will share new version of the cdm builder v 1.9.0.32 later today

Builder db was changed so you need to start from scratch, create new builder db using cdm builder and be careful with setting, new cdm builder contains default setting for Redshift environment. I recommend to use previous settings for tests:

1)Number of batches = 10 2)MaxDegreeOfParallelism = 1 3)BatchSize = 1000

erwanlc commented 7 years ago

Ok, thanks a lot for your help and time. I will keep your informed after trying the new version.

ericaVoss commented 7 years ago

@erwanlc - @bradanton and I are posted a new copy of Builder to GitHub. That might be another thing that will help. We are asking Truven if we could share the DDL but have not gotten the OK yet.

erwanlc commented 7 years ago

Thanks a lot for your help, by realoading all my source data in the database and using your last build, it seems to work now ! Anyway, I think sharing the DDL could be a great help for people using this tool. I hope Truven will agree. Thanks again for your time.

ericaVoss commented 7 years ago

@erwanlc, I'm trying to get the DDL to you however Truven is trying to figure our your license. Could you email me directly to try to get this resolved? evoss3 AT its.jnj.com?