OHDSI / WhiteRabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by WhiteRabbit.
http://ohdsi.github.io/WhiteRabbit
Apache License 2.0

ScanReport cannot be used for fake-data-generation or rabbit-in-a-hat #367

Open thoniTUB opened 1 year ago

thoniTUB commented 1 year ago

Describe the bug
The fake-data generator and rabbit-in-a-hat cannot read a scan report that was produced by white-rabbit under a German locale.

The following error (shortened) appears when running the fake-data generation:

*** Generic error information ***
Message: For input string: "0,679"
...

*** Stack trace ***
java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
java.base/java.lang.Double.parseDouble(Double.java:651)
org.ohdsi.utilities.files.QuickAndDirtyXlsxReader$Row.getDoubleByHeaderName(QuickAndDirtyXlsxReader.java:591)
org.ohdsi.rabbitInAHat.dataModel.Database.generateModelFromScanReport(Database.java:182)
org.ohdsi.whiteRabbit.fakeDataGenerator.FakeDataGenerator.generateData(FakeDataGenerator.java:53)
org.ohdsi.whiteRabbit.WhiteRabbitMain$FakeDataThread.run(WhiteRabbitMain.java:1076)

*** Console ***
An error report has been generated:
...\whiteRabbit/Error.txt
10.02.2023, 09:08:43    Starting creation of fake data
Loading scan report from ...\ScanReport_...csv.xlsx
Error: For input string: "0,679"

I searched the report and found the offending cell in the Field Overview sheet, in the Fraction unique column. The cell content was a string with a decimal comma: <= 0,679
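The failure can be reproduced in isolation. The following is a minimal sketch, not WhiteRabbit code: the class and method names are hypothetical, and it assumes the scan report value was written with locale-sensitive formatting. It shows that a fraction formatted under a German locale gets a decimal comma, which Double.parseDouble (the call at the top of the stack trace) rejects.

```java
import java.util.Locale;

// Hypothetical demo class, not part of WhiteRabbit.
final class DecimalCommaDemo {

    // Formats a fraction with three decimals using the given locale's decimal separator.
    static String formatFraction(double value, Locale locale) {
        return String.format(locale, "%.3f", value);
    }

    // Returns true if Double.parseDouble accepts the text, false if it throws.
    static boolean parsesAsDouble(String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String german = formatFraction(0.679, Locale.GERMANY); // decimal comma
        String neutral = formatFraction(0.679, Locale.ROOT);   // decimal point
        System.out.println(german + " parses: " + parsesAsDouble(german));
        System.out.println(neutral + " parses: " + parsesAsDouble(neutral));
    }
}
```

Double.parseDouble only understands the Java literal format with a decimal point, regardless of the system locale, so the reader fails on any report written with a decimal comma.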

To Reproduce
Steps to reproduce the behavior:

  1. Produce a ScanReport on a system with a German locale active.
  2. Check whether the ScanReport contains <= 0,... values in the Fraction unique column.
  3. Try to generate fake data from the report.

Expected behavior
The produced ScanReport can be loaded by the fake-data generator and rabbit-in-a-hat.

Desktop (please complete the following information):
Processor type: amd64
Available processors: 4
Maximum available memory: 1,258,291,200 bytes
Used memory: 324,052,120 bytes
Java version: 17.0.2
Java vendor: Oracle Corporation
OS architecture: amd64
OS name: Windows 10
OS version: 10.0

Additional info
After replacing the decimal comma with a decimal point in all affected cells (<= 0, -> <= 0.), everything works.

thoniTUB commented 1 year ago

I submitted a simple fix for the problem: https://github.com/OHDSI/WhiteRabbit/pull/368

Another solution would be to drop the <= 0,/<= 0. string syntax altogether and rely solely on the double data type with a percentage representation.
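For illustration, here is a minimal sketch of the general approach of making both sides locale-independent. It is not the content of the pull request above; the class and method names are hypothetical, and it assumes the cell holds a fraction between 0 and 1, optionally prefixed with <=, so no grouping separators can occur.

```java
import java.util.Locale;

// Hypothetical helper, not part of WhiteRabbit.
final class FractionCell {

    // Writer side: always emit a decimal point, independent of the system locale.
    static String format(double fraction) {
        return String.format(Locale.ROOT, "%.3f", fraction);
    }

    // Reader side: tolerate an optional "<=" prefix and a decimal comma
    // found in reports that were written under a locale such as German.
    static double parse(String cell) {
        String text = cell.trim();
        if (text.startsWith("<=")) {
            text = text.substring(2).trim();
        }
        // Safe only because fractions in [0, 1] never contain grouping separators.
        return Double.parseDouble(text.replace(',', '.'));
    }

    public static void main(String[] args) {
        System.out.println(parse("<= 0,679"));
        System.out.println(parse("<= 0.679"));
        System.out.println(format(0.679));
    }
}
```

Fixing only the writer would leave existing reports unreadable, which is why the sketch also makes the reader tolerant of the decimal comma.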