OHDSI / WhiteRabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the scan report generated by WhiteRabbit.
http://ohdsi.github.io/WhiteRabbit
Apache License 2.0

RabbitInAHat fails to load a custom model if BOM is set #411

Closed BillCM closed 3 months ago

BillCM commented 7 months ago

Describe the bug I created a custom model in Excel (XLSX) and exported it to CSV. The file failed to load with this error:

java.lang.IllegalArgumentException: Mapping for table not found, expected one of [table, field, required, type, schema, description]
    at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:121)
    at org.ohdsi.rabbitInAHat.dataModel.Database.generateModelFromCSV(Database.java:117)
    at org.ohdsi.rabbitInAHat.RabbitInAHatMain.doSetTargetCustom(RabbitInAHatMain.java:465)
    at org.ohdsi.rabbitInAHat.RabbitInAHatMain.lambda$createMenuBar$9(RabbitInAHatMain.java:268)

The problem is that CSV parsing does not account for the Byte Order Mark (BOM).
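To illustrate the failure mode: a UTF-8 BOM is the byte sequence EF BB BF, which decodes to the character U+FEFF. Java's UTF-8 decoder keeps that character, so the first header is read as U+FEFF followed by "table" rather than plain "table", and a lookup by the name "table" fails. Because U+FEFF is normally invisible when printed, the error message still appears to list "table" among the expected headers. The sketch below uses only the JDK; the class name, method names, and sample content are made up for illustration and are not RabbitInAHat code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class BomHeaderDemo {
    // Decodes the bytes as UTF-8 and returns the comma-separated names on the first line.
    public static List<String> headerNames(byte[] csvBytes) {
        String text = new String(csvBytes, StandardCharsets.UTF_8);
        String firstLine = text.split("\r?\n", 2)[0];
        return Arrays.asList(firstLine.split(","));
    }

    // Returns a copy of the content with the UTF-8 BOM bytes (EF BB BF) prepended.
    public static byte[] withBom(byte[] content) {
        byte[] out = new byte[content.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(content, 0, out, 3, content.length);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample model file, using the header names from the error message.
        byte[] plain = ("table,field,required,type,schema,description\n"
                + "person,person_id,Yes,INTEGER,cdm,identifier\n")
                .getBytes(StandardCharsets.UTF_8);

        // Without a BOM the "table" header is found.
        System.out.println(headerNames(plain).contains("table"));          // true
        // With a BOM the first header carries a leading U+FEFF, so the lookup misses.
        System.out.println(headerNames(withBom(plain)).contains("table")); // false
    }
}
```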

To Reproduce Steps to reproduce the behavior:

  1. Create a custom model in Excel (XLSX)
  2. Export the model to CSV
  3. Use Edit -> Set Target Database -> Load Custom
  4. Select the exported CSV. A popup error reports that the column headers were not found.

Expected behavior CSV opens correctly.

Workaround Open the exported CSV in a text editor and change the encoding from "UTF-8 with BOM" to "UTF-8".

Additional context The issue

janblom commented 7 months ago

@BillCM is it possible to attach the CSV file that causes the problem to this issue (or a stripped-down one, as long as it causes the same problem)? This can save me some time when building a test case.

Thanks, Jan

BillCM commented 6 months ago

@janblom Correct. CSVs exported from Excel have the Byte Order Mark set. The only way to make RabbitInAHat read the file is to remove the BOM by changing the encoding. Perhaps this is worth a note in the docs?

janblom commented 6 months ago

Hi @BillCM , thank you for reporting this issue.

I have already prepared a fix which adds flexibility, so that RabbitInAHat can read CSVs with and without a BOM. This will be part of the upcoming 1.0 release. (Unfortunately, testing another aspect of that release is taking some time.)
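One common way to get this kind of flexibility is to peek at the first decoded character and discard it only if it is U+FEFF. The sketch below shows that general technique using only the JDK. It is not the actual change made in WhiteRabbit, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomTolerantReader {
    private static final int BOM = 0xFEFF;

    // Wraps a UTF-8 stream in a Reader that skips one leading BOM character, if present.
    public static Reader open(InputStream in) throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, StandardCharsets.UTF_8), 1);
        int first = reader.read();
        if (first != -1 && first != BOM) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Convenience helper for the demo: returns the first line of the given CSV bytes.
    public static String firstLine(byte[] csvBytes) {
        try (BufferedReader br =
                     new BufferedReader(open(new ByteArrayInputStream(csvBytes)))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] plain = "table,field\nperson,person_id\n".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[plain.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(plain, 0, withBom, 3, plain.length);

        // Both inputs yield the same header line.
        System.out.println(firstLine(plain));
        System.out.println(firstLine(withBom));
    }
}
```

The Reader returned by `open` can be handed to any CSV parser that accepts a `java.io.Reader`. Apache Commons IO also ships a `BOMInputStream` class for the same purpose, which may be preferable where that dependency is already available.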

Since this issue will be fixed, it is not necessary to update the docs. This issue will serve as the (temporary) documentation until the fix is released and the issue is closed. (The fix is in my employer's public repo until I have it approved and merged into the OHDSI repo.)

If possible, could you attach a CSV to this issue that I can use to reproduce the bug? While I am fairly confident that the upcoming fix will cover your case, there is nothing better than having the certainty :-)

Thanks, Jan

BillCM commented 6 months ago

@janblom I think this very issue is causing the build to break. It appears that the embedded CSVs for CDM5.0 and CDM5.1 and their stem models are all being identified as Excel encoded with BOM. This is causing the mvn build to fail for main branch on my machine.

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException: Input length = 1 -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-resources-plugin:3.3.1:resources (default-resources) on project rabbitinahat: filtering /Users/bill/ext/WhiteRabbit/rabbitinahat/src/main/resources/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv to /Users/bill/ext/WhiteRabbit/rabbitinahat/target/classes/org/ohdsi/rabbitInAHat/dataModel/StemTableV5.0.csv failed with MalformedInputException

Upon opening the code in IntelliJ, the 4 files in question are linked to the Excel icon and will not open for editing.

[Image: Screenshot 2024-04-30 at 2 00 50 PM]

After converting the files to UTF-8, the build works.
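A MalformedInputException generally means that some bytes could not be decoded in the charset being used. Assuming the build decodes resources as UTF-8, a UTF-8 BOM alone would not trigger it, since EF BB BF is itself valid UTF-8. A quick way to check what is actually in a file is to run it through a strict decoder and look for a BOM separately. The following is a diagnostic sketch using only the JDK; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Check {
    // Returns true if the bytes decode as UTF-8 without any malformed sequences.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns true if the bytes start with the UTF-8 BOM (EF BB BF).
    public static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && bytes[0] == (byte) 0xEF
                && bytes[1] == (byte) 0xBB
                && bytes[2] == (byte) 0xBF;
    }

    // Usage: pass one or more file paths as arguments to get a one-line report per file.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + ": validUtf8=" + isValidUtf8(bytes)
                    + ", utf8Bom=" + hasUtf8Bom(bytes));
        }
    }
}
```

A file that reports `validUtf8=false` is stored in some other encoding, such as UTF-16 or a legacy single-byte code page, and would need converting rather than just having a BOM removed.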

janblom commented 6 months ago

I am unable to reproduce the last report on either Linux or macOS. Could it be that the CSV files involved were inadvertently changed? I suspect an encoding problem (a setting on your machine, such as the locale), but I am unable to verify that. Since this is very likely not related to the issue reported here first, I will not investigate it further in this context. If you do think this is a problem of the WhiteRabbit project, please report it in a separate issue.

It is in any case not related to the first problem reported in this issue (I was able to confirm that). The original problem is now fixed in the release-1.0.0 branch, along with a second BOM-related issue in WhiteRabbit. It will be in the planned 1.0.0 release, hopefully soon.

janblom commented 6 months ago

A fix for the first issue reported in this thread is included in the second release candidate of version 1.0.0.

janblom commented 3 months ago

Resolved in WhiteRabbit version 1.0.0.