aehrc / pathling

Tools that make it easier to use FHIR® and clinical terminology within data analytics, built on Apache Spark.
https://pathling.csiro.au
Apache License 2.0

MIMIC-IV FHIR Import Crash - Could not initialize class org.apache.commons.text.lookup.StringLookupFactory #1783

Open cyrilzakka opened 8 months ago

cyrilzakka commented 8 months ago

Hello,

I'm following the instructions at https://github.com/kind-lab/mimic-fhir/blob/main/tutorial/mimic-fhir-tutorial-pathling.ipynb to import the MIMIC-IV FHIR NDJSON files into Pathling, using the Docker image from https://hub.docker.com/r/aehrc/pathling. When I run the following script:

from pathlib import Path
import requests
import json
import ndjson
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams.update({'font.size': 20})

from fhirclient.models.parameters import Parameters, ParametersParameter
from py_mimic_fhir.lookup import MIMIC_FHIR_PROFILES

import_folder = 'file:///usr/share/staging' 
server = 'http://localhost:8080/fhir'

def generate_import_parameters(import_folder, profile, resource, mode):
    param_resource = Parameters()

    param_resource_type = ParametersParameter()
    param_resource_type.name= 'resourceType'
    param_resource_type.valueCode = resource

    param_url = {}
    param_url['name'] = 'url'
    param_url['valueUrl'] = f'{import_folder}/{profile}.ndjson'

    param_mode = ParametersParameter()
    param_mode.name= 'mode'
    param_mode.valueCode = mode

    param_source = ParametersParameter()
    param_source.name = 'source'
    param_source.part = [param_resource_type, param_url, param_mode]
    param_resource.parameter = [param_source]

    return param_resource.as_json()

def post_import_ndjson(server, param):
    url = f'{server}/$import'

    resp = requests.post(url,  json = param, headers={"Content-Type": "application/fhir+json"} )
    return resp 

mode = 'merge' # overwrite for fresh load (but not really since need to merge Observations not overwrite)

for profile, item in MIMIC_FHIR_PROFILES.items():
    resource = item['resource']
    # ObservationChartevents too large and crashing all the observation searches
    if (profile != 'ObservationChartevents'):
        param = generate_import_parameters(import_folder, profile, resource, mode)
        resp = post_import_ndjson(server, param)
        print(f"{profile}: {resp.json()['issue'][0]['diagnostics']}")

I get the error:

mimic-pathling-1  | 20:31:40.214 [Executor task launch worker for task 1.0 in stage 34.0 (TID 456)] [] ERROR org.apache.spark.executor.Executor - Exception in task 1.0 in stage 34.0 (TID 456): Could not initialize class org.apache.commons.text.lookup.StringLookupFactory
mimic-pathling-1  | 20:31:40.214 [Executor task launch worker for task 38.0 in stage 34.0 (TID 493)] [] ERROR o.a.s.s.e.d.FileFormatWriter - Job job_202403182028225607545436317964348_0034 aborted.
mimic-pathling-1  | 20:31:40.215 [Executor task launch worker for task 38.0 in stage 34.0 (TID 493)] [] ERROR org.apache.spark.util.Utils - Uncaught exception in thread Executor task launch worker for task 38.0 in stage 34.0 (TID 493)
mimic-pathling-1  | java.lang.NullPointerException: null
mimic-pathling-1  |     at org.apache.spark.scheduler.Task.$anonfun$run$3(Task.scala:144)
mimic-pathling-1  |     at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1509)
mimic-pathling-1  |     at org.apache.spark.scheduler.Task.run(Task.scala:142)
mimic-pathling-1  |     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
mimic-pathling-1  |     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
mimic-pathling-1  |     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
mimic-pathling-1  |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
mimic-pathling-1  |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
mimic-pathling-1  |     at java.base/java.lang.Thread.run(Thread.java:829)
mimic-pathling-1  | 20:31:40.215 [Executor task launch worker for task 38.0 in stage 34.0 (TID 493)] [] ERROR org.apache.spark.executor.Executor - Exception in task 38.0 in stage 34.0 (TID 493): Could not initialize class org.apache.commons.text.lookup.StringLookupFactory

Any help or guidance would be greatly appreciated.
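
For readers following along, the request body that the script above assembles can also be sketched as a plain dictionary, which shows the shape of the `$import` payload without `fhirclient`. This is only an illustration under assumptions: `build_import_parameters` and `describe_response` are hypothetical helper names, not part of the tutorial or of Pathling, and the parameter names (`source`, `resourceType`, `url`, `mode`) are simply the ones used in the script above.

```python
import json


def build_import_parameters(import_folder, profile, resource, mode):
    """Return a FHIR Parameters resource (as a dict) describing one import source."""
    return {
        "resourceType": "Parameters",
        "parameter": [
            {
                "name": "source",
                "part": [
                    {"name": "resourceType", "valueCode": resource},
                    {"name": "url", "valueUrl": f"{import_folder}/{profile}.ndjson"},
                    {"name": "mode", "valueCode": mode},
                ],
            }
        ],
    }


def describe_response(status_code, body):
    """Summarise a parsed response body without assuming it is an OperationOutcome."""
    issues = body.get("issue") if isinstance(body, dict) else None
    if issues:
        # OperationOutcome-style body: report the first issue's diagnostics if present.
        return f"{status_code}: {issues[0].get('diagnostics', '(no diagnostics)')}"
    # Anything else: show a truncated dump so unexpected bodies are still visible.
    return f"{status_code}: {json.dumps(body)[:200]}"
```

In the loop, `print(f"{profile}: {describe_response(resp.status_code, resp.json())}")` would then avoid a `KeyError` when the server returns a body without an `issue` list. It still assumes the body is JSON, so `resp.json()` itself may raise on other content.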

johngrimes commented 8 months ago

Thanks for sending this through - let me try to reproduce and I will get back to you.

johngrimes commented 8 months ago

Hi @cyrilzakka,

I've had a go at loading the full MIMIC-IV data set to reproduce your problem.

I was unable to reproduce the exact error you described, but I did come across a couple of data quality issues that would make it impossible to import this data set in its current form.

I have documented them here:

I am also hanging out for these issues to be resolved so that I can start playing around with this wonderful data set!

I did fix the first issue using a script (attached to the issue), but the second is more involved to fix with post-processing, and I have not yet been able to produce a fully importable data set.
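
As a rough illustration of what a pre-import sanity check on an NDJSON file might look like, here is a minimal sketch. It is not the script attached to the issue and does not target the specific data quality problems mentioned above; it only checks that each non-blank line parses as a JSON object and carries the expected `resourceType`. The function name `check_ndjson_lines` is hypothetical.

```python
import json


def check_ndjson_lines(lines, expected_resource_type):
    """Return (line_number, problem) tuples for lines that fail basic checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        try:
            resource = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(resource, dict):
            problems.append((number, "not a JSON object"))
            continue
        found = resource.get("resourceType")
        if found != expected_resource_type:
            problems.append((number, f"unexpected resourceType: {found!r}"))
    return problems
```

Because an open file iterates line by line, something like `with open(path) as f: check_ndjson_lines(f, "Patient")` would stream a large file without loading it all into memory.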