MarquezProject / marquez


JobEvent with outputs populated fails to write with NullPointerException #2925

Open seanmullane opened 1 month ago

seanmullane commented 1 month ago

Emitting a JobEvent with input and/or output datasets causes an HTTP 500 error in the API, which results from a NullPointerException in Marquez.

Fixing this matters because it would allow static lineage graphs to be generated without being tied to active runs. That is useful when no integration is available yet to consume pipeline runs for a given system, or when a pipeline is not yet fleshed out but we want to register the job in Marquez to see how it would relate to other jobs.

The attached code includes a pure JSON version of the event, generated by the OpenLineage client, that triggers the bug in Marquez. I also included the Python code the JSON derives from and the Marquez error log.

Environment:

- Marquez 0.49.0, running via docker-compose per the Marquez example with --seed
- openlineage-python 1.22.0
- Python 3.11.9

Attachments: nullPointerException.txt, reproduce_bug.zip

More detail on this from phix on Slack:

It looks like we’re not processing the “outputFacets” on the IO fields without a runId provided. The event should save if you drop that field that’s the empty object for now… We should take a look at the OL spec for this
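The workaround described in the quote above (dropping the facets field when it is an empty object) could be applied on the client side before the event is sent. The following is only a rough sketch, assuming the event is held as nested maps and lists after JSON parsing; the class and method names are hypothetical, and `inputFacets` is included on the assumption that inputs are affected the same way as outputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: removes empty inputFacets/outputFacets objects from an event. */
public class EmptyFacetStripper {

  /** Returns a copy of one dataset entry without `key` when its value is an empty map. */
  public static Map<String, Object> dropIfEmpty(Map<String, Object> dataset, String key) {
    Map<String, Object> copy = new LinkedHashMap<>(dataset);
    Object value = copy.get(key);
    if (value instanceof Map && ((Map<?, ?>) value).isEmpty()) {
      copy.remove(key);
    }
    return copy;
  }

  /** Applies dropIfEmpty to every dataset entry in a list; non-map entries are kept as-is. */
  @SuppressWarnings("unchecked")
  private static List<Object> stripAll(Object datasets, String key) {
    List<Object> out = new ArrayList<>();
    if (datasets instanceof List) {
      for (Object d : (List<?>) datasets) {
        out.add(d instanceof Map ? dropIfEmpty((Map<String, Object>) d, key) : d);
      }
    }
    return out;
  }

  /** Returns a copy of the event with empty facet objects removed from inputs and outputs. */
  public static Map<String, Object> stripEmptyIoFacets(Map<String, Object> event) {
    Map<String, Object> copy = new LinkedHashMap<>(event);
    if (copy.containsKey("inputs")) {
      copy.put("inputs", stripAll(copy.get("inputs"), "inputFacets"));
    }
    if (copy.containsKey("outputs")) {
      copy.put("outputs", stripAll(copy.get("outputs"), "outputFacets"));
    }
    return copy;
  }

  public static void main(String[] args) {
    // Illustrative dataset entry; the namespace and name are made up.
    Map<String, Object> output = new LinkedHashMap<>();
    output.put("namespace", "example-namespace");
    output.put("name", "example.table");
    output.put("outputFacets", new LinkedHashMap<String, Object>());

    Map<String, Object> event = new LinkedHashMap<>();
    event.put("outputs", List.of(output));

    System.out.println(stripEmptyIoFacets(event));
  }
}
```

The helper copies the maps rather than mutating them, so the original event stays available for comparison. Non-empty facet objects are left untouched.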


davidsharp7 commented 2 weeks ago

It looks like, currently, this method in DatasetFacetsDao.java

  default void insertDatasetFacetsFor(
      @NonNull UUID datasetUuid,
      @NonNull UUID datasetVersionUuid,
      @Nullable UUID runUuid,
      @NonNull Instant lineageEventTime,
      @Nullable String lineageEventType,
      @NonNull LineageEvent.DatasetFacets datasetFacets) {

allows runUuid and lineageEventType to be null. The simplest solution would be to do the same for insertInputDatasetFacetsFor and insertOutputDatasetFacetsFor.
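To illustrate the idea, here is a minimal, self-contained sketch of a method that tolerates a missing run. It is not Marquez's actual DAO: the in-memory list, the `FacetRow` type, and the parameter list are assumptions loosely modeled on the signature quoted above, and plain `Objects.requireNonNull` checks stand in for the `@NonNull` annotations.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.UUID;

/** Hypothetical stand-in for a facets DAO; stores rows in memory instead of a database. */
public class NullableRunUuidSketch {

  /** Hypothetical row type, not Marquez's real schema. */
  public static final class FacetRow {
    public final UUID datasetUuid;
    public final UUID runUuid; // may be null for events without a run, such as a JobEvent
    public final Instant lineageEventTime;
    public final String lineageEventType; // may be null as well
    public final String facetName;

    FacetRow(UUID datasetUuid, UUID runUuid, Instant time, String type, String facetName) {
      this.datasetUuid = datasetUuid;
      this.runUuid = runUuid;
      this.lineageEventTime = time;
      this.lineageEventType = type;
      this.facetName = facetName;
    }
  }

  private final List<FacetRow> rows = new ArrayList<>();

  /** Required arguments are checked explicitly; runUuid and lineageEventType are optional. */
  public void insertOutputDatasetFacetsFor(
      UUID datasetUuid,
      UUID runUuid,
      Instant lineageEventTime,
      String lineageEventType,
      List<String> facetNames) {
    Objects.requireNonNull(datasetUuid, "datasetUuid");
    Objects.requireNonNull(lineageEventTime, "lineageEventTime");
    Objects.requireNonNull(facetNames, "facetNames");
    // No null check on runUuid or lineageEventType: both are stored as-is.
    for (String name : facetNames) {
      rows.add(new FacetRow(datasetUuid, runUuid, lineageEventTime, lineageEventType, name));
    }
  }

  public List<FacetRow> rows() {
    return rows;
  }

  public static void main(String[] args) {
    NullableRunUuidSketch dao = new NullableRunUuidSketch();
    // A JobEvent carries no run, so runUuid and lineageEventType are both null here.
    dao.insertOutputDatasetFacetsFor(
        UUID.randomUUID(), null, Instant.now(), null, List.of("outputStatistics"));
    System.out.println("rows stored: " + dao.rows().size());
  }
}
```

In the real code base the change would presumably also need the underlying column and any downstream queries to accept a null run, which this sketch does not cover.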