MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Dataset Versions API Does Not Return All Versions Due to LIMIT and OFFSET Placement in SQL Query #2944

Closed: inanalper closed this issue 4 weeks ago

inanalper commented 4 weeks ago

Description: I've encountered an issue with the Marquez dataset versions API where not all dataset versions are returned, even when the limit parameter is set higher than the total number of versions.

Steps to Reproduce:

Prepare Data: Download the dump (about 130 KB) and initialize the database from it: https://drive.google.com/file/d/1T8LI-NRHg7Qxj_pi7CN0sRm0ZcssooxU/view

API Request: Call /api/v1/namespaces/s3a%3A%2F%2Fproduct-data/datasets/%2F4f5e4a74-d608-48b9-968b-b638ff80654f/versions

Set Limit: Set the limit parameter to 25, 100, and 1000 in turn. The returned lists contain 1, 3, and 6 versions respectively, while the totalCount property is always 6. The API therefore returns fewer versions than expected for the first two limits.
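For anyone who wants to script the check, here is a minimal sketch of the request and the comparison against totalCount. The base URL `http://localhost:5000` and the `versions` key in the response body are assumptions about a local setup, not something stated in this report; the helper names are made up for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # assumption: address of a local Marquez API
NAMESPACE = "s3a://product-data"
DATASET = "/4f5e4a74-d608-48b9-968b-b638ff80654f"


def versions_url(limit):
    """Build the dataset versions URL with percent-encoded path segments."""
    ns = quote(NAMESPACE, safe="")
    ds = quote(DATASET, safe="")
    query = urlencode({"limit": limit})
    return f"{BASE_URL}/api/v1/namespaces/{ns}/datasets/{ds}/versions?{query}"


def missing_versions(body, limit):
    """Return how many versions are absent from a parsed response body.

    Assumes the body has a `totalCount` number and a `versions` list.
    """
    expected = min(limit, body["totalCount"])
    return expected - len(body["versions"])


def check(limit):
    """Hypothetical helper: fetch one page and report the shortfall."""
    with urlopen(versions_url(limit)) as resp:
        body = json.load(resp)
    return missing_versions(body, limit)
```

Calling `check(25)`, `check(100)`, and `check(1000)` against a database loaded from the dump should return 0 each time once the bug is fixed; any positive number means versions are missing from the page.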

Expected Behavior:

The API should return all dataset versions up to the specified limit. If the limit exceeds the total number of versions, all versions should be returned.

Actual Behavior:

The API returns fewer versions than expected, and the number of versions returned does not match the total count, even when the limit is sufficiently high.

Cause:

The issue is due to the placement of the LIMIT and OFFSET clauses within the SQL query used in the DatasetVersionDao.findAll method. The LIMIT and OFFSET are applied within a Common Table Expression (CTE) before grouping and filtering, so the limit counts intermediate rows rather than dataset versions, and the final result can contain fewer versions than requested.
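To illustrate the effect in isolation, here is a small sketch using SQLite and a made-up two-table schema. The tables `versions` and `version_fields`, the two queries, and the helper names are hypothetical and are not the actual Marquez schema or the query in DatasetVersionDao.findAll; they only show how a LIMIT applied to joined rows before GROUP BY differs from a LIMIT applied to the versions themselves.

```python
import sqlite3


def make_db(n_versions=6, fields_per_version=4):
    """Create an in-memory toy database: each version has several field rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE versions (version_id INTEGER PRIMARY KEY);
        CREATE TABLE version_fields (version_id INTEGER, field_name TEXT);
        """
    )
    for v in range(1, n_versions + 1):
        conn.execute("INSERT INTO versions VALUES (?)", (v,))
        conn.executemany(
            "INSERT INTO version_fields VALUES (?, ?)",
            [(v, f"col_{i}") for i in range(fields_per_version)],
        )
    return conn


# LIMIT is applied to the joined rows, before they are grouped per version.
LIMIT_BEFORE_GROUPING = """
WITH page AS (
    SELECT v.version_id, f.field_name
    FROM versions v
    JOIN version_fields f ON f.version_id = v.version_id
    ORDER BY v.version_id, f.field_name
    LIMIT :limit OFFSET :offset
)
SELECT version_id, COUNT(*) AS n_fields
FROM page
GROUP BY version_id
ORDER BY version_id
"""

# LIMIT is applied to the versions first; the join and grouping come after.
LIMIT_ON_VERSIONS = """
WITH page AS (
    SELECT version_id
    FROM versions
    ORDER BY version_id
    LIMIT :limit OFFSET :offset
)
SELECT p.version_id, COUNT(f.field_name) AS n_fields
FROM page p
LEFT JOIN version_fields f ON f.version_id = p.version_id
GROUP BY p.version_id
ORDER BY p.version_id
"""


def run(conn, sql, limit, offset=0):
    """Execute one of the queries above and return all result rows."""
    return conn.execute(sql, {"limit": limit, "offset": offset}).fetchall()
```

With 6 versions of 4 fields each and a limit of 6, `run(make_db(), LIMIT_BEFORE_GROUPING, 6)` yields only 2 versions (the second one incomplete), while `LIMIT_ON_VERSIONS` yields all 6. The first query only returns every version once the limit reaches the number of joined rows, which resembles the pattern observed above.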

I am going to open a PR, following your contribution guidelines, to fix the placement.

boring-cyborg[bot] commented 4 weeks ago

Thanks for opening your first issue in the Marquez project! Please be sure to follow the issue template!

wslulciuc commented 4 weeks ago

Thanks for reporting this, @inanalper, and for the steps to reproduce! 👍