awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Apache License 2.0

adding to the committer list #450

Open haojiliu opened 1 year ago

haojiliu commented 1 year ago

We are a group of engineers interested in contributing to this project. Is there someone we can get in touch with to understand its current status?

rdsharma26 commented 1 year ago

The project is currently supported and maintained. We recently published a new release to Maven. The release targets Spark 3.3. AWS Glue Data Quality, which uses Deequ, was recently released at AWS re:Invent in November 2022.

Contributions to the project are welcome. We will be happy to review your pull requests.

haojiliu commented 1 year ago

Thanks. I may have more stringent requirements in terms of SLA and the types of commits allowed, as I'm working on a critical project. @rdsharma26, can I discuss the details with you? Do you have an email address where I can reach you?

I want to avoid having to maintain my own fork for development when I could commit to Deequ itself.

rdsharma26 commented 1 year ago

@haojiliu To keep Deequ-related communication in one place and for posterity, our preferred method of communication is GitHub Issues. We would love to hear more about your requirements and the changes you plan to make.

Forking the main repository and then creating pull requests against the main repository's master branch is the recommended way of contributing to the project. This is the same practice that our team is following.

haojiliu commented 1 year ago

Thanks @rdsharma26. There are two particular issues that we would like to get fixed as soon as possible, once we confirm they can be reproduced in Deequ:

  1. One, filed by my team, may be non-trivial: https://github.com/awslabs/deequ/issues/426
  2. Another we have yet to file, but here is a one-line description: the approximate distinct count in the profiler differs from the exact distinct count on a low-cardinality column (the actual count is 247, but the approximate count returned is 233). We would like to figure out why, and whether we can trust the approximate distinct count in our data quality checks.
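
To put the gap in item 2 in perspective, it can be expressed as a relative error. The following is a minimal sketch in plain Scala; the helper name `relativeError` is illustrative and not part of Deequ.

```scala
// Illustrative helper, not part of Deequ: relative error of an estimate
// measured against the exact value.
object RelativeErrorExample {
  def relativeError(exact: Double, estimate: Double): Double =
    math.abs(estimate - exact) / exact

  def main(args: Array[String]): Unit = {
    // Figures quoted in item 2: exact count 247, approximate count 233.
    val err = relativeError(exact = 247.0, estimate = 233.0)
    println(f"relative error: ${err * 100}%.2f%%")
  }
}
```

The difference of 14 on 247 is a relative error of a little under 6%. Approximate distinct counts are usually produced by probabilistic sketches such as HyperLogLog++, whose estimates scatter around the true value, so a deviation of this size may be within expected behavior rather than a bug; whether that holds for Deequ's profiler specifically would need to be confirmed against its implementation.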

A question for you: will you or another active maintainer of this project be able to provide a reasonably quick review and release, say within 3-5 days, if we propose fixes and open PRs for these?

xza-m commented 1 year ago

@haojiliu We found the same problem.

rdsharma26 commented 1 year ago

@haojiliu Thank you for the information. We are actively maintaining this project and we will be reviewing any open PRs and providing feedback accordingly.

tanvn commented 1 year ago

@rdsharma26 @haojiliu @meimiao0730 We have experienced the same issue with ApproxCountDistinct; we previously got the following error:

ApproxCountDistinctConstraint(ApproxCountDistinct(hour,None)) : Failure Some(Value: 25.0 does not meet the constraint requirement! The approximate count distinct of hour column should be == 24.)

So we had to switch to hasNumberOfDistinctValues instead: https://github.com/awslabs/deequ/blob/d8bfb9c71bdce712d64d861343a801a5c5a9562c/src/main/scala/com/amazon/deequ/checks/Check.scala#L351
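
An alternative to switching to an exact count is to keep the approximate metric but assert a tolerance band instead of strict equality. The following is a minimal sketch in plain Scala; `withinRelativeError` is a hypothetical helper and not part of Deequ, and the 10% tolerance is an arbitrary choice for illustration.

```scala
// Illustrative helper, not part of Deequ: builds an assertion that accepts
// any value within a relative tolerance of the expected count.
object ToleranceAssertion {
  def withinRelativeError(expected: Double, tolerance: Double): Double => Boolean =
    value => math.abs(value - expected) <= tolerance * expected

  def main(args: Array[String]): Unit = {
    val aboutTwentyFour = withinRelativeError(expected = 24.0, tolerance = 0.1)
    println(aboutTwentyFour(25.0)) // the estimate reported in the failure above
    println(aboutTwentyFour(40.0)) // a value far outside the band
  }
}
```

Assuming the approximate count distinct check accepts a `Double => Boolean` assertion, as the `== 24` in the error message suggests, a function like this could be passed in place of the equality test. The trade-off is that a band wide enough to absorb estimation error will also hide small real changes in cardinality, so an exact count remains the safer choice for low-cardinality columns such as `hour`.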