astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License
5 stars 3 forks source link

Raen/verify/files #379

Open troyraen opened 1 month ago

troyraen commented 1 month ago

Related to #373, #374

Change Description

Solution Description

Code Quality

Project-Specific Pull Request Checklists

Bug Fix Checklist

New Feature Checklist

Documentation Change Checklist

Build/CI Change Checklist

Other Change Checklist

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 29.62963% with 19 lines in your changes missing coverage. Please review.

Project coverage is 98.47%. Comparing base (e7f9b4c) to head (029196d).

Files Patch % Lines
...rc/hipscat_import/verification/run_verification.py 29.62% 19 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #379 +/- ## ========================================== - Coverage 99.72% 98.47% -1.26% ========================================== Files 26 26 Lines 1481 1508 +27 ========================================== + Hits 1477 1485 +8 - Misses 4 23 +19 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

troyraen commented 1 month ago

@delucchi-cmu here are some options. Take a look and let me know what you think. It's obviously not fully integrated yet. I did add some data that will make the tests fail; not sure if you want those extra files or if it should somehow use data that's already here. Also, it probably needs to use your custom file pointers and the user-supplied storage kwargs that you support, but I haven't added them yet. Also haven't done docstrings yet (these probably aren't the functions you actually want anyway).

There are four tests. The one checking the file schemas against _common_metadata is the thing you actually asked for. It feels a little incomplete if the _common_metadata isn't checked against a schema from the user, but if you don't want to rely on user input here I can understand that. The other tests are code that I had handy that's inline with checking individual parquet files. One is row counts, which I know you're taking care of in a different repo but I think you said that didn't check the individual files? In any case, I'm happy to take all those extra three tests out if you just want to focus on the schema right now.

I'm not sure how you want this integrated with the main run function. Right now, the functions I wrote return a boolean indicating pass/fail. What do you want it to do if it fails? I'm guessing it should not raise an error. Should it just print messages to std out, or create a report file, or ..?

github-actions[bot] commented 19 hours ago
Before [827aac0b] After [81cea4c9] Ratio Benchmark (Parameter)
failed failed n/a benchmarks.BinningSuite.time_read_histogram

Click here to view all benchmarks.