astronomy-commons / hipscat-import

HiPSCat import - generate HiPSCat-partitioned catalogs
https://hipscat-import.readthedocs.io
BSD 3-Clause "New" or "Revised" License
5 stars 2 forks source link

Spruce up the way we check for original input files. #314

Closed delucchi-cmu closed 1 month ago

delucchi-cmu commented 2 months ago

Change Description

Closes #299

Solution Description

Converts paths to strings before performing comparison, so plain-text strings and Pathlib objects can be compared reasonably. Also, converts both sides into lists for comparisons.

Code Quality

Bug Fix Checklist

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.52%. Comparing base (8b19bc2) to head (2b1dd4d).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #314 +/- ## ======================================= Coverage 99.52% 99.52% ======================================= Files 25 25 Lines 1270 1270 ======================================= Hits 1264 1264 Misses 6 6 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

camposandro commented 1 month ago

Looks good to me!

I just have a small question, should we be writing the set of unique paths, unique_file_paths, instead of input_paths to disk? It doesn't really change the behavior in this case since we create a set on read but it might improve readability.

https://github.com/astronomy-commons/hipscat-import/blob/48dc9acc3765954af0a3a46bbbc7b6f0650370c0/src/hipscat_import/pipeline_resume_plan.py#L190-L193

delucchi-cmu commented 1 month ago

Looks good to me!

I just have a small question, should we be writing the set of unique paths, unique_file_paths, instead of input_paths to disk? It doesn't really change the behavior in this case since we create a set on read but it might improve readability.

https://github.com/astronomy-commons/hipscat-import/blob/48dc9acc3765954af0a3a46bbbc7b6f0650370c0/src/hipscat_import/pipeline_resume_plan.py#L190-L193

I think it would change the behavior, but only in that it would be more correct. Since we're sorting and de-duping before we check, we should use the results of those operations in the write-to-disk.