astronomy-commons / hipscat

Hierarchical Progressive Survey Catalog
https://hipscat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Pass filesystem to file operations. #311

Closed · delucchi-cmu closed this 2 months ago

delucchi-cmu commented 2 months ago

Change Description

Closes #307

Solution Description

Passes any user-provided file_system object along to fsspec calls.
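For illustration only, here is a minimal sketch of that pattern (the helper names `get_fs` and `load_text_file` and their signatures are assumptions, not the actual hipscat API): a caller-supplied fsspec filesystem is used when given, otherwise one is inferred from the path.

```python
# Hypothetical sketch of the pattern: prefer a user-provided fsspec
# filesystem, otherwise infer one from the path. Not hipscat's real code.
from typing import Optional, Tuple

import fsspec


def get_fs(
    path: str,
    file_system: Optional[fsspec.AbstractFileSystem] = None,
    storage_options: Optional[dict] = None,
) -> Tuple[fsspec.AbstractFileSystem, str]:
    """Return (filesystem, path), using the caller's filesystem when given."""
    if file_system is not None:
        return file_system, path
    fs, _token, paths = fsspec.get_fs_token_paths(
        path, storage_options=storage_options or {}
    )
    return fs, paths[0]


def load_text_file(path: str, file_system=None, storage_options=None) -> str:
    # Every downstream read goes through the resolved filesystem object.
    fs, resolved_path = get_fs(path, file_system, storage_options)
    with fs.open(resolved_path, "r") as file_handle:
        return file_handle.read()
```

The essential point is that the `file_system` keyword has to be forwarded through every intermediate call down to the place where fsspec is actually invoked.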

Code Quality

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 99.10714% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.75%. Comparing base (047600e) to head (6cf6d84).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/hipscat/io/file_io/file_pointer.py | 90.00% | 1 Missing :warning: |
Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##             main     #311    +/-  ##
==========================================
- Coverage   93.78%   93.75%   -0.04%
==========================================
  Files          58       58
  Lines        2028     2033      +5
==========================================
+ Hits         1902     1906      +4
- Misses        126      127      +1
```


github-actions[bot] commented 2 months ago
| Before [047600e6] <v0.3.7> | After [b3ae0e21] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 19.4±0.3ms | 19.4±0.4ms | 1.01 | benchmarks.MetadataSuite.time_load_partition_info_order6 |
| 378±4ms | 383±4ms | 1.01 | benchmarks.Suite.time_outer_pixel_alignment |
| 41.2±0.7ms | 41.8±0.7ms | 1.01 | benchmarks.Suite.time_pixel_tree_creation |
| 119±0.4ms | 120±0.5ms | 1.01 | benchmarks.time_test_alignment_even_sky |
| 77.5±0.9ms | 77.7±0.9ms | 1 | benchmarks.MetadataSuite.time_load_partition_info_order7 |
| 78.0±1ms | 77.6±0.7ms | 0.99 | benchmarks.MetadataSuite.time_load_partition_join_info |
| 89.4±2ms | 88.4±2ms | 0.99 | benchmarks.Suite.time_paths_creation |
| 13.3±0.4ms | 13.1±0.3ms | 0.98 | benchmarks.Suite.time_inner_pixel_alignment |
| 1.00±0.02ms | 971±5μs | 0.97 | benchmarks.time_test_cone_filter_multiple_order |


hombit commented 2 months ago

I tried to run it:

```python
import gdrivefs
import lsdb

gdfs = gdrivefs.GoogleDriveFileSystem(token='cache', root_file_id='1mocyakfy_8OgFGOIQ813S7POqwdDtfX_')
lsdb.read_hipscat('', file_system=gdfs)
```

It failed with:

```
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hombit/projects/lincc-frameworks/lsdb/catalog_info.json'
```

While debugging, I found that `file_system` is `None` in `read_from_metadata_file()`.
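For context, this is the classic shape of the symptom being reported; a purely hypothetical sketch (the wrapper name and call chain are illustrative, not hipscat's actual code) where one intermediate call fails to forward the keyword:

```python
# Hypothetical illustration of the reported symptom: an intermediate
# helper drops the file_system keyword, so the leaf reader sees None
# and falls back to the local filesystem.
import fsspec


def read_from_metadata_file(path, file_system=None):
    fs = file_system or fsspec.filesystem("file")  # arrives as None here
    with fs.open(path, "r") as file_handle:
        return file_handle.read()


def read_catalog_info(path, file_system=None):
    # Bug: file_system is accepted but not passed along.
    return read_from_metadata_file(path)
```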

delucchi-cmu commented 2 months ago

> I tried to run it:
>
> ```python
> import gdrivefs
> import lsdb
>
> gdfs = gdrivefs.GoogleDriveFileSystem(token='cache', root_file_id='1mocyakfy_8OgFGOIQ813S7POqwdDtfX_')
> lsdb.read_hipscat('', file_system=gdfs)
> ```
>
> It failed with:
>
> ```
> FileNotFoundError: [Errno 2] No such file or directory: '/Users/hombit/projects/lincc-frameworks/lsdb/catalog_info.json'
> ```
>
> While debugging, I found that `file_system` is `None` in `read_from_metadata_file()`.

The GDFS implementation is weird, and I've found that I can't reference files if I pass the hipscat catalog itself as the root_file_id. If you have a directory structure in Google Drive like the following:

```
└── [file_id=1000] hipscat/
    └── [file_id=0100] catalogs/
        ├── [file_id=0101] catalog_a/
        └── [file_id=0102] catalog_b/
```

Then you can do something like:

```python
gdfs = gdrivefs.GoogleDriveFileSystem(token='cache', root_file_id='1000')
lsdb.read_hipscat('catalogs/catalog_a', file_system=gdfs)
```

or

```python
gdfs = gdrivefs.GoogleDriveFileSystem(token='cache', root_file_id='0100')
lsdb.read_hipscat('catalog_a', file_system=gdfs)
```

I don't know why this is the case.

hombit commented 2 months ago

@delucchi-cmu It still doesn't work for me =(

```python
import gdrivefs
import lsdb

gdfs = gdrivefs.GoogleDriveFileSystem(token='cache', root_file_id='17_8v782e6kK22hAJ_p1AzHXmjLpKFT4w')
lsdb.read_hipscat('gaia_dr3_pm_greater_100', file_system=gdfs)
```

It fails with:

```
FileNotFoundError: [Errno 2] No such file or directory: '/Users/hombit/projects/lincc-frameworks/lsdb/gaia_dr3_pm_greater_100/catalog_info.json'
```

Meanwhile, I can do `gdfs.open("gaia_dr3_pm_greater_100/catalog_info.json").read()` just fine.
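One way to narrow this down (just a debugging sketch, reusing the filesystem and path from the comment above) is to confirm the file is reachable through the passed filesystem object on its own, which would isolate the problem to the path/filesystem handling inside read_hipscat:

```python
# Debugging sketch: verify the catalog file is visible through the
# user-supplied filesystem, independent of lsdb/hipscat path handling.
import gdrivefs

gdfs = gdrivefs.GoogleDriveFileSystem(
    token='cache', root_file_id='17_8v782e6kK22hAJ_p1AzHXmjLpKFT4w'
)
print(gdfs.exists("gaia_dr3_pm_greater_100/catalog_info.json"))  # reportedly True
with gdfs.open("gaia_dr3_pm_greater_100/catalog_info.json") as file_handle:
    print(file_handle.read()[:200])
```

The traceback above shows the path being resolved against the local working directory, which suggests the `gdfs` object is not reaching the call that actually opens `catalog_info.json`.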