caracal-pipeline / stimela

Stimela 2.0
GNU General Public License v2.0
add a dtype to support fsspec MSs #209

Closed o-smirnov closed 6 months ago

o-smirnov commented 7 months ago

Since dask-ms apps can use an S3 backend for their MSs, the current MS dtype is not quite adequate. Introduce a new type that is fsspec-aware. @JSKenyon @sjperkins, do you have an example of how to query fsspec?

JSKenyon commented 7 months ago

I think it is used as follows in dask-ms: https://github.com/ratt-ru/dask-ms/blob/a0043fba3eae3eabdbdd6e2fb1f22abf7d762dbb/daskms/fsspec_store.py#L17

Edit: Your use may actually be simpler as you probably don't need to know whether it is zarr, parquet or casa table backed.

o-smirnov commented 7 months ago

Just as a note to self before I forget: the reason this matters (as opposed to simply making the MS name input a plain string) is that the singularity backend needs to know which directories will be accessed, so that they can be bound inside the container. For MSs nested under the CWD, this doesn't matter, since the CWD is always bound. The problem arises when the MS is somewhere else in the directory hierarchy.
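The binding decision described above can be sketched roughly as follows. This is only an illustration, not stimela's actual implementation: the function name `dir_to_bind` is hypothetical, and binding the MS's parent directory (rather than the MS itself) is an assumption.

```python
import os


def dir_to_bind(ms_path, cwd):
    """Return the directory that would need an extra bind for ms_path,
    or None if the (always bound) CWD already covers it.

    Hypothetical helper for illustration only.
    """
    cwd = os.path.abspath(cwd)
    # Resolve relative paths against the CWD; absolute paths pass through.
    abs_path = os.path.abspath(os.path.join(cwd, ms_path))
    # commonpath() equals cwd exactly when abs_path is nested under cwd.
    if os.path.commonpath([abs_path, cwd]) == cwd:
        return None
    # Assumption: bind the parent directory of the MS.
    return os.path.dirname(abs_path)
```

With POSIX paths, `dir_to_bind("data/obs.ms", "/home/user/run")` needs no extra bind, while `dir_to_bind("/scratch/obs.ms", "/home/user/run")` yields `/scratch`. A remote URL has no local directory to bind at all, which is why the dtype needs to tell the two cases apart.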

o-smirnov commented 6 months ago

fsspec looks overly complicated for what I need, so I'd rather not add the extra dependency. All I need to know is whether a given string is a dask-ms URL or a path to a local file.

A simple regex will do. I just need to know what the possibilities to match are. Hence, question for @JSKenyon @sjperkins, is it true that all dask-ms URLs look like foo::bar://baz or bar://baz?
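A minimal sketch of such a regex check, assuming the only forms to match really are `foo::bar://baz` and `bar://baz`. The names `URL_RE` and `is_url` are hypothetical, and the pattern has not been checked against everything dask-ms accepts.

```python
import re

# Optional "foo::" prefix, then an RFC 3986-style scheme, then "://".
# Assumption: anything that does not match is treated as a local path.
URL_RE = re.compile(
    r"^(?:(?P<prefix>\w+)::)?"                 # optional "foo::" prefix
    r"(?P<scheme>[A-Za-z][A-Za-z0-9+.\-]*)://"  # scheme such as "bar://"
    r"(?P<rest>.*)$"                           # remainder, e.g. "baz"
)


def is_url(value):
    """Return True if value looks like foo::bar://baz or bar://baz."""
    return URL_RE.match(value) is not None
```

Plain paths such as `/data/obs.ms` or `obs.ms` do not match, so they would be handled as local files. One caveat: `file:///data/obs.ms` does match, even though it refers to a local file, so a real implementation might need to special-case that scheme.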

sjperkins commented 6 months ago

This is probably a reasonable subset:

o-smirnov commented 6 months ago

Thanks. Finally, what's a good name for this dtype? MSX? DaskMS? DMS?

sjperkins commented 6 months ago

I think the above are fairly generic URL schemes. I wouldn't say they're dask-ms specific. Would a url dtype work?

sjperkins commented 6 months ago

Thinking about this a bit more, perhaps uri would be better than url as it references both local and remote datasets.