MillionConcepts / pdr

[P]lanetary [D]ata [R]eader - A single function to read all Planetary Data System (PDS) data into Python

Flexible handling of paths to format files #36

Closed msbentley closed 1 year ago

msbentley commented 1 year ago

Hi pdr team,

I know this is listed as an area for improvement, but just wanted to flag it because we have a use case that really relies on this. We are trying to deploy data tutorials for Rosetta using ESA DataLabs (basically hosting Jupyter Notebooks and similar close to the data, with read-only direct access to the archive). For this use case we really need pdr to either

It would be great to know what your thoughts are and/or any timescales, so that we can prioritise instruments that don't use format files above those that do until there is a clear path.

Thanks again!

cmillion commented 1 year ago

We're going to try to implement the following two features by the end of this week: (1) Specify format file path. (2) Find dataset root if local storage.

We don't currently have near-term plans to support something like "find data set root if url."

msbentley commented 1 year ago

Fantastic, thanks @cmillion!

michaelaye commented 1 year ago

I’d just comment that anything that feels like outside the scope of a reader library we could manage to do in planetarypy. I’m already doing a fair amount of local data management in there, which just needs to be streamlined a bit so that it is as similar as possible between instruments. And it’s not easy to define the smallest common divider of required paths for datasets as the science case can vary a lot.

m-stclair commented 1 year ago

Hi Mark,

I've added a couple of little features (on the develop branch) that should make this work for your specific use case:

1) If a target data/label file sits at any depth under a 'data' or 'DATA' directory and is missing a format file, pdr will check for format files in a 'LABEL' or 'label' subdirectory of the 'data'/'DATA' directory's parent directory. This is not very sophisticated at the moment but will work for most PDS3 archive volumes.

2) When initializing pdr.Data, including with the pdr.read / pdr.open constructor functions, you can explicitly specify additional search paths by passing a sequence of str or pathlib.Path as the search_paths kwarg. pdr.Data will check these search_paths for any files it needs and can't find (in addition to the parent directory of the data/label file you initialize it from). e.g.:

    data = pdr.read('/pds_data/TABLE.DAT', search_paths=('/home/michael/format_file_directory', '/pds_data/other_format_files/'))

Please let me know if one of these addresses your use case; I'm going to leave this issue open pending your testing.
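For readers following along, the first feature can be sketched roughly as below. This is only an illustration of the idea as described in this comment, not pdr's actual code: the helper name `candidate_label_dirs` and the example path are made up, and the sketch assumes the nearest 'DATA'/'data' ancestor is the one that matters. It only builds candidate paths and never touches the filesystem.

```python
from pathlib import PurePosixPath


def candidate_label_dirs(target):
    """Return LABEL/label directories beside the nearest DATA/data ancestor.

    Hypothetical helper for illustration; pdr's real logic may differ.
    """
    path = PurePosixPath(target)
    # Walk upward from the file's own directory toward the root.
    for ancestor in path.parents:
        if ancestor.name in ("DATA", "data"):
            root = ancestor.parent  # presumed dataset root
            return [root / "LABEL", root / "label"]
    return []  # no DATA/data directory anywhere above the file


# Made-up path, laid out like a typical PDS3 volume.
candidates = candidate_label_dirs("/pds_data/VOLUME/DATA/SUB/TABLE.DAT")
print([str(c) for c in candidates])
```

A real implementation would then test each candidate directory for the missing format file before falling back to any explicitly supplied search_paths.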

m-stclair commented 1 year ago

@michaelaye I agree that there's a good chance that more sophisticated path management might be better addressed through a separate library. Could you point me to some of the pointy-end local data management stuff you're doing in planetarypy?

msbentley commented 1 year ago

Thanks @m-stclair, I tested by downloading a PDS3 product with associated format files:

https://archives.esac.esa.int/psa/pdap/download?CLIENT=epntap&RESOURCE_CLASS=PRODUCT&ID=RO-C-MIDAS-5-PRL-TO-EXT3-V2.0:DATA:CAH_0409522_1627308_01

and pointed pdr to the data product and it read AOK:

In [31]: cd RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/
/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0

In [32]: ls
AAREADME.TXT  DATA/         DOCUMENT/     LABEL/        VOLDESC.CAT

In [33]: ls LABEL/
CAH_STRUCTURE.FMT

In [34]: cd DATA/
/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/DATA

In [35]: ls
CAH_0409522_1627308_01.LBL  CAH_0409522_1627308_01.TAB

In [36]: cah = pdr.open('CAH_0409522_1627308_01.LBL')

In [37]: cah
Out[37]:
pdr.Data(/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/DATA/CAH_0409522_1627308_01.LBL)
keys=['LABEL', 'ARCHIVE_CONTEXT_DESC', 'TIP_IMAGE_CATALOG_DESC', 'EVENT_TABLE']
not yet loaded: ('LABEL', 'ARCHIVE_CONTEXT_DESC', 'TIP_IMAGE_CATALOG_DESC', 'EVENT_TABLE')

In [38]: cah.EVENT_TABLE
Out[38]:
        START_OBT                START_UTC     STOP_OBT                 STOP_UTC     EVENT  AC_GAIN  DC_GAIN  EXC_LVL  U_MAX    F_MAX SCAN_MODE
0      39742116.0  2004-04-04T23:28:50.021   39742259.0  2004-04-04T23:31:13.021  F-SEARCH        2        0        2   4.72  83706.4       DYN
1      72541228.0  2005-04-19T14:20:48.107   72541371.0  2005-04-19T14:23:11.107  F-SEARCH        2        0        2   1.56  83727.1       DYN
2      86771860.0  2005-10-01T07:18:02.464   86771878.0  2005-10-01T07:18:20.464  F-SEARCH        2        0        2   1.37  83751.4       DYN
3      87095860.0  2005-10-05T01:18:02.516   87095878.0  2005-10-05T01:18:20.516  F-SEARCH        2        0        2   1.27  83750.4       DYN
4     100121755.0  2006-03-04T19:36:18.629  100121773.0  2006-03-04T19:36:36.629  F-SEARCH        2        0        2   2.45  83759.4       DYN
...           ...                      ...          ...                      ...       ...      ...      ...      ...    ...      ...       ...
1337  432314286.0  2016-09-12T15:19:35.038  432315243.0  2016-09-12T15:35:32.038  SCANNING        0        0        0   0.00      0.0       DYN
1338  432315632.0  2016-09-12T15:42:01.038  432315787.0  2016-09-12T15:44:36.038  F-SEARCH        3        0        2  10.00  83630.2       DYN
1339  432317008.0  2016-09-12T16:04:57.039  432326768.0  2016-09-12T18:47:37.042  SCANNING        0        0        0   0.00      0.0       DYN
1340  432321711.0  2016-09-12T17:23:20.040  432321754.0  2016-09-12T17:24:03.040  F-SEARCH        3        0        2  10.00  83629.6       DYN
1341  432987271.0  2016-09-20T10:16:00.257  432987426.0  2016-09-20T10:18:35.257  F-SEARCH        3        0        2  10.00  83628.0       DYN

[1342 rows x 11 columns]

I didn't test the second use case yet, but the first looks great to me!

michaelaye commented 1 year ago

> @michaelaye I agree that there's a good chance that more sophisticated path management might be better addressed through a separate library. Could you point me to some of the pointy-end local data management stuff you're doing in planetarypy?

Nothing too fancy, but see for example my CTX classes here, where general CTX class HAS an EDR class to refer to for source paths, while the CTX class manages a processing folder, with both related to a module/package-wide storage location as defined per configuration file: https://github.com/michaelaye/nbplanetary/blob/master/planetarypy/ctx.py#L145-L174

I have remaining work to do to generalize this scheme enough so that some basic features of this fit to every instrument, with only minor adaptations between different instruments.

msbentley commented 1 year ago

Hmm, having tested it locally to great success, I removed the conda pdr package and pip installed the dev branch of pdr on DataLabs and cannot get it to work here :-/ (the Rosetta/OSIRIS fix from the parallel ticket works fine, so the correct version is installed etc.). This is Linux rather than macOS, and the path is different, but otherwise it is a similar data/setup. Any ideas?

[image]

m-stclair commented 1 year ago

pdr.Data expects a sequence of strings or paths for the search_paths kwarg, so that example will probably work if you change it to:

    c2 = pdr.read(c2_file, search_paths=('/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/LABEL',))

Without the comma, it's probably trying to check '/' and 'm' and 'e' and 'd', etc. I will make this more permissive, because that Python gotcha where a single str inside parentheses is not cast to tuple[str] (like it would be to list[str] inside brackets) is very annoying.
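The gotcha is easy to reproduce in plain Python, independent of pdr. A small sketch (the variable names are just for illustration):

```python
label_dir = '/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/LABEL'

no_comma = (label_dir)     # parentheses alone do nothing: still a str
with_comma = (label_dir,)  # the trailing comma makes a one-element tuple
in_brackets = [label_dir]  # brackets always make a list

# Iterating over a str yields single characters, which is what a
# function expecting a sequence of paths would end up searching.
first_four = list(no_comma)[:4]
print(type(no_comma).__name__, type(with_comma).__name__, first_four)
```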

I'm guessing the first example's not working because there's another 'data' in the path, and I didn't consider that case in the little hack I wrote -- it probably checked '/media/label' and '/media/LABEL'. I'll adjust for this case as well.
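To make the suspected failure concrete, here is a small sketch comparing two ways of picking the 'data' directory from a path. It is an assumption about what the hack did, not a copy of it: `label_dir_guess` and the file name EXAMPLE.LBL are hypothetical, and only the upper-case 'LABEL' candidate is shown.

```python
from pathlib import PurePosixPath


def label_dir_guess(target, nearest):
    """Guess the LABEL directory from a 'data'/'DATA' component of the path.

    nearest=False uses the first match counting from the root;
    nearest=True uses the match closest to the file.
    """
    parts = PurePosixPath(target).parts
    # Positions of every directory component named like a data directory.
    hits = [i for i, part in enumerate(parts[:-1]) if part in ("data", "DATA")]
    if not hits:
        return None
    index = hits[-1] if nearest else hits[0]
    return PurePosixPath(*parts[:index]) / "LABEL"


target = "/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/DATA/EXAMPLE.LBL"
outermost = label_dir_guess(target, nearest=False)
closest = label_dir_guess(target, nearest=True)
print(outermost)  # /media/LABEL
print(closest)    # /media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/LABEL
```

Matching from the root lands on '/media/LABEL', which would explain the miss; matching from the file upward finds the volume's own LABEL directory.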

Thank you for your continued testing!

m-stclair commented 1 year ago

@michaelaye thank you, i will check it out!

msbentley commented 1 year ago

Thanks @m-stclair - yes, the trick with the search path worked fine for the second case. Ahh yes, understood re the "data" in the pathname - unfortunately that's out of my control now. In any case I can progress using the search path and am happy to test further tweaks as available :)

m-stclair commented 1 year ago

91f0604 adds sensible handling for str values of search_paths, and also changes the crude repository-root-finding so that it should work in tree structures like the one you're operating in. Let me know if this works for you!
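For illustration, "sensible handling for str values" could look something like the following. This is a guess at the general approach, not the code in that commit; `normalize_search_paths` is a hypothetical name.

```python
import os
from pathlib import Path


def normalize_search_paths(search_paths):
    """Accept one path or a sequence of paths; always return a tuple of Path.

    Hypothetical helper; the actual pdr change may be structured differently.
    """
    # A bare str or path-like object is treated as a single search path
    # rather than being iterated character by character.
    if isinstance(search_paths, (str, os.PathLike)):
        return (Path(search_paths),)
    return tuple(Path(p) for p in search_paths)


single = normalize_search_paths("/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/LABEL")
several = normalize_search_paths(["/a/LABEL", Path("/b/LABEL")])
```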

Note that this will still create problems for archives with structures like:

...but there is only so much we can do!

m-stclair commented 1 year ago

(Of course, we could crawl up through every branch of the entire tree, but in some cases this could create dozens of extra checks per ancillary file, and pdr is used in several environments backed by slowish HDDs, so in many cases that would add a lot of fruitless I/O overhead.)
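A rough way to see the cost, under one simple reading of "crawling": suppose every ancestor directory of the product were checked for a 'LABEL' or 'label' subdirectory containing the format file. The helper name, the path, and the file names below are all made up for the sketch.

```python
from pathlib import PurePosixPath


def crawl_candidates(target, filename, subdirs=("LABEL", "label")):
    """List every location an all-ancestors crawl would have to check."""
    checks = []
    for ancestor in PurePosixPath(target).parents:
        for subdir in subdirs:
            checks.append(ancestor / subdir / filename)
    return checks


checks = crawl_candidates(
    "/media/data/rosetta/VOLUME/DATA/IMG/PRODUCT.LBL", "STRUCTURE.FMT"
)
print(len(checks))  # 14: seven ancestors, two subdirectory names each
```

Each entry would be a separate filesystem lookup, repeated for every ancillary file a label references, and the count grows with directory depth and with the number of directory names tried.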

msbentley commented 1 year ago

Thanks again @m-stclair! I confirm that passing a simple string for search_paths now works as well! I'm still struggling with the directory tree scanning, though, even though my scenario looks pretty standard:

(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ ls -l
total 156
-rw-rw-r-- 1 202 50142   8115 Sep 10  2019 AAREADME.TXT
drwxrwxr-x 2 202 50142 110592 Sep 10  2019 BROWSE
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 CALIB
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 CATALOG
drwxrwxr-x 3 202 50142   8192 Sep 10  2019 DATA
drwxrwxr-x 3 202 50142   8192 Sep 10  2019 DOCUMENT
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 INDEX
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 LABEL
-rw-rw-r-- 1 202 50142   3342 Sep 10  2019 VOLDESC.CAT
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ cd DATA/
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/DATA$ ls 
CAH_0409522_1627308_01.LBL  CAH_0409522_1627308_08.LBL  CAH_0409522_1627308_15.LBL  TGH_0409522_1627308_04.TAB  TGH_0409522_1627308_11.TAB  TGH_0409522_1627308_18.TAB  TGH_0409522_1627308_25.TAB
CAH_0409522_1627308_01.TAB  CAH_0409522_1627308_08.TAB  CAH_0409522_1627308_15.TAB  TGH_0409522_1627308_05.LBL  TGH_0409522_1627308_12.LBL  TGH_0409522_1627308_19.LBL  TGH_0409522_1627308_33.LBL
CAH_0409522_1627308_02.LBL  CAH_0409522_1627308_09.LBL  CAH_0409522_1627308_16.LBL  TGH_0409522_1627308_05.TAB  TGH_0409522_1627308_12.TAB  TGH_0409522_1627308_19.TAB  TGH_0409522_1627308_33.TAB
CAH_0409522_1627308_02.TAB  CAH_0409522_1627308_09.TAB  CAH_0409522_1627308_16.TAB  TGH_0409522_1627308_06.LBL  TGH_0409522_1627308_13.LBL  TGH_0409522_1627308_20.LBL  TGH_0409522_1627308_34.LBL
CAH_0409522_1627308_03.LBL  CAH_0409522_1627308_10.LBL  IMG                         TGH_0409522_1627308_06.TAB  TGH_0409522_1627308_13.TAB  TGH_0409522_1627308_20.TAB  TGH_0409522_1627308_34.TAB
CAH_0409522_1627308_03.TAB  CAH_0409522_1627308_10.TAB  MID_PARTICLE_TABLE.LBL      TGH_0409522_1627308_07.LBL  TGH_0409522_1627308_14.LBL  TGH_0409522_1627308_21.LBL  TGH_0409522_1627308_35.LBL
CAH_0409522_1627308_04.LBL  CAH_0409522_1627308_11.LBL  MID_PARTICLE_TABLE.TAB      TGH_0409522_1627308_07.TAB  TGH_0409522_1627308_14.TAB  TGH_0409522_1627308_21.TAB  TGH_0409522_1627308_35.TAB
CAH_0409522_1627308_04.TAB  CAH_0409522_1627308_11.TAB  TGH_0409522_1627308_01.LBL  TGH_0409522_1627308_08.LBL  TGH_0409522_1627308_15.LBL  TGH_0409522_1627308_22.LBL  TGH_0409522_1627308_36.LBL
CAH_0409522_1627308_05.LBL  CAH_0409522_1627308_12.LBL  TGH_0409522_1627308_01.TAB  TGH_0409522_1627308_08.TAB  TGH_0409522_1627308_15.TAB  TGH_0409522_1627308_22.TAB  TGH_0409522_1627308_36.TAB
CAH_0409522_1627308_05.TAB  CAH_0409522_1627308_12.TAB  TGH_0409522_1627308_02.LBL  TGH_0409522_1627308_09.LBL  TGH_0409522_1627308_16.LBL  TGH_0409522_1627308_23.LBL  TGH_0409522_1627308_37.LBL
CAH_0409522_1627308_06.LBL  CAH_0409522_1627308_13.LBL  TGH_0409522_1627308_02.TAB  TGH_0409522_1627308_09.TAB  TGH_0409522_1627308_16.TAB  TGH_0409522_1627308_23.TAB  TGH_0409522_1627308_37.TAB
CAH_0409522_1627308_06.TAB  CAH_0409522_1627308_13.TAB  TGH_0409522_1627308_03.LBL  TGH_0409522_1627308_10.LBL  TGH_0409522_1627308_17.LBL  TGH_0409522_1627308_24.LBL  TGH_0409522_1627308_44.LBL
CAH_0409522_1627308_07.LBL  CAH_0409522_1627308_14.LBL  TGH_0409522_1627308_03.TAB  TGH_0409522_1627308_10.TAB  TGH_0409522_1627308_17.TAB  TGH_0409522_1627308_24.TAB  TGH_0409522_1627308_44.TAB
CAH_0409522_1627308_07.TAB  CAH_0409522_1627308_14.TAB  TGH_0409522_1627308_04.LBL  TGH_0409522_1627308_11.LBL  TGH_0409522_1627308_18.LBL  TGH_0409522_1627308_25.LBL
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/DATA$ cd ..
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ find . -iname "*data*" -type d
./DATA
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ find . -iname "*label*" -type d
./LABEL
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$

m-stclair commented 1 year ago

hm, weird. I'm missing something here. Would it be possible for you to send me a minimal copy of this tree structure, containing all directories but just the files of interest?

msbentley commented 1 year ago

Sure - I'm assuming you don't need me to go above the dataset root? (all of the Rosetta datasets are stored here at /media/data/rosetta). Within the dataset in question we have:

(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ tree -d
.
├── BROWSE
├── CALIB
├── CATALOG
├── DATA
│   ├── CAH_0409522_1627308_01.LBL
│   ├── CAH_0409522_1627308_01.TAB
│   └── IMG
├── DOCUMENT
│   └── CODE
├── INDEX
└── LABEL
    ├── CAH_STRUCTURE.FMT
    └── TGH_STRUCTURE.FMT

m-stclair commented 1 year ago

cool, thank you. I will investigate.

m-stclair commented 1 year ago

This should be fully addressed in ade6626612d76e717548e4948da922c24df1a286 (included in this morning's v0.7.2 release).

msbentley commented 1 year ago

Thanks @m-stclair! Do you know when conda-forge will have the 0.7.2 release? I still see the previous version through the conda CLI and at https://anaconda.org/conda-forge/pdr

Sierra-MC commented 1 year ago

Hi @msbentley,

Thanks for bringing that to our attention. Looks like there was a snag our autoupdate conda bot hit that had to be remedied. That's been fixed in our feedstock so in the next couple of hours it should be updated on conda-forge.

msbentley commented 1 year ago

Thanks! In the meantime I tested with code from the repo and with this last update it works like a treat