Closed — msbentley closed this issue 1 year ago
We're going to try to implement the following two features by the end of this week: (1) Specify format file path. (2) Find dataset root if local storage.
We don't currently have near-term plans to support something like "find data set root if url."
Fantastic, thanks @cmillion!
I’d just comment that anything that feels outside the scope of a reader library we could manage in planetarypy. I’m already doing a fair amount of local data management there, which just needs to be streamlined a bit so that it is as similar as possible between instruments. And it’s not easy to define the smallest common denominator of required paths for datasets, as the science case can vary a lot.
Hi Mark,
I've added a couple little features (on the develop branch) that should make this work for your specific use case:
1) If a target data/label file is at some depth under a 'data' or 'DATA' directory and is missing a format file, `pdr` will check for format files in a 'LABEL' or 'label' subdirectory of that 'data'/'DATA' directory's parent directory. This is not very sophisticated at the moment, but it will work for most PDS3 archive volumes.
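For context, that lookup can be sketched roughly like this (a simplified illustration only, not pdr's actual implementation; `guess_format_file_dirs` is a hypothetical name):

```python
from pathlib import Path

def guess_format_file_dirs(product_path):
    """Illustrative sketch: for a product somewhere under a DATA
    directory, collect any LABEL/label sibling directories of DATA's
    parent, where format (.FMT) files often live on PDS3 volumes."""
    product_path = Path(product_path).resolve()
    candidates = []
    for ancestor in product_path.parents:
        if ancestor.name.lower() != "data":
            continue
        volume_root = ancestor.parent
        candidates.extend(
            entry for entry in volume_root.iterdir()
            if entry.is_dir() and entry.name.lower() == "label"
        )
    return candidates
```

For a file at `/volume/DATA/X.LBL`, this would suggest checking `/volume/LABEL` (if it exists) for format files.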
2) When initializing `pdr.Data`, including with the `pdr.read` / `pdr.open` constructor functions, you can explicitly specify additional search paths by passing a sequence of `str` or `pathlib.Path` as the `search_paths` kwarg. `pdr.Data` will check these `search_paths` for any files it needs and can't find (in addition to the parent directory of the data/label file you initialize it from), e.g.:

```python
data = pdr.read('/pds_data/TABLE.DAT', search_paths=('/home/michael/format_file_directory', '/pds_data/other_format_files/'))
```
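Conceptually, the resolution order amounts to something like the following sketch (`find_referenced_file` is a hypothetical helper written for illustration, not part of pdr's API):

```python
from pathlib import Path

def find_referenced_file(filename, label_dir, search_paths=()):
    """Check the label's own directory first, then each extra
    search path, returning the first match (illustrative only)."""
    for directory in (label_dir, *search_paths):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(filename)
```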
Please let me know if one of these addresses your use case; I'm going to leave this issue open pending your testing.
@michaelaye I agree that there's a good chance that more sophisticated path management might be better addressed through a separate library. Could you point me to some of the pointy-end local data management stuff you're doing in planetarypy?
Thanks @m-stclair - I tested by downloading a PDS3 product with associated format files and pointed pdr to the data product, and it read AOK:
```
In [31]: cd RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/
/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0

In [32]: ls
AAREADME.TXT  DATA/  DOCUMENT/  LABEL/  VOLDESC.CAT

In [33]: ls LABEL/
CAH_STRUCTURE.FMT

In [34]: cd DATA/
/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/DATA

In [35]: ls
CAH_0409522_1627308_01.LBL  CAH_0409522_1627308_01.TAB

In [36]: cah = pdr.open('CAH_0409522_1627308_01.LBL')

In [37]: cah
Out[37]:
pdr.Data(/Users/mbentley/Downloads/RO-C-MIDAS-3-EXT3-SAMPLES-V3.0/DATA/CAH_0409522_1627308_01.LBL)
keys=['LABEL', 'ARCHIVE_CONTEXT_DESC', 'TIP_IMAGE_CATALOG_DESC', 'EVENT_TABLE']
not yet loaded: ('LABEL', 'ARCHIVE_CONTEXT_DESC', 'TIP_IMAGE_CATALOG_DESC', 'EVENT_TABLE')

In [38]: cah.EVENT_TABLE
Out[38]:
       START_OBT                START_UTC     STOP_OBT                 STOP_UTC     EVENT  AC_GAIN  DC_GAIN  EXC_LVL  U_MAX    F_MAX SCAN_MODE
0     39742116.0  2004-04-04T23:28:50.021   39742259.0  2004-04-04T23:31:13.021  F-SEARCH        2        0        2   4.72  83706.4       DYN
1     72541228.0  2005-04-19T14:20:48.107   72541371.0  2005-04-19T14:23:11.107  F-SEARCH        2        0        2   1.56  83727.1       DYN
2     86771860.0  2005-10-01T07:18:02.464   86771878.0  2005-10-01T07:18:20.464  F-SEARCH        2        0        2   1.37  83751.4       DYN
3     87095860.0  2005-10-05T01:18:02.516   87095878.0  2005-10-05T01:18:20.516  F-SEARCH        2        0        2   1.27  83750.4       DYN
4    100121755.0  2006-03-04T19:36:18.629  100121773.0  2006-03-04T19:36:36.629  F-SEARCH        2        0        2   2.45  83759.4       DYN
...          ...                      ...          ...                      ...       ...      ...      ...      ...    ...      ...       ...
1337 432314286.0  2016-09-12T15:19:35.038  432315243.0  2016-09-12T15:35:32.038  SCANNING        0        0        0   0.00      0.0       DYN
1338 432315632.0  2016-09-12T15:42:01.038  432315787.0  2016-09-12T15:44:36.038  F-SEARCH        3        0        2  10.00  83630.2       DYN
1339 432317008.0  2016-09-12T16:04:57.039  432326768.0  2016-09-12T18:47:37.042  SCANNING        0        0        0   0.00      0.0       DYN
1340 432321711.0  2016-09-12T17:23:20.040  432321754.0  2016-09-12T17:24:03.040  F-SEARCH        3        0        2  10.00  83629.6       DYN
1341 432987271.0  2016-09-20T10:16:00.257  432987426.0  2016-09-20T10:18:35.257  F-SEARCH        3        0        2  10.00  83628.0       DYN

[1342 rows x 11 columns]
```
I didn't test the second use case yet, but the first looks great to me!
> @michaelaye I agree that there's a good chance that more sophisticated path management might be better addressed through a separate library. Could you point me to some of the pointy-end local data management stuff you're doing in planetarypy?
Nothing too fancy, but see for example my CTX classes here, where the general CTX class HAS an EDR class to refer to for source paths, while the CTX class manages a processing folder, with both related to a module/package-wide storage location defined in a configuration file: https://github.com/michaelaye/nbplanetary/blob/master/planetarypy/ctx.py#L145-L174
I still have some work to do to generalize this scheme enough that its basic features fit every instrument, with only minor adaptations between instruments.
Hmm, having tested it locally to great success, I removed the conda pdr package and pip installed the dev branch of pdr on DataLabs and cannot get it to work here :-/ (the Rosetta/OSIRIS fix from the parallel ticket works fine, so the correct version is installed etc.). This is linux rather than MacOS, and the path is different, but otherwise similar data/setup. Any ideas?
`pdr.Data` expects a sequence of strings or paths for the `search_paths` kwarg, so that example will probably work if you change it to:

```python
c2 = pdr.read(c2_file, search_paths=('/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/LABEL',))
```

Without the trailing comma, it's probably trying to check '/', 'm', 'e', 'd', etc. I will make this more permissive, because the Python gotcha where a single `str` inside parentheses is not cast to `tuple[str]` (as it would be to `list[str]` inside brackets) is very annoying.
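The gotcha is easy to reproduce in plain Python:

```python
# ('...') is just a parenthesized string, not a one-element tuple,
# so iterating over it yields individual characters.
not_a_tuple = ('/media/label')
assert list(not_a_tuple)[:3] == ['/', 'm', 'e']

# The trailing comma is what actually makes a tuple...
actual_tuple = ('/media/label',)
assert list(actual_tuple) == ['/media/label']

# ...while square brackets make a one-element list without one.
a_list = ['/media/label']
assert list(a_list) == ['/media/label']
```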
I'm guessing the first example isn't working because there's another 'data' component in the path, and I didn't consider that case in the little hack I wrote -- it probably checked '/media/label' and '/media/LABEL'. I'll adjust for this case as well.
Thank you for your continued testing!
@michaelaye thank you, I will check it out!
Thanks @m-stclair - yes, the trick with the search path worked fine for the second case. Ah yes, understood re the "data" in the pathname - unfortunately that's out of my control now. In any case, I can progress using the search path and am happy to test further tweaks as they become available :)
91f0604 adds sensible handling for `str` values of `search_paths`, and also changes the crude repository-root-finding so that it should work in tree structures like the one you're operating in. Let me know if this works for you!
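Sensible handling of bare strings presumably boils down to a normalization step along these lines (an illustrative sketch, not pdr's actual code; `normalize_search_paths` is a hypothetical name):

```python
from pathlib import Path

def normalize_search_paths(search_paths):
    """Accept a single str/Path or any iterable of them, and
    always return a list of paths (illustrative sketch)."""
    if isinstance(search_paths, (str, Path)):
        return [search_paths]
    return list(search_paths)
```

With this, `search_paths='/some/dir'` and `search_paths=('/some/dir',)` behave identically.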
Note that this will still create problems for archives with structures like:
...but there is only so much we can do!
(Of course, we could crawl up through every branch of the entire tree, but in some cases this could create dozens of extra checks per ancillary file, and `pdr` is used in several environments backed by slowish HDDs, so in many cases that would add a lot of fruitless I/O overhead.)
Thanks again @m-stclair! I confirm that passing a simple string for `search_paths` now works as well! I'm still struggling with the directory tree scanning, though, even though my scenario looks pretty standard:
```
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ ls -l
total 156
-rw-rw-r-- 1 202 50142   8115 Sep 10  2019 AAREADME.TXT
drwxrwxr-x 2 202 50142 110592 Sep 10  2019 BROWSE
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 CALIB
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 CATALOG
drwxrwxr-x 3 202 50142   8192 Sep 10  2019 DATA
drwxrwxr-x 3 202 50142   8192 Sep 10  2019 DOCUMENT
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 INDEX
drwxrwxr-x 2 202 50142   4096 Sep 10  2019 LABEL
-rw-rw-r-- 1 202 50142   3342 Sep 10  2019 VOLDESC.CAT
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ cd DATA/
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/DATA$ ls
CAH_0409522_1627308_01.LBL  CAH_0409522_1627308_08.LBL  CAH_0409522_1627308_15.LBL  TGH_0409522_1627308_04.TAB  TGH_0409522_1627308_11.TAB  TGH_0409522_1627308_18.TAB  TGH_0409522_1627308_25.TAB
CAH_0409522_1627308_01.TAB  CAH_0409522_1627308_08.TAB  CAH_0409522_1627308_15.TAB  TGH_0409522_1627308_05.LBL  TGH_0409522_1627308_12.LBL  TGH_0409522_1627308_19.LBL  TGH_0409522_1627308_33.LBL
CAH_0409522_1627308_02.LBL  CAH_0409522_1627308_09.LBL  CAH_0409522_1627308_16.LBL  TGH_0409522_1627308_05.TAB  TGH_0409522_1627308_12.TAB  TGH_0409522_1627308_19.TAB  TGH_0409522_1627308_33.TAB
CAH_0409522_1627308_02.TAB  CAH_0409522_1627308_09.TAB  CAH_0409522_1627308_16.TAB  TGH_0409522_1627308_06.LBL  TGH_0409522_1627308_13.LBL  TGH_0409522_1627308_20.LBL  TGH_0409522_1627308_34.LBL
CAH_0409522_1627308_03.LBL  CAH_0409522_1627308_10.LBL  IMG                         TGH_0409522_1627308_06.TAB  TGH_0409522_1627308_13.TAB  TGH_0409522_1627308_20.TAB  TGH_0409522_1627308_34.TAB
CAH_0409522_1627308_03.TAB  CAH_0409522_1627308_10.TAB  MID_PARTICLE_TABLE.LBL      TGH_0409522_1627308_07.LBL  TGH_0409522_1627308_14.LBL  TGH_0409522_1627308_21.LBL  TGH_0409522_1627308_35.LBL
CAH_0409522_1627308_04.LBL  CAH_0409522_1627308_11.LBL  MID_PARTICLE_TABLE.TAB      TGH_0409522_1627308_07.TAB  TGH_0409522_1627308_14.TAB  TGH_0409522_1627308_21.TAB  TGH_0409522_1627308_35.TAB
CAH_0409522_1627308_04.TAB  CAH_0409522_1627308_11.TAB  TGH_0409522_1627308_01.LBL  TGH_0409522_1627308_08.LBL  TGH_0409522_1627308_15.LBL  TGH_0409522_1627308_22.LBL  TGH_0409522_1627308_36.LBL
CAH_0409522_1627308_05.LBL  CAH_0409522_1627308_12.LBL  TGH_0409522_1627308_01.TAB  TGH_0409522_1627308_08.TAB  TGH_0409522_1627308_15.TAB  TGH_0409522_1627308_22.TAB  TGH_0409522_1627308_36.TAB
CAH_0409522_1627308_05.TAB  CAH_0409522_1627308_12.TAB  TGH_0409522_1627308_02.LBL  TGH_0409522_1627308_09.LBL  TGH_0409522_1627308_16.LBL  TGH_0409522_1627308_23.LBL  TGH_0409522_1627308_37.LBL
CAH_0409522_1627308_06.LBL  CAH_0409522_1627308_13.LBL  TGH_0409522_1627308_02.TAB  TGH_0409522_1627308_09.TAB  TGH_0409522_1627308_16.TAB  TGH_0409522_1627308_23.TAB  TGH_0409522_1627308_37.TAB
CAH_0409522_1627308_06.TAB  CAH_0409522_1627308_13.TAB  TGH_0409522_1627308_03.LBL  TGH_0409522_1627308_10.LBL  TGH_0409522_1627308_17.LBL  TGH_0409522_1627308_24.LBL  TGH_0409522_1627308_44.LBL
CAH_0409522_1627308_07.LBL  CAH_0409522_1627308_14.LBL  TGH_0409522_1627308_03.TAB  TGH_0409522_1627308_10.TAB  TGH_0409522_1627308_17.TAB  TGH_0409522_1627308_24.TAB  TGH_0409522_1627308_44.TAB
CAH_0409522_1627308_07.TAB  CAH_0409522_1627308_14.TAB  TGH_0409522_1627308_04.LBL  TGH_0409522_1627308_11.LBL  TGH_0409522_1627308_18.LBL  TGH_0409522_1627308_25.LBL
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0/DATA$ cd ..
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ find . -iname "*data*" -type d
./DATA
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ find . -iname "*label*" -type d
./LABEL
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$
```
hm, weird. I'm missing something here. Would it be possible for you to send me a minimal copy of this tree structure, containing all directories but just the files of interest?
Sure - I'm assuming you don't need me to go above the dataset root? (All of the Rosetta datasets are stored here at /media/data/rosetta.) Within the dataset in question we have:
```
(psa) mbentley@datalab-5a2b5c4f321e308f-77bbbd7b6-c2t8c:/media/data/rosetta/RO-C-MIDAS-5-PRL-TO-EXT3-V2.0$ tree -d
.
├── BROWSE
├── CALIB
├── CATALOG
├── DATA
│   ├── CAH_0409522_1627308_01.LBL
│   ├── CAH_0409522_1627308_01.TAB
│   └── IMG
├── DOCUMENT
│   └── CODE
├── INDEX
└── LABEL
    ├── CAH_STRUCTURE.FMT
    └── TGH_STRUCTURE.FMT
```
cool, thank you. I will investigate.
This should be fully addressed in ade6626612d76e717548e4948da922c24df1a286 (included in this morning's v0.7.2 release).
Thanks @m-stclair - do you know when conda-forge will have the 0.7.2 release? I still see the previous version through the conda CLI and at https://anaconda.org/conda-forge/pdr
Hi @msbentley,
Thanks for bringing that to our attention. It looks like our autoupdate conda bot hit a snag that had to be remedied. That's been fixed in our feedstock, so it should be updated on conda-forge in the next couple of hours.
Thanks! In the meantime I tested with code from the repo, and with this last update it works a treat.
Hi pdr team,
I know this is listed as an area for improvement, but just wanted to flag it because we have a use case that really relies on this. We are trying to deploy data tutorials for Rosetta using ESA DataLabs (basically hosting Jupyter Notebooks and similar close to the data, with read-only direct access to the archive). For this use case we really need pdr to either
It would be great to know what your thoughts are and/or any timescales, so that we can prioritise instruments that don't use format files above those that do until there is a clear path forward.
Thanks again!