NASA-PDS / registry

The PDS Registry provides the services and software applications necessary for tracking, searching, auditing, locating, and maintaining artifacts within the system. These artifacts range from data files and label files to schemas, dictionary definitions for objects and elements, services, and more.
https://nasa-pds.github.io/registry
Apache License 2.0

Harvest OREX dataset from SBN-PSI web #196

Closed tloubrieu-jpl closed 1 year ago

tloubrieu-jpl commented 1 year ago

💡 Description

Find the dataset on https://arcnav.psi.edu/urn:nasa:pds:orex.ovirs:data_calibrated

We should download all the products of this collection and harvest them in the EN production registry

The references to the labels and data files should still point to the SBN PSI web site: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/

jordanpadams commented 1 year ago

@rchenatjpl can you help us load this data into our EN registry? we will eventually delete it, but we want to have this loaded for some demo purposes.

would you be able to help us out here? when running harvest, we want to load from our machine, but we should make sure the URL points to their data on their servers at https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/.

rchenatjpl commented 1 year ago

Sure. To be clear, that collection seems to have 1.6 million files, and they're downloading very slowly. If anyone knows a better way than wget, please say so. And sorry, I've forgotten (correct me if I'm wrong), but the way to point to the PSI web site is to change this in the config file:

  <fileInfo processDataFiles="true" storeLabels="true">
    <fileRef replacePrefix="/path/to/archive" with="https://url/to/archive/" />
  </fileInfo>
jordanpadams commented 1 year ago

@rchenatjpl yeah... it is going to be very slow unfortunately. wget is all I know.
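
for what it's worth, the command I would probably run is something along these lines (the flags are just one reasonable guess, not a vetted recipe):

    # mirror the SBN-PSI directory tree; --no-parent keeps wget inside data_calibrated/
    wget --mirror --no-parent --continue --wait=0.5 \
         https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/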

per the config file, that is correct! I think it will be something like:

<fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
jordanpadams commented 1 year ago

once you try to register the data, the ops:Data_File_Info/ops:file_ref should have valid URLs to SBN data.
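
once it is registered, a quick spot check could look something like this (the lidvid is purely illustrative, and grepping the raw JSON response is just a lazy shortcut):

    # fetch one registered product and eyeball its ops:Data_File_Info/ops:file_ref URLs
    curl -s "https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0" \
        | grep -o 'https://sbnarchive[^"]*'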

rchenatjpl commented 1 year ago

@jordanpadams I need more disk space. I believe I'm responsible for killing https://pds.nasa.gov earlier. I freed up a little by moving two directories to /tmp on pdscloud-prod1, but I think I'll need more. See https://itsd-jira.jpl.nasa.gov/servicedesk/customer/portal/16/DSIO-3936

rchenatjpl commented 1 year ago

This one collection is enormous. Should I harvest it in pieces? Does harvest check against the collection.csv?

tloubrieu-jpl commented 1 year ago

Hi @rchenatjpl @jordanpadams, we could use our scalable harvest service for that job. @rchenatjpl, let me know where that should be deployed? I will help you with that. It is a different version of harvest that is meant to work on larger sets of files.

tloubrieu-jpl commented 1 year ago

Actually @jordanpadams, we could ask @sjoshi-jpl to deploy the scalable harvest on AWS ECS to be able to scale it up and run parallel harvests. That could be a good demo for other nodes. The deployment might also be reused for nucleus/css.

jordanpadams commented 1 year ago

@tloubrieu-jpl we should maybe chat about this offline. architecturally, is scalable harvest really built for the cloud? the way the services are built, they don't seem to be designed for a serverless environment. I may be wrong. This may require some rethinking of how to deploy this.

also, I actually think this would be a great benchmark test for the standalone harvest. thoughts?

tloubrieu-jpl commented 1 year ago

@jordanpadams Whatever works will be good since the priority is to have these data ingested, and you are right that using scalable harvest adds some unnecessary risk. We can discuss offline whether we should try that, but maybe not for this ticket.

I remember using standalone harvest for these data 1 or 2 years ago, and I created a Python script to split the input and parallelize harvest. But we can make a first attempt using standalone harvest as-is on the full collection and see what happens.

rchenatjpl commented 1 year ago

@jordanpadams @tloubrieu-jpl Holy cow, how do we feel about errors? I'm going to plow ahead regardless. I'm finding duplicate lines in the massive file https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/collection_inventory_ovirs_data_calibrated.csv

    % grep 20181102t040122s658_ovr_spacel2 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
    P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
    P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
    P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2_calv2.fits::2.0
    % wc data_calibrated/collection_inventory_ovirs_data_calibrated.csv
     1597353 1597353 137482254 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
    % sort data_calibrated/collection_inventory_ovirs_data_calibrated.csv | uniq | wc
     1169346 1169346 101731008

tloubrieu-jpl commented 1 year ago

Let's assume harvest does not care. You can try to harvest the collection as-is. But I guess we should tell SBN-PSI about that.

@rchenatjpl have you been able to download the full collection yet?

tloubrieu-jpl commented 1 year ago

Oh, but it looks like about 30% of it is duplicated. Am I reading your wc results correctly? For performance purposes we might gain some time if we clean that file up before harvest runs on it.
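
If we do clean it, a simple dedup along these lines should be enough (the output file name is arbitrary):

    # drop duplicate inventory rows before pointing harvest at the collection
    sort -u collection_inventory_ovirs_data_calibrated.csv > collection_inventory_ovirs_data_calibrated.dedup.csv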

rchenatjpl commented 1 year ago

@tloubrieu-jpl @jordanpadams To be sure I'm doing something reasonable: I'm downloading parts of the collection, harvesting, then deleting those files to make room for more parts. I am replacing the prefix of the path with PSI's web site while harvesting. I have not approved any yet. If this is the wrong approach, please let me know soon. Thanks
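
Concretely, the loop I'm running is roughly this (the paths, per-part URL lists, and config name are placeholders; the config carries the fileRef prefix replacement discussed above):

    # download a chunk, harvest it, then free the space for the next chunk
    for part in urls_part_*.txt; do
        wget --continue --input-file="$part" --directory-prefix=/path/to/staging/
        harvest -c conf/orex_ovirs_harvest.xml   # assumes the standalone harvest CLI and its -c config flag
        rm -rf /path/to/staging/*
    done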

tloubrieu-jpl commented 1 year ago

@rchenatjpl that looks reasonable to me, but you would spare yourself some pain if you had more disk space. Where are you downloading the data? On pdscloud-prod?

rchenatjpl commented 1 year ago

Thanks, Thomas. I've been downloading onto the production machine. du -k so far says 453514192, which is 453GB, which doesn't seem like that much, but I think Andrew or someone said he increased the disk space for $DATA_HOME to 350GB. I've killed the production machine twice, which is still affecting my other work. I also have more to ingest. OMG, I'm looking at Carol's email now, and her total is 1206GB. The numbers from her individual directories often don't match what I downloaded, sometimes off by 2x, sometimes by something else. Eh, I'll just keep doing what I'm doing.

rchenatjpl commented 1 year ago

@tloubrieu-jpl @jordanpadams I may be done. I hope I harvested 1169346 labels. If being precise matters, is there a way to dump all the LIDs that start with urn:nasa:pds:orex.ovirs:data_calibrated:? I still wouldn't be able to give you an ironclad guarantee that the VIDs match.
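
If a raw OpenSearch query is fair game, something like this prefix count is what I have in mind (the index name and field name are guesses on my part, not verified):

    # count products whose lid starts with the collection prefix (index/field names assumed)
    curl -s -u "$ES_USER:$ES_PASS" -H 'Content-Type: application/json' \
        "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count" \
        -d '{"query":{"prefix":{"lid":"urn:nasa:pds:orex.ovirs:data_calibrated:"}}}'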

tloubrieu-jpl commented 1 year ago

That is great @rchenatjpl, I was not able to find the collection itself yet, but I was able to see at least one of the observational products. We will need to change the status of the collection from staged to archived as well. I will do more investigation tonight hopefully and I'll let you know what remains to be done.

Thanks !

tloubrieu-jpl commented 1 year ago

@rchenatjpl ,

The number of products whose lid starts with urn:nasa:pds:orex.ovirs:data_calibrated: is 1170078, which sounds perfect.

I confirm that I don't see the collection itself (with lid=urn:nasa:pds:orex.ovirs:data_calibrated). It is not in the registry.

Could you add it? I guess when you loaded the products in parts, you missed it.

Once that is done, you will be able to switch the archive status for the full collection with a single registry-mgr command:

    ./registry-manager set-archive-status -status archived -lidvid {the lidvid of the collection} -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth ...
rchenatjpl commented 1 year ago

I ingested collection* then tried to change the archive_status. Maybe it worked?

    [pds4@pdscloud-prod1 test]$ registry-manager set-archive-status -status archived -lidvid urn:nasa:pds:orex.ovirs:data_calibrated::11.0 -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth auth.txt
    [INFO] Setting product status. LIDVID = urn:nasa:pds:orex.ovirs:data_calibrated::11.0, status = archived
    [INFO] Setting status of primary references from collection inventory
    [ERROR] 10,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]

The collection LIDVID urn:nasa:pds:orex.ovirs:data_calibrated::11.0 shows ops:Tracking_Meta/ops:archive_status = "archived", as does one lower-level product, but I don't know whether all of them got changed to "archived".
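
If it would help, a count along these lines might show whether the status propagated to all members (again, the index and exact field names here are my guesses, not verified):

    # how many products under the collection prefix now show archive_status = archived?
    curl -s -u "$ES_USER:$ES_PASS" -H 'Content-Type: application/json' \
        "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count" \
        -d '{"query":{"bool":{"must":[{"prefix":{"lid":"urn:nasa:pds:orex.ovirs:data_calibrated:"}},{"match":{"ops:Tracking_Meta/ops:archive_status":"archived"}}]}}}'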

tloubrieu-jpl commented 1 year ago

Thanks very much @rchenatjpl, we can see the collection and its members in the registry API now. See https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated

jordanpadams commented 1 year ago

@tloubrieu-jpl are we sure everything was loaded? That timeout on connection worries me...

Also, new requirement for registry-mgr fault tolerance :-)