@rchenatjpl can you help us load this data into our EN registry? We will eventually delete it, but we want to have it loaded for some demo purposes.
Would you be able to help us out here? When running harvest, we want to load from our machine, but we should make sure the URLs point to their data on their servers at https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/.
Sure. To be clear, that collection seems to have 1.6 million files, and they're downloading very slowly. If anyone knows a better way than wget, please say so. And sorry, I've forgotten (correct me if I'm wrong), but the way to point to the PSI web site is to change this in the config file:
<fileInfo processDataFiles="true" storeLabels="true">
<fileRef replacePrefix="/path/to/archive" with="https://url/to/archive/" />
</fileInfo>
@rchenatjpl yeah... it is going to be very slow, unfortunately. wget is all I know.
Per the config file, that is correct! I think it will be something like:
<fileRef replacePrefix="/path/to/data/pds4/test-data/registry/" with="https://sbnarchive.psi.edu/pds4/orex/" />
Once you register the data, the ops:Data_File_Info/ops:file_ref values should have valid URLs to the SBN data.
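For what it's worth, a recursive wget along these lines is probably the simplest way to mirror the tree; the flags and the local target directory are illustrative, not the exact command used here (with --cut-dirs=2 the files land under orex.ovirs/data_calibrated/ beneath the target directory, which lines up with the replacePrefix above):
wget --recursive --no-parent --no-host-directories --cut-dirs=2 --continue \
     --reject "index.html*" \
     -P /path/to/data/pds4/test-data/registry/ \
     https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/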
@jordanpadams I need more disk space. I believe I'm responsible for killing https://pds.nasa.gov earlier. I freed up a little by moving two directories to /tmp on pdscloud-prod1, but I think I'll need more. See https://itsd-jira.jpl.nasa.gov/servicedesk/customer/portal/16/DSIO-3936
This one collection is enormous. Should I harvest it in pieces? Does harvest check against the collection.csv?
Hi @rchenatjpl @jordanpadams, we could use our scalable harvest service for that job. @rchenatjpl, let me know where that should be deployed? I will help you with that. It is a different version of harvest which is meant to work on larger sets of files.
Actually @jordanpadams, we could ask @sjoshi-jpl to deploy scalable harvest on AWS ECS so we can scale it up and run parallel harvests. That could be a good demo for other nodes. The deployment might also be reused for nucleus/css.
@tloubrieu-jpl we should maybe chat about this offline. Architecturally, is scalable harvest really built for the cloud? The way the services are built, they don't seem to be designed for a serverless environment? I may be wrong. This may require some rethinking of how to deploy it.
Also, I actually think this would be a great benchmark test for the standalone harvest? Thoughts?
@jordanpadams Whatever works will be good, since the priority is to have these data ingested, and you are right that using scalable harvest adds some unnecessary risk. We can discuss offline whether we should try that, but maybe not for this ticket.
I remember using standalone harvest on these data a year or two ago, and I created a Python script to split the input and parallelize harvest. But we can make a first attempt with standalone harvest as-is on the full collection and see what happens.
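If we end up splitting it again, a rough shell sketch of that approach would be one harvest run per subdirectory, driven by a templated config. This is only a sketch: harvest's -c option is the usual config flag, but the template file, the __PATH__ placeholder, and the directory layout below are illustrative assumptions.
# harvest-template.xml is a copy of the normal harvest config with the
# string __PATH__ in place of the <path>...</path> value to load.
# (In practice you would also cap the number of concurrent runs.)
DATA_ROOT=/path/to/data/pds4/test-data/registry/orex.ovirs/data_calibrated
for dir in "$DATA_ROOT"/*/ ; do
  name=$(basename "$dir")
  sed "s|__PATH__|$dir|" harvest-template.xml > "harvest-$name.xml"
  harvest -c "harvest-$name.xml" > "harvest-$name.log" 2>&1 &
done
wait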
@jordanpadams @tloubrieu-jpl Holy cow, how do we feel about errors? I'm going to plow ahead regardless. I'm finding duplicate lines in the massive file https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/collection_inventory_ovirs_data_calibrated.csv
% grep 20181102t040122s658_ovr_spacel2 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2.fits::1.0
P,urn:nasa:pds:orex.ovirs:data_calibrated:20181102t040122s658_ovr_spacel2_calv2.fits::2.0
% wc data_calibrated/collection_inventory_ovirs_data_calibrated.csv
1597353 1597353 137482254 data_calibrated/collection_inventory_ovirs_data_calibrated.csv
% sort data_calibrated/collection_inventory_ovirs_data_calibrated.csv | uniq | wc
1169346 1169346 101731008
Let's assume harvest does not care. You can try to harvest the collection as-is. But I guess we should tell SBN-PSI about that.
@rchenatjpl have you been able to download the full collection yet?
Oh, but that means something like 30% of the lines are duplicated, if I am reading your wc results correctly. For performance, we might save some time by cleaning that file up before harvest runs on it.
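If we do clean it up, something like this keeps the first occurrence of each line without reordering the inventory (filenames are illustrative):
awk '!seen[$0]++' collection_inventory_ovirs_data_calibrated.csv > collection_inventory_dedup.csv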
@tloubrieu-jpl @jordanpadams To be sure I'm doing something reasonable: I'm downloading parts of the collection, harvesting, then deleting those files to make room for more parts. I am replacing the prefix of the path with PSI's web site while harvesting. I have not approved any yet. If this is the wrong approach, please let me know soon. Thanks
@rchenatjpl that looks reasonable to me, but you would spare yourself some pain if you had more disk space. Where are you downloading the data? On pdscloud-prod?
Thanks, Thomas. I've been downloading onto the production machine. du -k so far says 453514192, which is about 453 GB, which doesn't seem like that much, but I think Andrew or someone said he increased the disk space for $DATA_HOME to 350 GB. I've killed the production machine twice, which is still affecting my other work. I also have more to ingest. OMG, I'm looking at Carol's email now, and her total is 1206 GB. The numbers from her individual directories often don't match what I downloaded, sometimes off by 2x, sometimes by something else. Eh, I'll just keep doing what I'm doing.
@tloubrieu-jpl @jordanpadams I may be done. I hope I harvested 1169346 labels. If being precise matters, is there a way to dump all the LIDs that start with urn:nasa:pds:orex.ovirs:data_calibrated:? I still wouldn't be able to give you an ironclad guarantee that the VIDs match.
That is great @rchenatjpl , I was not able to find the collection itself yet but was able to see at least one of the observational products.
We will need to change the status of the collection from staged to archived as well. I will do more investigation tonight hopefully and I'll let you know what remains to be done.
Thanks!
@rchenatjpl, the number of products whose LID starts with urn:nasa:pds:orex.ovirs:data_calibrated: is 1170078, which sounds perfect.
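(For reference, a prefix count like that can be pulled straight from the registry OpenSearch endpoint; the index name registry and the field name lid below are assumptions from memory and may differ in our deployment:)
curl -s -u "$ES_USER:$ES_PASS" \
  "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count" \
  -H "Content-Type: application/json" \
  -d '{"query": {"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}}}'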
I confirm that I don't see the collection itself (with lid=urn:nasa:pds:orex.ovirs:data_calibrated); it is not in the registry. Could you add it? I guess when you loaded the products in parts, you missed it.
Once that is done, you will be able to switch the archive status for the full collection with a single registry-mgr command:
./registry-manager set-archive-status -status archived -lidvid {the lidvid of the collection} -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth ...
I ingested collection* then tried to change the archive_status. Maybe it worked?
[pds4@pdscloud-prod1 test]$ registry-manager set-archive-status -status archived -lidvid urn:nasa:pds:orex.ovirs:data_calibrated::11.0 -es https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443 -auth auth.txt
[INFO] Setting product status. LIDVID = urn:nasa:pds:orex.ovirs:data_calibrated::11.0, status = archived
[INFO] Setting status of primary references from collection inventory
[ERROR] 10,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
[pds4@pdscloud-prod1 test]$
The collection LIDVID urn:nasa:pds:orex.ovirs:data_calibrated::11.0 shows ops:Tracking_Meta/ops:archive_status = "archived", as does one lower-level product, but I don't know whether all of them got changed to "archived".
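(One hedged way to check would be to count the members that already show the archived status and compare against the inventory size; as above, the index name registry and the exact field mapping are assumptions:)
curl -s -u "$ES_USER:$ES_PASS" \
  "https://search-en-prod-di7dor7quy7qwv3husi2wt5tde.us-west-2.es.amazonaws.com:443/registry/_count" \
  -H "Content-Type: application/json" \
  -d '{"query": {"bool": {"must": [
        {"prefix": {"lid": "urn:nasa:pds:orex.ovirs:data_calibrated:"}},
        {"term": {"ops:Tracking_Meta/ops:archive_status": "archived"}}]}}}'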
Thanks very much @rchenatjpl we can see the collection and its members in the registry-api now. See https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated
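(For a quick spot check from outside, the public API can also be asked for the collection's members; the /members endpoint and the limit parameter below are from memory and may differ:)
curl -s "https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:orex.ovirs:data_calibrated/members?limit=1"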
@tloubrieu-jpl are we sure everything was loaded? That timeout on connection worries me...
Also, new requirement for registry-mgr fault tolerance :-)
💡 Description
Find the dataset on https://arcnav.psi.edu/urn:nasa:pds:orex.ovirs:data_calibrated
We should download all the products of this collection and harvest them into the EN production registry.
The references to the labels and data files should still point to the SBN PSI web site: https://sbnarchive.psi.edu/pds4/orex/orex.ovirs/data_calibrated/