aodn / python-aodncore

The starting point for any new pipeline handler development containing all of the core functionality of each handler
https://aodn.github.io/python-aodncore/
GNU General Public License v3.0

add `*.rm_objects_manifest` to aodn core #149

Closed lbesnard closed 3 years ago

lbesnard commented 5 years ago

Add a new file extension to manifest files named *.rm_objects_manifest

This manifest file would contain the S3 object paths of the files to remove from S3, such as IMOS/SRS/Surface-Waves/Wave-Wind-Altimetry-DM00/TOPEX/060N_220E/IMOS_SRS-Surface-Waves_MW_TOPEX_FV02_066N-236E-DM00.nc, and the files would also be unharvested from the DB. The full path would be required so that the handler doesn't need to rely on dest_path or the physical files to work out the path of the files to remove.

All pipelines could then use it by adding a simple `"allowed_extensions": [".rm_objects_manifest"]` in chef-private.

I believe the Surface Altimeter pipeline would need this functionality, as we have 50000+ files to unharvest.
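To make the proposed file format concrete, here is a minimal sketch of how such a manifest might be read. It assumes one S3 object key per line; the function name `parse_rm_manifest`, the handling of `#` comment lines, and the rejection of keys with a leading slash are all hypothetical choices for illustration, not part of aodncore.

```python
from typing import Iterable, List


def parse_rm_manifest(lines: Iterable[str]) -> List[str]:
    """Return the unique, non-empty object keys listed in a manifest, in file order."""
    keys: List[str] = []
    seen = set()
    for raw in lines:
        key = raw.strip()
        # Skip blank lines and '#' comment lines (comment support is an assumption).
        if not key or key.startswith("#"):
            continue
        # Assumption: keys are full object paths relative to the bucket root,
        # so a leading slash is treated as a mistake rather than silently fixed.
        if key.startswith("/"):
            raise ValueError(f"expected a relative S3 key, got: {key!r}")
        # Drop duplicates while preserving the original order.
        if key not in seen:
            seen.add(key)
            keys.append(key)
    return keys
```

With a file on disk this could be called as `parse_rm_manifest(open(path))`, since a text file object iterates over its lines.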

ggalibert commented 5 years ago

Not sure we need code in aodncore to handle this (so far) one-off situation. At this stage I think we can first try to identify the full URL of the file by looking in the DB.

Not sure if that would work OK for a large number of files in terms of performance, but as a one-off solution I'm thinking about something like:

SELECT file_url
FROM srs_surface_waves.srs_surface_waves_map
WHERE file_url LIKE ANY(ARRAY['%file1.nc', '%file2.nc', '%file3.nc']); -- etc.

lbesnard commented 5 years ago

Identifying the URL is not the issue. The issue is how to unindex the files and remove them from S3 in a timely manner, as calling the harvester for each file would take far too much time.

This feature could also be used by other harvesters every now and then
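As a rough illustration of why handling the whole list at once is cheaper than one call per file: the S3 DeleteObjects API accepts up to 1000 keys per request, so 50000+ keys fit in a few dozen requests. The sketch below only builds the request payloads, in the shape taken by the `Delete` argument of boto3's `delete_objects`; the function name `delete_batches` is hypothetical, and this is not necessarily how aodncore performs deletions.

```python
from typing import Dict, Iterator, Sequence

MAX_KEYS_PER_REQUEST = 1000  # documented limit of the S3 DeleteObjects API


def delete_batches(keys: Sequence[str],
                   batch_size: int = MAX_KEYS_PER_REQUEST) -> Iterator[Dict]:
    """Yield one DeleteObjects payload per chunk of at most batch_size keys."""
    if not 1 <= batch_size <= MAX_KEYS_PER_REQUEST:
        raise ValueError("batch_size must be between 1 and 1000")
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        # "Quiet" asks S3 to report only the keys that failed to delete.
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}
```

Each payload could then be passed as `client.delete_objects(Bucket=bucket, Delete=payload)`, where `client` and `bucket` would come from the pipeline's own configuration.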

mhidas commented 5 years ago

Not sure if it's the best approach, but what @lbesnard is proposing might also be needed for #53, i.e. enable "manual" deletion of a list of files by a PO. I still use the old function to get rid of duplicates.

I don't think it would be a good idea to allow this sort of manifest file to be uploaded by external users.

bpasquer commented 5 years ago

A feature like what @lbesnard is suggesting would be good. It is something needed in the SOOP_CO2_RT workflow (at a much smaller scale though, ~500 files) to manually delete RT files upon reception of the related DM dataset. But I agree with @mhidas that such a feature should only be available to POs, not external users.

ghost commented 3 years ago

Feature has been added to code, discussion to be continued elsewhere re: configuration considerations.