IDR / omero-mkngff

Plugin to swap OMERO filesets with NGFF
GNU General Public License v2.0

Add clientpath to Filesets #12

Closed will-moore closed 9 months ago

will-moore commented 10 months ago

Since existing FilesetEntry.clientpath values are set to unknown for mkngff Filesets, and we also don't have any reference to the original source of the data, we can set this value to something more useful.

This PR adds a --clientpath option which is a path or URL to the Fileset e.g. https://s3-server/bucket/data.zarr that corresponds to the mounted s3 Fileset /dir/path/to/data.zarr. This enables the creation of a clientpath for every file found under the mounted Fileset.
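To illustrate the mapping described above, here is a minimal Python sketch of how a clientpath could be derived for every file under a mounted Fileset. It is not the plugin's actual code: the function name `iter_clientpaths` and its arguments are hypothetical, and it assumes the directory layout under the mount mirrors the layout under the URL.

```python
import os
from typing import Iterator, Tuple


def iter_clientpaths(fileset_dir: str, clientpath: str) -> Iterator[Tuple[str, str]]:
    """Yield (relative path, clientpath URL) for every file under fileset_dir.

    Hypothetical helper: assumes the mounted directory and the URL given
    as `clientpath` point at the same .zarr hierarchy.
    """
    base = clientpath.rstrip("/")
    for dirpath, dirnames, filenames in os.walk(fileset_dir):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            # Path relative to the Fileset root, always with forward slashes
            rel = os.path.relpath(full, fileset_dir).replace(os.sep, "/")
            yield rel, f"{base}/{rel}"
```

Calling `iter_clientpaths("/dir/path/to/data.zarr", "https://s3-server/bucket/data.zarr")` would then pair each file under the mount with its URL under the bucket.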

E.g.

$ omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

This creates sql output with a 4th clientpath item in each sql ROW. If the --clientpath option is not used as above then the placeholder unknown is added to each ROW in the sql, which results in the same outcome as before.
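As a rough sketch of the row format described above (not the plugin's real implementation), each ROW could be built with the clientpath as its 4th item, falling back to the placeholder `unknown` when no clientpath is given. The function name `sql_row` is hypothetical.

```python
from typing import Optional


def sql_row(path: str, name: str, mimetype: str,
            clientpath: Optional[str] = None) -> str:
    """Format one array row for the generated sql (illustrative only).

    The 4th item is the clientpath, or 'unknown' if none was supplied.
    """
    values = [path, name, mimetype, clientpath if clientpath else "unknown"]
    # Double any single quotes so each value stays a valid SQL string literal
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"[{quoted}]"
```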

Tested at https://github.com/IDR/idr-utils/pull/56#issuecomment-1765313452

will-moore commented 10 months ago

idr0004,Screen:202,S-BIAD867
idr0010,Screen:1351,S-BIAD885
idr0011,Screen:1501,S-BIAD866
idr0011,Screen:1551,S-BIAD866
idr0011,Screen:1601,S-BIAD866
idr0011,Screen:1602,S-BIAD866
idr0011,Screen:1603,S-BIAD866
idr0012,Screen:1202,S-BIAD845
idr0013,Screen:1101,S-BIAD865
idr0013,Screen:1302,S-BIAD865
idr0015,Screen:1201,S-BIAD861
idr0016,Screen:1251,S-BIAD851
idr0025,Screen:1851,S-BIAD846
idr0026,Project:301,S-BIAD860
idr0033,Screen:1751,S-BIAD848
idr0035,Screen:2001,S-BIAD847
idr0036,Screen:1952,S-BIAD855
idr0051,Project:552,S-BIAD815
idr0054,Project:701,S-BIAD800
idr0090,Screen:2851,S-BIAD882
idr0091,Dataset:1351,S-BIAD852

pip install 'omero-mkngff @ git+https://github.com/will-moore/omero-mkngff@clientpath'
# 1 plate from idr0004
omero mkngff clientpath Plate:1751 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

# all of idr0004
omero mkngff clientpath Screen:202 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

# every row of the csv above (saved as ngff_filesets.csv)
for r in $(cat ngff_filesets.csv); do
  # column 2 is the target (e.g. Screen:202), column 3 is the S-BIAD id
  target=$(echo "$r" | cut -d',' -f2)
  biad=$(echo "$r" | cut -d',' -f3)
  echo "$target"
  omero mkngff clientpath "$target" "https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/$biad/"
done

will-moore commented 10 months ago

After running for nearly 8 hours, we are 14 plates into idr0012 (approx. 400 plates done), so it will be at least another day before this is complete! This seems the wrong way to go when we've only just generated the filesets.

@joshmoore I wonder if we could teach the sql function mkngff_fileset() to populate the clientpath as in the description above? The trouble is that we don't want to regenerate all the sql files from scratch, although we could add in the base URL for a Fileset e.g. https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr into the parameter list.

Then, for each row in the array, e.g.

['demo_2/2015-10/01/07-25-30.185_mkngff/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/', '.zarray', 'application/octet-stream'],

we'd need to be able to generate the clientpath within the mkngff_fileset() function, possibly using .zarr to split the path here to get the relative path A/10/0/3, to add to the base URL along with the name to get: e.g. https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/.zarray
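The string handling itself is small. As a sketch only, here it is in Python rather than SQL; the function name `derive_clientpath` is hypothetical and it assumes `.zarr/` occurs exactly once in the path. The same steps would need to be expressed inside mkngff_fileset().

```python
def derive_clientpath(base_url: str, path: str, name: str) -> str:
    """Build a clientpath from a row's path and name (illustrative only).

    Splits `path` at the first '.zarr/' to get the part relative to the
    Fileset root, then appends that and `name` to `base_url`.
    """
    head, sep, rel = path.partition(".zarr/")
    if not sep:
        raise ValueError(f"no '.zarr/' found in path: {path!r}")
    # rel is '' for files at the root, or e.g. 'A/10/0/3/' for nested ones
    return base_url.rstrip("/") + "/" + rel + name
```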

Is that possible within sql language?

will-moore commented 10 months ago

Still running...

Fileset 6312826
tosave 3061
Fileset 6312697
tosave 3061

This is taking 3 minutes per Fileset just now....

will-moore commented 10 months ago

get_filesets Screen:1251
Fileset 6313488
tosave 14610

joshmoore commented 10 months ago

> Is that possible within sql language?

I'm not sure I fully understand, but in general you can do anything with SQL, if slightly more verbosely.

I like your idea of templating the output, but there would still need to be checks for the existence of the files, no?

will-moore commented 10 months ago

Having experimented with doing this in the mkngff_fileset() function in the setup.sql script, I have given up. Instead, I'm going to simply pass the clientpath as a 4th item in each row that creates an OriginalFile.

This also means that we don't need the complex logic to resolve clientpath from path and name.

e.g.

$ omero mkngff sql 1591301 --clientpath="https://s3/path/to/image.zarr" /path/to/data/6001247.zarr

Found prefix: demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023 for fileset: 1591301

UPDATE pixels SET name = '.zattrs', path = 'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr' where image in (select id from Image where fileset = 1591301);

begin;
    select mkngff_fileset(
      1591301,
      'SECRETUUID',
      'cdf35825-def1-4580-8d0b-9c349b8f78d6',
      'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/',
      array[
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zattrs', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zattrs'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zgroup', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zgroup'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/0/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/0/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/1/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/1/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/2/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/2/.zarray']
      ]::text[][]
    );
commit;

will-moore commented 10 months ago

Tested at https://github.com/IDR/idr-utils/pull/56#issuecomment-1765313452 with:

omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

Re: @joshmoore "checks for the existence of the files" - I'm not sure what you mean. In that example the clientpath values point to files under https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr, but we don't check that they exist.
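If an existence check were wanted, one possible approach (not something this PR does) would be a HEAD request per clientpath. A minimal sketch using only the Python standard library follows; the helper name `url_exists` is hypothetical, and it assumes the server answers HEAD requests for these objects.

```python
import urllib.request


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds with a 2xx status.

    Illustrative only: any HTTP error (e.g. 404) or connection failure
    is reported as False.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # urllib.error.URLError and HTTPError are subclasses of OSError
        return False
```

With tens of thousands of files per Fileset, checking every URL this way would add significant time, so sampling a few files per Fileset may be more practical.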