IDR / omero-mkngff

Plugin to swap OMERO filesets with NGFF
GNU General Public License v2.0

sql performance #10

Closed will-moore closed 10 months ago

will-moore commented 11 months ago

Currently some big NGFF Plates take 1 or 2 hours for the mkngff sql command to complete. This is a blocker since some studies have hundreds of such plates.

Trying to gauge how long it should take to list the files in these filesets, which live on mounted S3 buckets...

Testing with Plate from idr0013 (380 Wells, single Field each): https://ome.github.io/ome-ngff-validator/?source=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/

Each Well has 8 non-chunk files - listing them is fast...

[wmoore@test120-omeroreadwrite ~]$ time find /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1 -name .z*
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/.zattrs
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/.zgroup
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/.zattrs
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/.zgroup
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/0/.zarray
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/1/.zarray
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/2/.zarray
/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/3/.zarray

real    0m0.364s
user    0m0.014s
sys     0m0.071s

Listing all .zattrs in the first row of 24 Wells takes between ~1 min and 1 min 43 secs. It is MUCH slower to walk the dirs...

[wmoore@test120-omeroreadwrite ~]$ time find /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/ -name .zattrs | wc
     48      48    5934

real    1m43.954s
user    0m0.473s
sys     0m1.440s

Similar time if we list ALL non-chunk files with -name .z*.

Listing all non-chunk files for the whole Plate is faster than expected!

[wmoore@test120-omeroreadwrite ~]$ time find /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/ -name .z*

real    8m13.544s
user    0m7.029s
sys     0m22.057s

Testing performance of walk() as used by mkngff with this walk.py test script...

import argparse
import sys
from pathlib import Path
from typing import Generator, Tuple
from datetime import datetime

start_time = datetime.now()

def walk(path: Path) -> Generator[Tuple[Path, str, str], None, None]:
    for p in path.iterdir():
        if not p.is_dir():
            print(p.parent, p.name, "application/octet-stream", datetime.now() - start_time)
            yield (p.parent, p.name, "application/octet-stream")
        else:
            if (p / ".zarray").exists() or (p / ".zgroup").exists():
                print(p.parent, p.name, "Directory", datetime.now() - start_time)
                yield (p.parent, p.name, "Directory")
                yield from walk(p)
            else:
                # Chunk directory
                print("chunk dir", p.parent, datetime.now() - start_time)
                continue

def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument('dir', help='Walk this dir')
    args = parser.parse_args(argv)
    p = Path(args.dir)
    list(walk(p))

if __name__ == '__main__':
    main(sys.argv[1:])

Then tested... walk() of A/1 took 2mins 5secs

$ python walk.py /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1
...
chunk dir /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1/0/3 0:02:05.053972

Then removed all prints to use time instead... - similar timing...

$ time python walk.py /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/A/1

real    2m18.606s
user    0m0.282s
sys     0m0.083s

Trying to walk the whole plate...

$ time python walk.py /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/

...

will-moore commented 11 months ago

Ran this overnight - still not completed this morning. Cancelled it...

$ time python walk.py /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/

^CTraceback (most recent call last):
  File "/usr/lib64/python3.6/pathlib.py", line 1336, in exists
    self.stat()
  File "/usr/lib64/python3.6/pathlib.py", line 1158, in stat
    return self._accessor.stat(self)
  File "/usr/lib64/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
FileNotFoundError: [Errno 2] No such file or directory: '/bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/I/1/0/2/61/.zarray'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "walk.py", line 30, in <module>
    main(sys.argv[1:])
  File "walk.py", line 27, in main
    list(walk(p))
  File "walk.py", line 17, in walk
    yield from walk(p)
  File "walk.py", line 17, in walk
    yield from walk(p)
  File "walk.py", line 17, in walk
    yield from walk(p)
  [Previous line repeated 1 more time]
  File "walk.py", line 15, in walk
    if (p / ".zarray").exists() or (p / ".zgroup").exists():
  File "/usr/lib64/python3.6/pathlib.py", line 1336, in exists
    self.stat()
KeyboardInterrupt

real    451m23.023s
user    0m42.620s
sys     0m11.756s

will-moore commented 11 months ago

When I cancelled the script above, it was checking for .zarr/I/1/0/2/61/.zarray. Maybe if a directory does have a .zarray then we shouldn't go any deeper?
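
To illustrate why stopping at a .zarray directory might help, here is a minimal sketch that counts the exists() checks made on a small synthetic tree, with and without descending into array directories. The helpers make_array(), count_checks() and demo() and the tree layout are hypothetical and not part of mkngff. The counting is also simplified: one check for .zarray, plus one for .zgroup only when .zarray is missing.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def make_array(root: Path, n_chunk_dirs: int) -> None:
    """Create a fake array dir: a .zarray file plus first-level chunk dirs."""
    root.mkdir(parents=True)
    (root / ".zarray").write_text("{}")
    for i in range(n_chunk_dirs):
        (root / str(i)).mkdir()
        (root / str(i) / "0").write_text("chunk")


def count_checks(path: Path, prune_arrays: bool) -> int:
    """Count exists() checks issued while walking `path`."""
    checks = 0
    for p in path.iterdir():
        if not p.is_dir():
            continue
        checks += 1  # the .zarray check
        is_array = (p / ".zarray").exists()
        is_group = False
        if not is_array:
            checks += 1  # the .zgroup check, only reached without .zarray
            is_group = (p / ".zgroup").exists()
        # Groups are always walked; arrays only when pruning is off.
        if is_group or (is_array and not prune_arrays):
            checks += count_checks(p, prune_arrays)
    return checks


def demo() -> Tuple[int, int]:
    """Return (checks without pruning, checks with pruning) for one array."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "plate.zarr"
        root.mkdir()
        (root / ".zgroup").write_text("{}")
        make_array(root / "0", n_chunk_dirs=5)
        return count_checks(root, False), count_checks(root, True)


if __name__ == "__main__":
    print(demo())
```

Without pruning, every first-level chunk directory costs two checks, so the total grows with the number of chunk directories. On a mounted bucket each check is a separate stat call.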

will-moore commented 11 months ago

With this walk():

def walk(path: Path) -> Generator[Tuple[Path, str, str], None, None]:
    for p in path.iterdir():
        if not p.is_dir():
            yield (p.parent, p.name, "application/octet-stream")
        else:
            if (p / ".zarray").exists() or (p / ".zgroup").exists():
                yield (p.parent, p.name, "Directory")
                # Don't try to walk zarray - will only contain chunks!
                if not (p / ".zarray").exists():
                    yield from walk(p)
            else:
                # Chunk directory
                continue

MUCH faster!

$ time python walk.py /bia-integrator-data/S-BIAD865/011c38fb-c3d0-4d1d-82d8-9147a5060d88/011c38fb-c3d0-4d1d-82d8-9147a5060d88.zarr/

real    5m6.883s
user    0m0.952s
sys     0m0.391s
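
One thing to note about the walk() above: because it no longer iterates inside array directories, it also stops yielding the .zarray file itself, which the earlier version did list. Whether mkngff needs those rows is an assumption here. If it does, a possible variant (a sketch only, with the hypothetical names walk_pruned() and demo()) checks .zarray once per directory and yields that file explicitly:

```python
import tempfile
from pathlib import Path
from typing import Generator, List, Tuple


def walk_pruned(path: Path) -> Generator[Tuple[Path, str, str], None, None]:
    for p in path.iterdir():
        if not p.is_dir():
            yield (p.parent, p.name, "application/octet-stream")
            continue
        is_array = (p / ".zarray").exists()  # single check, result reused
        if is_array:
            yield (p.parent, p.name, "Directory")
            # Assumption: the array's metadata file should still be listed,
            # even though the chunk directories are never iterated.
            yield (p, ".zarray", "application/octet-stream")
        elif (p / ".zgroup").exists():
            yield (p.parent, p.name, "Directory")
            yield from walk_pruned(p)
        # Otherwise: chunk directory, skip it.


def demo() -> List[str]:
    """Walk a tiny synthetic group holding one array; return relative paths."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "image.zarr"
        root.mkdir()
        (root / ".zgroup").write_text("{}")
        (root / ".zattrs").write_text("{}")
        (root / "0" / "0").mkdir(parents=True)
        (root / "0" / ".zarray").write_text("{}")
        (root / "0" / "0" / "0").write_text("chunk")
        return sorted(
            (parent / name).relative_to(root).as_posix()
            for parent, name, _ in walk_pruned(root)
        )


if __name__ == "__main__":
    print(demo())
```

This only restores .zarray. Any other metadata file stored alongside it in an array directory would still be skipped.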