IDR / idr-metadata

Curated metadata for all studies published in the Image Data Resource
https://idr.openmicroscopy.org

goofys needs remounting #671

Open will-moore opened 1 year ago

will-moore commented 1 year ago

All the BioStudies s3 data imported to idr0125-pilot is currently giving ResourceError when trying to view images. This is due to a failure of the goofys-mounted BioStudies s3 bucket.

$ ls /bia-integrator-data 
ls: cannot access /bia-integrator-data: Transport endpoint is not connected

See https://github.com/kahing/goofys/issues/208 Advice is to unmount and re-mount.

Tried un-mounting as at https://github.com/kahing/goofys/issues/77

As omero-server user:

$ fusermount -u /bia-integrator-data 
fusermount: entry for /bia-integrator-data not found in /etc/mtab

This is currently a blocker for my NGFF update work on idr0125-pilot (I can use idr0138-pilot in the meantime), but it also raises the question of how to detect and fix this once we start using it on the production IDR server.
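One way to detect this state is to check whether the mount point can still be listed, since the failure above shows up as `ls` returning an error. The following is only a minimal sketch: `check_mount` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption to avoid blocking on a hung mount.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the directory can be listed, 1 otherwise.
# A dead FUSE mount makes `ls` fail, e.g. with
# "Transport endpoint is not connected".
check_mount() {
    local mount_point="$1"
    # `timeout` (coreutils) stops the check from hanging indefinitely.
    if timeout 10 ls "$mount_point" > /dev/null 2>&1; then
        return 0
    fi
    return 1
}
```

For example, `check_mount /bia-integrator-data || echo "mount is not readable"` could be run by hand or from cron.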

cc @sbesson @joshmoore

sbesson commented 1 year ago

The logs give more clues about what happened:

Aug 31 17:01:05 pilot-idr0125-omeroreadwrite systemd: Removed slice User Slice of root.
Aug 31 17:03:44 pilot-idr0125-omeroreadwrite kernel: gunicorn invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: gunicorn cpuset=/ mems_allowed=0
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: CPU: 14 PID: 29050 Comm: gunicorn Kdump: loaded Not tainted 3.10.0-1160.45.1.el7.x86_64 #1
...
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Out of memory: Kill process 12081 (goofys) score 556 or sacrifice child
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Killed process 12081 (goofys), UID 0, total-vm:37768132kB, anon-rss:37656340kB, file-rss:0kB, shmem-rss:0

The mount process was killed because the system ran out of memory. With a view to using this in production, I agree that such a failure should definitely be reported ASAP. Unless someone suggests a different option, I'll look into adding a monitoring endpoint.
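To spot this kind of event without reading the whole log, the kernel's "Killed process" lines can be filtered for the process name. This is a rough sketch: `oom_killed` is a hypothetical helper that reads log lines on stdin, and the log location in the usage note is an assumption about this host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read kernel log lines on stdin and print the name
# of every process reported as killed by the OOM killer.
oom_killed() {
    # Matches lines such as:
    #   kernel: Killed process 12081 (goofys), UID 0, ...
    sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p'
}
```

Assuming the kernel messages end up in /var/log/messages, `sudo cat /var/log/messages | oom_killed | sort | uniq -c` would show how often each process was killed.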

will-moore commented 1 year ago

Yes, monitoring would be great, thanks.

In the meantime, how do I unmount/remount to get up and running again?

sbesson commented 1 year ago

In general

sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other <bucket> <mount_point>

is the way to mount the bucket

will-moore commented 1 year ago

I previously had the bucket mounted with:

$ sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data

But now I think I need to unmount before remounting.

If I try to run that mounting command I get:

2023/09/05 11:49:40.719284 main.FATAL Unable to mount file system, see syslog for details

Not sure where to see syslog, but I guess it fails because it's already mounted.

sbesson commented 1 year ago

Agreed, and this might be a side-effect of the way the process was terminated. Does the following work?

$ sudo umount /bia-integrator-data
$ sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data

will-moore commented 1 year ago

$ sudo umount /bia-integrator-data
umount: /bia-integrator-data: target is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

$ ps -aux | grep goof
root      1998  0.4  2.7 1867300 1788576 ?     Ssl   2022 1830:24 /opt/goofys -o allow_other cellpainting-gallery /cellpainting-gallery/
root      2569  0.0  0.0 123156 20740 ?        Ssl  Apr27  59:58 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0033 /idr0033
root      4754  0.0  0.0 123220 30152 ?        Ssl  Jun26  30:11 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0011 /idr0011
root     14265  0.2  0.1 889288 109132 ?       Ssl  Apr06 521:39 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0012 /idr0012
root     14530  0.0  0.0 122452 10676 ?        Ssl  Jun27  20:22 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0025 /idr0025
root     14835  0.0  0.0 400744 52036 ?        Ssl  Apr06 200:30 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0010 /idr0010
wmoore   16387  0.0  0.0 112816   976 pts/0    S+   13:51   0:00 grep --color=auto goof
root     18524  0.0  0.0 120916  5448 ?        Ssl  May01  13:35 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0036 /idr0036
root     20978  0.0  0.0 192376 57776 ?        Ssl  Jun16  59:01 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0035 /idr0035
root     23744  0.0  0.0 120916  6444 ?        Ssl  Aug24   1:30 /opt/goofys --endpoint https://hl.fire.sdo.ebi.ac.uk/ -o allow_other biostudies-public /biostudies-public
root     26006  0.0  0.0 122708  7324 ?        Ssl  Jun27  17:33 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0026 /idr0026
root     31438  0.0  0.0 122260 14316 ?        Ssl  May18  26:01 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0054 /idr0054

sbesson commented 1 year ago

A few thoughts:

1. try to pass -f to force unmount
2. try to stop the processes that might be accessing some of the data, e.g. omero-server

will-moore commented 1 year ago

Good thoughts @sbesson!

sudo service omero-server stop
sudo umount /bia-integrator-data
sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
ls /bia-integrator-data/
# working!
sudo service omero-server start

No -f needed! Can view NGFF images again on idr0125-pilot! 👍
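For repeat occurrences, the same sequence could be wrapped in a small function. This is a sketch only: `remount_bucket` and the `DRY_RUN` switch are hypothetical, the endpoint and the /opt/goofys path are copied from the commands above, and by default it just prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manual steps above.
# With DRY_RUN=1 (the default) each command is printed, not executed.
remount_bucket() {
    local bucket="$1" mount_point="$2"
    local endpoint="https://uk1s3.embassy.ebi.ac.uk/"
    local run=""
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run="echo"
    fi
    # Stop the server first so nothing holds the mount point busy.
    $run sudo service omero-server stop || return 1
    $run sudo umount "$mount_point" || return 1
    $run sudo /opt/goofys --endpoint "$endpoint" -o allow_other "$bucket" "$mount_point" || return 1
    # Confirm the bucket is readable before bringing the server back.
    $run ls "$mount_point" || return 1
    $run sudo service omero-server start
}
```

Running `remount_bucket bia-integrator-data /bia-integrator-data` shows the commands; setting `DRY_RUN=0` would execute them.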

will-moore commented 1 year ago

Seen again today on idr-testing:

OSError: [Errno 107] Transport endpoint is not connected: '/bia-integrator-data/S-BIAD865/3edb1d3a-91da-48a9-b6a4-592328ea5f1c/3edb1d3a-91da-48a9-b6a4-592328ea5f1c.zarr'

Fixed as above

will-moore commented 1 year ago

Failed again on idr-testing. Blitz log:

(venv3) bash-4.2$ grep ".251_mkngff/04c70c80" /opt/omero/server/OMERO.server/var/log/Blitz-0.log
2023-09-20 12:26:39,561 INFO  [        ome.services.util.ServiceHandler] (l.Server-3)  Rslt:    ([demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, .zattrs, unknown], [demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, .zgroup, unknown], [demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, A, unknown], ... 3875 more)
2023-09-20 12:26:52,463 INFO  [      ome.services.OmeroFilePathResolver] (l.Server-7) Metadata only file, resulting path: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:26:52,936 INFO  [                loci.formats.ImageReader] (l.Server-7) ZarrReader initializing /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:26:53,156 ERROR [              loci.formats.FormatHandler] (l.Server-7) ZarrReader attempting to initialize file: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:37:18,975 ERROR [         ome.io.bioformats.BfPixelBuffer] (l.Server-7) Failed to instantiate BfPixelsWrapper with /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:37:18,978 ERROR [                ome.io.nio.PixelsService] (l.Server-7) Error instantiating pixel buffer: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
java.lang.RuntimeException: java.io.UncheckedIOException: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
Caused by: java.io.UncheckedIOException: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
Caused by: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
(venv3) bash-4.2$ ls -alh /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0
ls: cannot access /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Transport endpoint is not connected
(venv3) bash-4.2$ ls /bia-i
bia-idr/             bia-integrator-data  
(venv3) bash-4.2$ ls /bia-integrator-data/
ls: cannot access /bia-integrator-data/: Transport endpoint is not connected

will-moore commented 1 year ago

Will try adding --debug_s3 now and -f to run in foreground, piping stderr to log...

screen -S goofys
sudo umount /bia-integrator-data
sudo /opt/goofys --debug_s3 --endpoint https://uk1s3.embassy.ebi.ac.uk/ -f -o allow_other bia-integrator-data /bia-integrator-data 2> goofys.log

will-moore commented 10 months ago

Since goofys mount has been stable through all the memo file generation and check_pixels.py testing, this is looking good.

Still an open question of whether we need any additional monitoring of the goofys mount or if other monitoring would pick this up (e.g. if images can't be viewed, error logs etc)?

sbesson commented 10 months ago

Externally, it should be possible to extend https://github.com/IDR/upptime to monitor specific endpoints testing images available via goofys e.g. https://github.com/IDR/upptime/blob/9932b00f1b53c1e0aa53ac1ed6c715bde2fc49cc/.upptimerc.yml#L84-L85

Internally, the best strategy would be to make use of Prometheus, e.g. by defining custom alert rules like https://github.com/IDR/deployment/blob/368051656c541d315c3cc77b59762f964abf972a/ansible/idr-ftp-monitoring.yml#L23.
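As one possible input for such an alert rule, the state of a mount could be exposed as a gauge in the Prometheus text format. This is a sketch under assumptions, not what the IDR deployment does: the metric name `goofys_mount_up` and the function `mount_metric` are made up for illustration, and it presumes node_exporter's textfile collector is available to pick up the output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a gauge in the Prometheus text exposition
# format, 1 if the mount point can be listed and 0 otherwise.
# The metric name goofys_mount_up is an assumption.
mount_metric() {
    local mount_point="$1" up=0
    # `timeout` keeps a hung mount from blocking the check.
    if timeout 10 ls "$mount_point" > /dev/null 2>&1; then
        up=1
    fi
    printf 'goofys_mount_up{mountpoint="%s"} %d\n' "$mount_point" "$up"
}
```

The output of `mount_metric /bia-integrator-data` could be written periodically to a `.prom` file in the directory passed to node_exporter via `--collector.textfile.directory`, and an alert rule could then fire when the value is 0.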

will-moore commented 8 months ago

To compare goofys and geesefs (similar to the benchmark test at https://github.com/yandex-cloud/geesefs/blob/master/bench/README.md#goofys-tests), let's compare the speed of getPixels() with https://github.com/IDR/idr-utils/pull/55/commits/1f4c0bacfdf433f2d074c4079f66841ccca3149f using goofys and then with geesefs.

Let's pick a couple of different idr0011 plates each time... Check a single plane at a time

On idr-testing...

python check_pixels.py Plate:5394 --max-planes=sizeC --timing > ~/check_pix_timing_idr0011_5394sizeC.log

bash-4.2$ grep Ratio check_pix_timing_idr0011_5394sizeC.log
Ratio of local/IDR timing for 3 planes is 9.584544730914489 Image: 2850276
Ratio of local/IDR timing for 3 planes is 8.825775907082182 Image: 2850277
Ratio of local/IDR timing for 3 planes is 6.716267716681221 Image: 2850278
Ratio of local/IDR timing for 3 planes is 7.901181375504938 Image: 2850279
Ratio of local/IDR timing for 3 planes is 6.4885706177715745 Image: 2850280
Ratio of local/IDR timing for 3 planes is 6.728097131051076 Image: 2850281
Ratio of local/IDR timing for 3 planes is 7.33852890229974 Image: 2850282
...

Install geesefs

(base) [wmoore@test120-omeroreadwrite ~]$ wget https://github.com/yandex-cloud/geesefs/releases/latest/download/geesefs-linux-amd64
(base) [wmoore@test120-omeroreadwrite ~]$ chmod +x geesefs-linux-amd64
(base) [wmoore@test120-omeroreadwrite ~]$ mv geesefs-linux-amd64 geesefs
(base) [wmoore@test120-omeroreadwrite ~]$ sudo cp geesefs /usr/local/bin
(base) [wmoore@test120-omeroreadwrite ~]$ which geesefs
/usr/local/bin/geesefs

Test mount a bucket...

(base) [wmoore@test120-omeroreadwrite ~]$ sudo mkdir /bia-integrator-data-geesefs
(base) [wmoore@test120-omeroreadwrite ~]$ sudo /usr/local/bin/geesefs --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data-geesefs
s3.INFO anonymous bucket detected
main.INFO File system has been successfully mounted.

(base) [wmoore@test120-omeroreadwrite ~]$ ls /bia-integrator-data-geesefs/S-BIAD865/3edb1d3a-91da-48a9-b6a4-592328ea5f1c/3edb1d3a-91da-48a9-b6a4-592328ea5f1c.zarr
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  OME  P

Now replace goofys mount...

(base) [wmoore@test120-omeroreadwrite ~]$ sudo umount /bia-integrator-data
(base) [wmoore@test120-omeroreadwrite ~]$ sudo /usr/local/bin/geesefs --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
s3.INFO anonymous bucket detected
main.INFO File system has been successfully mounted.

Now repeat testing above... new plate...

python check_pixels.py Plate:5535 --max-planes=sizeC --timing >> ~/check_pix_timing_idr0011_5535sizeC.log

bash-4.2$ grep Ratio check_pix_timing_idr0011_5535sizeC.log 
Ratio of local/IDR timing for 3 planes is 20.961269464238587 Image: 2854786
Ratio of local/IDR timing for 3 planes is 16.9034041386036 Image: 2854787
Ratio of local/IDR timing for 3 planes is 21.04661057800459 Image: 2854788
Ratio of local/IDR timing for 3 planes is 21.61543261880246 Image: 2854789
Ratio of local/IDR timing for 3 planes is 21.537431002389557 Image: 2854790
Ratio of local/IDR timing for 3 planes is 7.2759874986868365 Image: 2854791
Ratio of local/IDR timing for 3 planes is 6.553177521825434 Image: 2854792
Ratio of local/IDR timing for 3 planes is 8.277165877939971 Image: 2854793
Ratio of local/IDR timing for 3 planes is 6.408577581678748 Image: 2854794
Ratio of local/IDR timing for 3 planes is 8.47382281509996 Image: 2854795
...

Conclusion - performance is comparable between goofys and geesefs.
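To compare the two logs with a single number rather than by eye, the ratios could be averaged. A minimal sketch, assuming the "Ratio of local/IDR timing ... Image: N" line format shown above; `ratio_summary` is a hypothetical helper, and a mean will hide patterns such as the first few images of a run being slower.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read check_pixels.py timing output on stdin and
# print the number of "Ratio" lines and their mean ratio.
ratio_summary() {
    awk '/^Ratio of local\/IDR timing/ {
             # The ratio is the field just before the "Image:" label.
             for (i = 2; i <= NF; i++)
                 if ($i == "Image:") { sum += $(i - 1); n++ }
         }
         END { if (n > 0) printf "n=%d mean=%.3f\n", n, sum / n }'
}
```

For example, `ratio_summary < check_pix_timing_idr0011_5394sizeC.log` for the goofys run and the same for the geesefs log. Since the two runs used different plates, any difference is indicative only.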

joshmoore commented 8 months ago

:+1: on similar performance. A follow up question might be whether or not it handles large numbers, though.

will-moore commented 8 months ago

@joshmoore Good point....

I tried the figure workflow on idr-testing using idr0090 NGFF images with similar results as before, then remembered that all the readonly servers are still using goofys...

So, for each of the readonly servers on idr-testing, I installed geesefs and replaced /bia-integrator-data mount as above to use geesefs.

Then tested with idr-testing...

I still managed to get the server quite unhappy, but it felt like it lasted several times longer with geesefs.

will-moore commented 8 months ago

Switched idr-testing back to using goofys...

[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo umount /bia-integrator-data"; done
[wmoore@test120-proxy ~]$ 
[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo /usr/bin/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data"; done
[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo service omero-server restart"; done
Redirecting to /bin/systemctl restart omero-server.service
Redirecting to /bin/systemctl restart omero-server.service

Message from syslogd@localhost at Feb  1 11:49:48 ...
 haproxy[19025]: backend omero4064-1 has no server available!
Redirecting to /bin/systemctl restart omero-server.service
Redirecting to /bin/systemctl restart omero-server.service

Message from syslogd@localhost at Feb  1 11:52:04 ...
 haproxy[19025]: backend omero4064-0 has no server available!
Redirecting to /bin/systemctl restart omero-server.service