Open will-moore opened 1 year ago
The logs give more clues:
Aug 31 17:01:05 pilot-idr0125-omeroreadwrite systemd: Removed slice User Slice of root.
Aug 31 17:03:44 pilot-idr0125-omeroreadwrite kernel: gunicorn invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: gunicorn cpuset=/ mems_allowed=0
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: CPU: 14 PID: 29050 Comm: gunicorn Kdump: loaded Not tainted 3.10.0-1160.45.1.el7.x86_64 #1
...
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Out of memory: Kill process 12081 (goofys) score 556 or sacrifice child
Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Killed process 12081 (goofys), UID 0, total-vm:37768132kB, anon-rss:37656340kB, file-rss:0kB, shmem-rss:0
The mount process was killed because the system ran out of memory. With a view to using this in production, I agree this should definitely be reported as soon as possible. Unless someone suggests a different option, I'll look into adding a monitoring endpoint.
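To spot this kind of kill without reading the whole log by eye, the kernel lines can be filtered with a small script. This is only a sketch: the regular expression is written against the "Killed process" line quoted above and may need adjusting for other kernel versions, and `find_oom_kills` is a name made up for this example.

```python
import re

# Matches the kernel "Killed process" line in the form quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\).*?anon-rss:(?P<rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return (pid, process name, anon-rss in GiB) for each OOM kill found."""
    kills = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match:
            rss_gib = int(match.group("rss_kb")) / (1024 ** 2)
            kills.append((int(match.group("pid")), match.group("name"), rss_gib))
    return kills


sample = [
    "Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Out of memory: "
    "Kill process 12081 (goofys) score 556 or sacrifice child",
    "Aug 31 17:03:45 pilot-idr0125-omeroreadwrite kernel: Killed process 12081 "
    "(goofys), UID 0, total-vm:37768132kB, anon-rss:37656340kB, file-rss:0kB, shmem-rss:0",
]

for pid, name, rss_gib in find_oom_kills(sample):
    print(f"{name} (pid {pid}) was killed while holding about {rss_gib:.1f} GiB")
```

In practice the lines would come from the system log rather than a hard-coded list. On a default CentOS 7 setup that is usually /var/log/messages, but that location is an assumption about this host.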
Yes, monitoring would be great, thanks.
In the meantime, how do I unmount/remount to get up and running again?
In general, the way to mount the bucket is:
sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other <bucket> <mount_point>
I previously had the bucket mounted with:
$ sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
But now I think I need to unmount before remounting.
If I try to run that mounting command I get:
2023/09/05 11:49:40.719284 main.FATAL Unable to mount file system, see syslog for details
Not sure where to see syslog, but I guess it fails because it's already mounted.
Agreed, and this might be a side-effect of the way the process was terminated. Does the following work?
$ sudo umount /bia-integrator-data
$ sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
$ sudo umount /bia-integrator-data
umount: /bia-integrator-data: target is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
$ ps -aux | grep goof
root 1998 0.4 2.7 1867300 1788576 ? Ssl 2022 1830:24 /opt/goofys -o allow_other cellpainting-gallery /cellpainting-gallery/
root 2569 0.0 0.0 123156 20740 ? Ssl Apr27 59:58 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0033 /idr0033
root 4754 0.0 0.0 123220 30152 ? Ssl Jun26 30:11 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0011 /idr0011
root 14265 0.2 0.1 889288 109132 ? Ssl Apr06 521:39 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0012 /idr0012
root 14530 0.0 0.0 122452 10676 ? Ssl Jun27 20:22 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0025 /idr0025
root 14835 0.0 0.0 400744 52036 ? Ssl Apr06 200:30 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0010 /idr0010
wmoore 16387 0.0 0.0 112816 976 pts/0 S+ 13:51 0:00 grep --color=auto goof
root 18524 0.0 0.0 120916 5448 ? Ssl May01 13:35 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0036 /idr0036
root 20978 0.0 0.0 192376 57776 ? Ssl Jun16 59:01 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0035 /idr0035
root 23744 0.0 0.0 120916 6444 ? Ssl Aug24 1:30 /opt/goofys --endpoint https://hl.fire.sdo.ebi.ac.uk/ -o allow_other biostudies-public /biostudies-public
root 26006 0.0 0.0 122708 7324 ? Ssl Jun27 17:33 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0026 /idr0026
root 31438 0.0 0.0 122260 14316 ? Ssl May18 26:01 /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0054 /idr0054
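Note that the listing above has no goofys process for /bia-integrator-data, which fits the OOM kill seen in the logs: the process is gone but the mount point is still registered. A rough way to check this automatically is to parse the `ps` output and compare it with the mount points we expect. The sketch below assumes the command lines always end with `<bucket> <mount_point>`, as they do in this thread; `goofys_mounts` is a hypothetical helper, not part of goofys.

```python
import shlex


def goofys_mounts(ps_lines, binary="/opt/goofys"):
    """Map mount point -> bucket for each goofys process in `ps aux` output."""
    mounts = {}
    for line in ps_lines:
        if binary not in line:
            continue  # skips unrelated processes, including the grep itself
        # Everything from the binary path onwards is the command line.
        args = shlex.split(line[line.index(binary):])
        # Assumption: the last two arguments are <bucket> <mount_point>.
        if len(args) >= 3:
            bucket, mount_point = args[-2], args[-1]
            mounts[mount_point.rstrip("/") or "/"] = bucket
    return mounts


sample = [
    "root 1998 0.4 2.7 1867300 1788576 ? Ssl 2022 1830:24 /opt/goofys "
    "-o allow_other cellpainting-gallery /cellpainting-gallery/",
    "root 2569 0.0 0.0 123156 20740 ? Ssl Apr27 59:58 /opt/goofys "
    "--endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other idr0033 /idr0033",
    "wmoore 16387 0.0 0.0 112816 976 pts/0 S+ 13:51 0:00 grep --color=auto goof",
]

running = goofys_mounts(sample)
for expected in ["/idr0033", "/bia-integrator-data"]:
    status = "running" if expected in running else "NO goofys process"
    print(f"{expected}: {status}")
```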
A few thoughts:
1- try to pass -f to force unmount
2- try to stop the processes that might be accessing some of the data e.g. omero-server
Good thoughts @sbesson!
sudo service omero-server stop
sudo umount /bia-integrator-data
sudo /opt/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
ls /bia-integrator-data/
# working!
sudo service omero-server start
No -f needed!
Can view NGFF images again on idr0125-pilot! 👍
Seen again today on idr-testing:
OSError: [Errno 107] Transport endpoint is not connected: '/bia-integrator-data/S-BIAD865/3edb1d3a-91da-48a9-b6a4-592328ea5f1c/3edb1d3a-91da-48a9-b6a4-592328ea5f1c.zarr'
Fixed as above
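Errno 107 is ENOTCONN, which is what a FUSE mount point returns once the process serving it has gone away. That makes it fairly easy to tell a stale mount from a missing directory in a script. The following is a minimal sketch, not something we run today: `mount_state` is a made-up name, and treating ECONNABORTED ("Software caused connection abort") as stale too is an assumption. It does not check that the path is actually a mount, and a `stat` on a hung mount could block.

```python
import errno
import os

# Error codes that suggest the FUSE process behind the mount has died.
STALE_ERRNOS = {errno.ENOTCONN, errno.ECONNABORTED}


def mount_state(path):
    """Classify a mount point as 'ok', 'stale' or 'missing'."""
    try:
        os.stat(path)
    except OSError as exc:
        if exc.errno in STALE_ERRNOS:
            return "stale"
        if exc.errno == errno.ENOENT:
            return "missing"
        raise  # anything else (e.g. permissions) is left to the caller
    return "ok"


if __name__ == "__main__":
    for mount_point in ["/bia-integrator-data"]:
        print(mount_point, mount_state(mount_point))
```

A "stale" result would be the cue to unmount and remount as above.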
Failed again on idr-testing. Blitz log:
(venv3) bash-4.2$ grep ".251_mkngff/04c70c80" /opt/omero/server/OMERO.server/var/log/Blitz-0.log
2023-09-20 12:26:39,561 INFO [ ome.services.util.ServiceHandler] (l.Server-3) Rslt: ([demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, .zattrs, unknown], [demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, .zgroup, unknown], [demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/, A, unknown], ... 3875 more)
2023-09-20 12:26:52,463 INFO [ ome.services.OmeroFilePathResolver] (l.Server-7) Metadata only file, resulting path: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:26:52,936 INFO [ loci.formats.ImageReader] (l.Server-7) ZarrReader initializing /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:26:53,156 ERROR [ loci.formats.FormatHandler] (l.Server-7) ZarrReader attempting to initialize file: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:37:18,975 ERROR [ ome.io.bioformats.BfPixelBuffer] (l.Server-7) Failed to instantiate BfPixelsWrapper with /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
2023-09-20 12:37:18,978 ERROR [ ome.io.nio.PixelsService] (l.Server-7) Error instantiating pixel buffer: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/OME/METADATA.ome.xml
java.lang.RuntimeException: java.io.UncheckedIOException: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
Caused by: java.io.UncheckedIOException: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
Caused by: java.nio.file.FileSystemException: /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Software caused connection abort
(venv3) bash-4.2$ ls -alh /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0
ls: cannot access /data/OMERO/ManagedRepository/demo_2/2016-04/30/15-54-22.251_mkngff/04c70c80-bc2e-4210-a21f-d2f02108b829.zarr/M/17/0/0: Transport endpoint is not connected
(venv3) bash-4.2$ ls /bia-i
bia-idr/ bia-integrator-data
(venv3) bash-4.2$ ls /bia-integrator-data/
ls: cannot access /bia-integrator-data/: Transport endpoint is not connected
Will now try adding --debug_s3 and -f to run in the foreground, piping stderr to a log...
screen -S goofys
sudo umount /bia-integrator-data
sudo /opt/goofys --debug_s3 --endpoint https://uk1s3.embassy.ebi.ac.uk/ -f -o allow_other bia-integrator-data /bia-integrator-data 2> goofys.log
Since the goofys mount has been stable through all the memo file generation and check_pixels.py testing, this is looking good.
Still an open question of whether we need any additional monitoring of the goofys mount or if other monitoring would pick this up (e.g. if images can't be viewed, error logs etc)?
Externally, it should be possible to extend https://github.com/IDR/upptime to monitor specific endpoints testing images available via goofys
e.g. https://github.com/IDR/upptime/blob/9932b00f1b53c1e0aa53ac1ed6c715bde2fc49cc/.upptimerc.yml#L84-L85
Internally, the best strategy would be to make use of Prometheus, e.g. by defining custom alert rules like https://github.com/IDR/deployment/blob/368051656c541d315c3cc77b59762f964abf972a/ansible/idr-ftp-monitoring.yml#L23.
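One possible way to feed Prometheus, sketched here under assumptions: a small script run from cron checks each mount point and writes a gauge in the Prometheus text format, to be picked up by something like the node_exporter textfile collector. The metric name `goofys_mount_up`, the label name and the list of mount points are all hypothetical, and this has not been tried against our deployment.

```python
import os


def is_responsive(path):
    """Return True if stat() on the mount point succeeds."""
    try:
        os.stat(path)
        return True
    except OSError:
        return False


def render_mount_metrics(states, metric="goofys_mount_up"):
    """Render {mount_point: bool} as Prometheus text-format gauge lines."""
    lines = [
        f"# HELP {metric} 1 if the mount point responds to stat(), else 0.",
        f"# TYPE {metric} gauge",
    ]
    for mount_point in sorted(states):
        value = 1 if states[mount_point] else 0
        lines.append(f'{metric}{{mountpoint="{mount_point}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical list; in practice this would come from configuration.
    mount_points = ["/bia-integrator-data"]
    states = {path: is_responsive(path) for path in mount_points}
    print(render_mount_metrics(states), end="")
```

An alert rule could then fire when `goofys_mount_up == 0` for a given mount point.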
To compare goofys and geesefs (similar to the benchmark test at https://github.com/yandex-cloud/geesefs/blob/master/bench/README.md#goofys-tests), let's compare the speed of getPixels() with https://github.com/IDR/idr-utils/pull/55/commits/1f4c0bacfdf433f2d074c4079f66841ccca3149f using goofys and then with geesefs.
Let's pick a couple of different idr0011 plates each time... Check a single plane at a time
On idr-testing
...
python check_pixels.py Plate:5394 --max-planes=sizeC --timing > ~/check_pix_timing_idr0011_5394sizeC.log
bash-4.2$ grep Ratio check_pix_timing_idr0011_5394sizeC.log
Ratio of local/IDR timing for 3 planes is 9.584544730914489 Image: 2850276
Ratio of local/IDR timing for 3 planes is 8.825775907082182 Image: 2850277
Ratio of local/IDR timing for 3 planes is 6.716267716681221 Image: 2850278
Ratio of local/IDR timing for 3 planes is 7.901181375504938 Image: 2850279
Ratio of local/IDR timing for 3 planes is 6.4885706177715745 Image: 2850280
Ratio of local/IDR timing for 3 planes is 6.728097131051076 Image: 2850281
Ratio of local/IDR timing for 3 planes is 7.33852890229974 Image: 2850282
...
Install geesefs
(base) [wmoore@test120-omeroreadwrite ~]$ wget https://github.com/yandex-cloud/geesefs/releases/latest/download/geesefs-linux-amd64
(base) [wmoore@test120-omeroreadwrite ~]$ chmod +x geesefs-linux-amd64
(base) [wmoore@test120-omeroreadwrite ~]$ mv geesefs-linux-amd64 geesefs
(base) [wmoore@test120-omeroreadwrite ~]$ sudo cp geesefs /usr/local/bin
(base) [wmoore@test120-omeroreadwrite ~]$ which geesefs
/usr/local/bin/geesefs
Test mount a bucket...
(base) [wmoore@test120-omeroreadwrite ~]$ sudo mkdir /bia-integrator-data-geesefs
(base) [wmoore@test120-omeroreadwrite ~]$ sudo /usr/local/bin/geesefs --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data-geesefs
s3.INFO anonymous bucket detected
main.INFO File system has been successfully mounted.
(base) [wmoore@test120-omeroreadwrite ~]$ ls /bia-integrator-data-geesefs/S-BIAD865/3edb1d3a-91da-48a9-b6a4-592328ea5f1c/3edb1d3a-91da-48a9-b6a4-592328ea5f1c.zarr
A B C D E F G H I J K L M N O OME P
Now replace goofys mount...
(base) [wmoore@test120-omeroreadwrite ~]$ sudo umount /bia-integrator-data
(base) [wmoore@test120-omeroreadwrite ~]$ sudo /usr/local/bin/geesefs --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data
s3.INFO anonymous bucket detected
main.INFO File system has been successfully mounted.
Now repeat testing above... new plate...
python check_pixels.py Plate:5535 --max-planes=sizeC --timing >> ~/check_pix_timing_idr0011_5535sizeC.log
bash-4.2$ grep Ratio check_pix_timing_idr0011_5535sizeC.log
Ratio of local/IDR timing for 3 planes is 20.961269464238587 Image: 2854786
Ratio of local/IDR timing for 3 planes is 16.9034041386036 Image: 2854787
Ratio of local/IDR timing for 3 planes is 21.04661057800459 Image: 2854788
Ratio of local/IDR timing for 3 planes is 21.61543261880246 Image: 2854789
Ratio of local/IDR timing for 3 planes is 21.537431002389557 Image: 2854790
Ratio of local/IDR timing for 3 planes is 7.2759874986868365 Image: 2854791
Ratio of local/IDR timing for 3 planes is 6.553177521825434 Image: 2854792
Ratio of local/IDR timing for 3 planes is 8.277165877939971 Image: 2854793
Ratio of local/IDR timing for 3 planes is 6.408577581678748 Image: 2854794
Ratio of local/IDR timing for 3 planes is 8.47382281509996 Image: 2854795
...
Conclusion - performance is comparable between goofys and geesefs.
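For a less eyeballed comparison, the "Ratio" lines from each log can be summarised with a few lines of Python. This is a sketch only: the regular expression follows the log lines quoted above, `read_ratios` and `summarise` are names invented here, and since the two runs used different plates and the listings are truncated, any summary is indicative at best.

```python
import re
import statistics

RATIO_RE = re.compile(
    r"Ratio of local/IDR timing for (\d+) planes is ([\d.]+) Image: (\d+)"
)


def read_ratios(lines):
    """Return {image_id: ratio} for every 'Ratio' line found."""
    ratios = {}
    for line in lines:
        match = RATIO_RE.search(line)
        if match:
            ratios[int(match.group(3))] = float(match.group(2))
    return ratios


def summarise(ratios):
    """Return count, median, min and max of the ratios."""
    values = list(ratios.values())
    return {
        "n": len(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


sample = [
    "Ratio of local/IDR timing for 3 planes is 9.584544730914489 Image: 2850276",
    "Ratio of local/IDR timing for 3 planes is 8.825775907082182 Image: 2850277",
    "Ratio of local/IDR timing for 3 planes is 6.716267716681221 Image: 2850278",
]
print(summarise(read_ratios(sample)))
```

To use it on the real logs, pass the lines of each check_pix_timing_*.log file to `read_ratios` instead of `sample`.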
:+1: on similar performance. A follow up question might be whether or not it handles large numbers, though.
@joshmoore Good point....
I tried the figure workflow on idr-testing using idr0090 NGFF images with similar results as before, then remembered that all the readonly servers are still using goofys...
So, for each of the readonly servers on idr-testing, I installed geesefs and replaced the /bia-integrator-data mount as above to use geesefs.
Then tested with idr-testing...
I still managed to get the server quite unhappy, but it felt like it lasted several times longer with geesefs.
Switched idr-testing back to using goofys...
[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo umount /bia-integrator-data"; done
[wmoore@test120-proxy ~]$
[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo /usr/bin/goofys --endpoint https://uk1s3.embassy.ebi.ac.uk/ -o allow_other bia-integrator-data /bia-integrator-data"; done
[wmoore@test120-proxy ~]$ for server in omeroreadwrite omeroreadonly-1 omeroreadonly-2 omeroreadonly-3 omeroreadonly-4; do ssh $server "sudo service omero-server restart"; done
Redirecting to /bin/systemctl restart omero-server.service
Redirecting to /bin/systemctl restart omero-server.service
Message from syslogd@localhost at Feb 1 11:49:48 ...
haproxy[19025]: backend omero4064-1 has no server available!
Redirecting to /bin/systemctl restart omero-server.service
Redirecting to /bin/systemctl restart omero-server.service
Message from syslogd@localhost at Feb 1 11:52:04 ...
haproxy[19025]: backend omero4064-0 has no server available!
Redirecting to /bin/systemctl restart omero-server.service
All the BioStudies s3 data imported to idr0125-pilot is currently giving ResourceError when trying to view images. This is due to a failure of the goofys-mounted BioStudies s3 bucket. See https://github.com/kahing/goofys/issues/208. Advice is to unmount and re-mount.
Tried un-mounting as at https://github.com/kahing/goofys/issues/77
As omero-server user:
This is currently a blocker on my NGFF update work on idr0125-pilot (can use idr0138-pilot in the meantime) but it also raises questions on how to detect and fix this once we start using it on the production IDR server.
cc @sbesson @joshmoore