IDR / idr-utils

Utility scripts for managing IDR submissions
BSD 2-Clause "Simplified" License

No check #62

Closed will-moore closed 7 months ago

will-moore commented 7 months ago

Just for testing: use --no-check to avoid connecting to IDR, so we can focus on the performance of getPlanes() on the current server.

will-moore commented 7 months ago

On idr-next, running lots of parallel jobs causes problems (https://github.com/IDR/idr-utils/pull/55#issuecomment-1916742257) but running on a single thread doesn't (https://github.com/IDR/idr-utils/pull/55#issuecomment-1918885238).

Let's use a small number of parallel threads on just omeroreadonly-1...

[wmoore@prod120-proxy ~]$ cat nodes
omeroreadonly-1

$ screen -dmS cache parallel --eta --sshloginfile nodes -a ids_idr0016.txt -j10 '/opt/omero/server/OMERO.server/bin/omero login -s localhost -u public -w public && /opt/omero/server/venv3/bin/python /uod/idr/metadata/idr-utils/scripts/check_pixels.py --render >> /tmp/render_20240131.log'
screen -r

Computers / CPU cores / Max jobs to run
1:omeroreadonly-1 / 8 / 9

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 413 AVG: 0.00s  omeroreadonly-1:9/0/100%/0.0s 

EDIT: after 14 mins, Blitz log shows gaps of no activity for several mins, e.g. 13:38:03 -> 13:40:01 when we'd expect rendering to be happening constantly...

[wmoore@prod120-omeroreadonly-1 ~]$ tail -f /opt/omero/server/OMERO.server/var/log/Blitz-0.log
...
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Rslt:    null
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Meth:    interface omeis.providers.re.RenderingEngine.renderCompressed
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Args:    [Type: XY, z=0, t=0, renderShapes=false, shapeIds=[]]
2024-01-31 13:36:46,008 INFO  [             omeis.providers.re.Renderer] (Server-155) Using: 'omeis.providers.re.HSBStrategy' rendering strategy.
2024-01-31 13:37:05,095 DEBUG [                   loci.formats.Memoizer] (Server-158) start[1706708176678] time[48417] tag[loci.formats.Memoizer.setId]
2024-01-31 13:37:05,096 INFO  [                ome.io.nio.PixelsService] (Server-158) Creating BfPixelBuffer: /data/OMERO/ManagedRepository/demo_2/2016-06/16/04-33-36.550_mkngff/45e4cbb8-7ac6-4060-aa32-0b8f975a2894.zarr/.zattrs Series: 1920
2024-01-31 13:38:03,947 INFO  [                 org.perf4j.TimingLogger] (Server-158) start[1706708166663] time[117284] tag[omero.call.success.ome.services.RenderingBean$12.doWork]
2024-01-31 13:40:01,104 INFO  [ome.services.sessions.state.SessionCache] (2-thread-1) Synchronizing session cache. Count = 4
2024-01-31 13:40:30,257 INFO  [ ome.services.blitz.fire.SessionManagerI] (2-thread-5) Performing requestHeartbeats
2024-01-31 13:40:10,886 INFO  [        ome.services.util.ServiceHandler] (Server-158)  Rslt:    ome.io.bioformats.BfPixelBuffer@199c2b33
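
A quick way to spot those quiet periods without reading the log by eye is to sort the timestamps and report any long interval between neighbours. This is only a sketch: the timestamps are copied from the excerpt above, while `find_gaps` and the 60 second threshold are illustrative choices, not part of idr-utils. Sorting first matters because Blitz lines are not always written in strict time order (see the last two lines of the excerpt).

```python
from datetime import datetime

# Timestamps copied from the Blitz-0.log excerpt above (messages omitted).
STAMPS = [
    "2024-01-31 13:36:46,008",
    "2024-01-31 13:37:05,095",
    "2024-01-31 13:37:05,096",
    "2024-01-31 13:38:03,947",
    "2024-01-31 13:40:01,104",
    "2024-01-31 13:40:30,257",
    "2024-01-31 13:40:10,886",
]


def find_gaps(stamps, min_seconds=60.0):
    """Return (earlier, later, seconds) for each quiet period >= min_seconds."""
    # Sort because log lines can be written slightly out of order.
    times = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps)
    gaps = []
    for earlier, later in zip(times, times[1:]):
        delta = (later - earlier).total_seconds()
        if delta >= min_seconds:
            gaps.append((earlier, later, delta))
    return gaps


for earlier, later, delta in find_gaps(STAMPS):
    print(f"{earlier:%H:%M:%S} -> {later:%H:%M:%S}: {delta:.0f}s with no log lines")
# prints: 13:38:03 -> 13:40:01: 117s with no log lines
```

For a real log, the same function could be fed the first 23 characters of each line that starts with a date.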

Just seeing first errors:

[wmoore@prod120-omeroreadonly-1 ~]$ tail -f /tmp/render_20240131.log
160/2304 Render Image:2052592 24307 [Well B14, Field 4]
161/2304 Render Image:2052593 24307 [Well B14, Field 5]
162/2304 Render Image:2052594 24307 [Well B14, Field 6]
163/2304 Render Image:2052596 24307 [Well N17, Field 1]
164/2304 Render Image:2052597 24307 [Well N17, Field 2]
165/2304 Render Image:2052598 24307 [Well N17, Field 3]
166/2304 Render Image:2052599 24307 [Well N17, Field 4]
167/2304 Render Image:2052600 24307 [Well N17, Field 5]
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /tmp/render_20240131.log
Error: RenderJpeg Image:2043403 24278 [Well I20, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2051658 24319 [Well O23, Field 3] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2056263 24352 [Well B3, Field 6] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2047360 24304 [Well A1, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2042440 24279 [Well I8, Field 1] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2057293 24507 [Well N9, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2060877 24512 [Well M5, Field 6] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed

Those UnknownLocalExceptions show more info when ex is raised:

Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
{
    unknown = ConnectionI.cpp:2052: Ice::ConnectTimeoutException:
timeout while establishing a connection
}

But the "catching classes that do not inherit from BaseException is not allowed" errors don't have any other info in that log. No errors from today in the Blitz log:

[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /opt/omero/server/OMERO.server/var/log/Blitz-0.log
2024-01-30 12:41:46,012 WARN  [            ome.services.blitz.fire.Ring] (      main) Error getting uuid from node ClusterNode/5a3099a1-3ea7-4cb8-a343-104dab25066b -t -e 1.1:tcp -h 10.35.199.43 -p 37012 -t 60000:tcp -h 192.168.120.132 -p 37012 -t 60000 -- removing.
2024-01-30 12:47:12,465 WARN  [            ome.services.blitz.fire.Ring] (      main) Error getting uuid from node ClusterNode/7ace86bd-3fe0-4aeb-8af2-e224eeefd894 -t -e 1.1:tcp -h 10.35.199.43 -p 37017 -t 60000:tcp -h 192.168.120.132 -p 37017 -t 60000 -- removing.
2024-01-30 14:00:02,974 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-17) Error reaping session 7b8e7456-b98e-430d-a0b1-df092b0fdb6c from client b3bd52d6-df48-4c60-bdf0-5e665effbe78
2024-01-30 14:09:59,588 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-32) Error while creating ServiceFactoryI
2024-01-30 14:13:32,712 ERROR [ ome.services.blitz.fire.SessionManagerI] (l.Server-5) Error while creating ServiceFactoryI
2024-01-30 14:13:32,718 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-34) Error while creating ServiceFactoryI
2024-01-30 14:16:34,808 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-19) Error while creating ServiceFactoryI
2024-01-30 14:17:14,805 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-21) Error while creating ServiceFactoryI
2024-01-30 14:31:58,445 WARN  [            ome.services.blitz.fire.Ring] (2-thread-3) Error getting uuid from node ClusterNode/0d867a63-7138-4d8c-9741-5e9f87822010 -t -e 1.1:tcp -h 10.35.199.43 -p 42195 -t 60000:tcp -h 192.168.120.132 -p 42195 -t 60000 -- removing.
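
For reference, "catching classes that do not inherit from BaseException is not allowed" is the text of a Python TypeError. Python raises it when an `except` clause names something that is not an exception class, and it only evaluates that clause once some other error has already been raised inside the `try`. The sketch below reproduces the message in isolation. `NotAnException`, `render_jpeg` and `attempt` are hypothetical names for illustration, and nothing here is taken from check_pixels.py, so whether the script hits the same pattern is an assumption that would need checking.

```python
class NotAnException:
    """Hypothetical stand-in for a name that is not an exception class."""


def render_jpeg():
    # Hypothetical failure, standing in for whatever really went wrong.
    raise ConnectionError("timeout while establishing a connection")


def attempt():
    try:
        render_jpeg()
    except NotAnException:
        # Never reached: matching against a non-exception class raises
        # TypeError, and only once render_jpeg() has actually failed.
        return "handled"


try:
    attempt()
except TypeError as ex:
    message = str(ex)
    # The error that triggered the except clause is normally still
    # reachable here, even though str(ex) does not mention it.
    original = ex.__context__

print(message)
print(repr(original))
```

If only `str(ex)` is written to the log, the underlying error is lost, which would be consistent with those lines carrying no further detail.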

will-moore commented 7 months ago

It seems that the --render script stopped at the last Error above (9 Errors in total).

[wmoore@prod120-omeroreadonly-1 ~]$ tail /tmp/render_20240131.log
...
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed

[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /tmp/render_20240131.log | wc
      9     126     995

That would seem to correspond with the number of jobs running:

Computers / CPU cores / Max jobs to run
1:omeroreadonly-1 / 8 / 9

and is likely due to the fact that raising the Exception causes each job to stop after the first Error.
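
To make that concrete, here is a minimal sketch of a per-job loop that logs an error and then either re-raises or carries on. It is not the code in check_pixels.py: `run_job`, `flaky_render`, the `reraise` flag and the image IDs are all made up for illustration, and the outer `try` simply stands in for the process exiting.

```python
def flaky_render(image_id):
    """Hypothetical renderer that fails on one particular image."""
    if image_id == 103:
        raise RuntimeError("timeout while establishing a connection")


def run_job(image_ids, render, reraise):
    """Return the log lines one job would write."""
    log = []
    try:
        for idx, image_id in enumerate(image_ids):
            log.append(f"{idx}/{len(image_ids)} Render Image:{image_id}")
            try:
                render(image_id)
            except Exception as ex:
                log.append(f"Error: RenderJpeg Image:{image_id} {ex}")
                if reraise:
                    raise  # propagates out of the loop and ends the job
    except Exception:
        pass  # stands in for the job's process terminating
    return log


ids = [101, 102, 103, 104, 105]
stopped = run_job(ids, flaky_render, reraise=True)
continued = run_job(ids, flaky_render, reraise=False)
print(len(stopped), len(continued))  # 4 6
```

With `reraise=True` the log ends at the first Error line, which is the same shape as the tail of /tmp/render_20240131.log above. With `reraise=False` the job logs the error and moves on to the remaining images.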

The total number of Images rendered before all jobs failed is:

[wmoore@prod120-omeroreadonly-1 ~]$ grep "Render Image" /tmp/render_20240131.log | wc
   1332   10670   73367

Dividing 1332 images between 9 jobs gives an average of 148 images per job before the first Error. We can see that's a pretty good estimate: it's actually around 166 for 8 of the jobs and 0 for the other one:

[wmoore@prod120-omeroreadonly-1 ~]$ grep -B 1 Error /tmp/render_20240131.log
167/2304 Render Image:2043403 24278 [Well I20, Field 5]
Error: RenderJpeg Image:2043403 24278 [Well I20, Field 5] exception ::Ice::UnknownLocalException
--
165/2304 Render Image:2051658 24319 [Well O23, Field 3]
Error: RenderJpeg Image:2051658 24319 [Well O23, Field 3] exception ::Ice::UnknownLocalException
--
162/2304 Render Image:2056263 24352 [Well B3, Field 6]
Error: RenderJpeg Image:2056263 24352 [Well B3, Field 6] exception ::Ice::UnknownLocalException
--
0/2304 Render Image:2047360 24304 [Well A1, Field 1]
Error: RenderJpeg Image:2047360 24304 [Well A1, Field 1] catching classes that do not inherit from BaseException is not allowed
--
163/2304 Render Image:2042440 24279 [Well I8, Field 1]
Error: RenderJpeg Image:2042440 24279 [Well I8, Field 1] exception ::Ice::UnknownLocalException
--
169/2304 Render Image:2057293 24507 [Well N9, Field 1]
Error: RenderJpeg Image:2057293 24507 [Well N9, Field 1] catching classes that do not inherit from BaseException is not allowed
--
168/2304 Render Image:2060877 24512 [Well M5, Field 6]
Error: RenderJpeg Image:2060877 24512 [Well M5, Field 6] catching classes that do not inherit from BaseException is not allowed
--
161/2304 Render Image:2047046 24297 [Well H1, Field 5]
Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
--
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
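
The numbers above can be cross-checked with a few lines of Python. The indices are copied from the `grep -B 1` output. The one assumption is that the counter is zero-based, so a job that stopped at index i had written i + 1 "Render Image" lines, which is what makes the total agree with the 1332 reported by `wc`.

```python
# Index on the "Render Image" line just before each Error, one per job,
# copied from the grep -B 1 output above.
last_index = [167, 165, 162, 0, 163, 169, 168, 161, 168]

# Assumption: zero-based counter, so index i means i + 1 lines were logged.
lines_per_job = [i + 1 for i in last_index]

total = sum(lines_per_job)
mean_all = total / len(lines_per_job)

# Leave out the job that failed on its very first image.
later_jobs = [n for n in lines_per_job if n > 1]
mean_later = sum(later_jobs) / len(later_jobs)

print(total)       # 1332
print(mean_all)    # 148.0
print(mean_later)  # 166.375
```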

Is this now at a useful state for adding to tests etc?