JaneliaSciComp / bigstream

Tools for distributed alignment of massive images
BSD 3-Clause "New" or "Revised" License

easifish tutorial: issue with unknown kwargs #18

Closed: HomoPolyethylen closed this issue 6 months ago

HomoPolyethylen commented 11 months ago

When trying to run the EASI-FISH tutorial, I ran into a runtime error in easifish_registration_pipeline (application_pipelines.py:230):

TypeError: Server.__init__() got an unexpected keyword argument 'ncpus'

The error gets thrown at line 230.

 # if no cluster was given, make one then run on it
    if cluster is None:
        with cluster_constructor(**{**c, **cluster_kwargs}) as cluster:    # <--- this is line 230
            deform = alignment(cluster)
            aligned = resample(cluster)

It seems c is the problem. When I comment out 'ncpus':1, I get the same error with the next kwarg, threads, then with min_workers, and then with max_workers. config seems to be fine.

    c = {'ncpus':1,
         'threads':1,
         'min_workers':10,
         'max_workers':100,
         'config':{
             'distributed.worker.memory.target':0.9,
             'distributed.worker.memory.spill':0.9,
             'distributed.worker.memory.pause':0.9,
         },
    }

Commenting out / removing the items ncpus, threads, min_workers, and max_workers, or removing c from line 230, resolves the issue.

Is this a known issue? How important are these kwargs?

GFleishman commented 11 months ago

Hi again!

I'm sorry for this, but the tutorials are simply not always up to date and not always applicable in all scenarios. Thanks for asking for help here, since it's good to clarify this stuff and have it documented somewhere.

Your c dictionary, which I typically call cluster_kwargs, is meant to contain parameters that control how and where the distributed computations will be executed. That is, things like the number of workers that will be created and what resources each of those workers will have. You can set things like the number of cpus a worker has access to, and how many threads dask can submit to that worker at a time.

The c dictionary you show here is meant to specify values for the janelia_lsf_cluster object defined here. However, you're not at Janelia, and you're probably running on a workstation, so bigstream has decided to make a local_cluster object instead. So your c dictionary should only contain arguments that can be passed to the local_cluster constructor (its __init__ function). Local clusters are usually easier: dask has access to all the resources on the machine and can figure out how to make workers itself, so you don't need to specify much.
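To make that concrete, here is a minimal sketch of what a local cluster_kwargs could look like. All of these are standard dask.distributed.LocalCluster arguments; the specific values are placeholders, not recommendations:

# A minimal sketch of cluster_kwargs for a local run. For small tests you
# can often pass an empty dictionary and let dask size the cluster itself.
cluster_kwargs = {}

# or, if you want some control over parallelism and memory per worker:
cluster_kwargs = {
    'n_workers': 2,              # number of worker processes
    'threads_per_worker': 1,     # dask threads per worker
    'memory_limit': '4GB',       # memory budget per worker
}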

If you're processing large EASI-FISH datasets, however, you will need to think about the total size of your data and the resources available on your machine. Bigstream tries hard to make big data problems possible on workstations, but that could mean tying up resources on your computer for a long time, like days. It's always best to get a smaller test case working before committing to a huge computation on a small machine; for example, a 1024x1024x512 crop of your image that you treat as 4x4x2 blocks of size 256.
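As a minimal sketch of that idea (the variable names here are placeholders, not tutorial variables): a 1024x1024x512 crop with a blocksize of 256 gives 4x4x2 = 32 distributed blocks.

import numpy as np

# crop a test region out of your large volume (full_volume is a placeholder name)
crop = full_volume[:1024, :1024, :512]
blocksize = [256, 256, 256]

# number of blocks the crop will be split into along each axis
n_blocks = [int(np.ceil(s / b)) for s, b in zip(crop.shape, blocksize)]
print(n_blocks)    # -> [4, 4, 2]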

If you do have access to a cluster, but not an LSF cluster like we have at Janelia, then you may want to consider implementing your own cluster object using dask-jobqueue. Here is an example of a SLURM cluster object that others have implemented (see line 322 here).
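As a rough sketch (not bigstream code; the queue name, per-job resources, and scaling below are placeholders for your own site's configuration), a dask-jobqueue SLURM cluster looks something like this:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='normal',          # your SLURM partition
    cores=4,                 # cores per worker job
    memory='16GB',           # memory per worker job
    walltime='04:00:00',
)
cluster.scale(jobs=10)       # request ten worker jobs from SLURM
client = Client(cluster)     # dask client that would submit the block alignments

You would still need to wrap something like this so it exposes the client the same way bigstream's cluster objects do (bigstream calls cluster.client.map internally).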

HomoPolyethylen commented 11 months ago

Hello and thanks for the quick and thorough response! So far, I am merely trying to run the test images you provide and then our own test set. So I am using my local machine (with limited resources, but I would guess enough for a test run).

After deleting the Janelia-specific kwargs, it runs and eats up all memory until it crashes. Is there a kwarg to limit the memory? Does this mean that I have less than the minimally required memory? And (last question, sorry): are the kwargs documented somewhere? (These get passed to dask, right?)

GFleishman commented 11 months ago

Happy to help. Running the test data, and then your own small tests, is a good way to start. The test data included in the package should be small enough to run on typical workstations or even an average laptop.

The cluster_kwargs are just arguments to this class: https://github.com/GFleishman/ClusterWrap/blob/ecdcb7a419ff7261a1e7ebc5c88b3103e4b2abc3/ClusterWrap/clusters.py#L176

That class is a very thin wrapper around the dask.distributed.LocalCluster object, and you can see that kwargs are passed down to it. So your cluster_kwargs dictionary can contain any of the arguments documented here: https://distributed.dask.org/en/latest/api.html#distributed.LocalCluster It sounds like the memory_limit argument might be an important one for you to set well.

Notice also that you can set any of the global dask configuration values in the cluster_kwargs if you need. That's any value you can find here: https://docs.dask.org/en/latest/configuration.html#configuration-reference You shouldn't have to mess around with those too much at first, but just so you know that they are there.

So to have 4 workers processing the data in parallel, and each worker having access to 20% of the total system memory, you could have something like this:

cluster_kwargs = {
    'n_workers':4,
    'memory_limit':0.2,
}

And if you need to set any of the dask global configuration options (there are admittedly many of them, and it's not always transparent what they do, but eventually it's helpful to know how some of them work), then you can add:

cluster_kwargs = {
    'n_workers':4,
    'memory_limit':0.2,
    'config':{
        'temporary-directory':'/path/to/where/you/want/cache/and/temporary/files/written/by/dask',
    }
}

I'm just setting the temporary-directory there as an example, but anything in that dask config reference is a valid option in the config dictionary.
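For reference, those same values can also be set directly through dask's configuration API before the cluster is created; a minimal sketch (the path is a placeholder):

import dask

# equivalent of passing these through the 'config' entry of cluster_kwargs
dask.config.set({
    'temporary-directory': '/scratch/dask-tmp',
    'distributed.worker.memory.pause': 0.9,
})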

GFleishman commented 11 months ago

Last - I'm actively working on bigstream every day, so to keep up with any major changes to the repository or issues resolved or anything like that click the star button for the repo. It also helps me keep track of who is using the software so I can reach out with updates, improvements, and also show my own colleagues and funding sources how the software is being adopted.

GFleishman commented 11 months ago

If it's alright with you I'm going to close this issue, since I think sharing documentation on how to use the cluster_kwargs dictionary for various compute environments was what you needed.

HomoPolyethylen commented 11 months ago

Thanks a lot! Using your suggested 'n_workers':4, 'memory_limit':0.2, the process does not seem to get killed by the OS anymore. However, there is still a memory issue. It seems the problem of memory allocation remains; now the nanny kills the workers instead of the OS killing the entire process...

stdout

``` Run ransac {'blob_sizes': [6, 20]} Fix spots: 8172 Moving spots: 6584 Found enough spots to estimate the affine fix: 1080 , moving: 1080 Run affine {'shrink_factors': (2,), 'smooth_sigmas': (2.5,), 'optimizer_args': {'learningRate': 0.25, 'minStep': 0.0, 'numberOfIterations': 400}} LEVEL: 0 ITERATION: 0 METRIC: -0.4581419604727837 LEVEL: 0 ITERATION: 1 METRIC: -0.46185173508006006 LEVEL: 0 ITERATION: 2 METRIC: -0.4650156818305996 LEVEL: 0 ITERATION: 3 METRIC: -0.4724639649044412 LEVEL: 0 ITERATION: 4 METRIC: -0.4746615064736247 LEVEL: 0 ITERATION: 5 METRIC: -0.47997929246244164 LEVEL: 0 ITERATION: 6 METRIC: -0.48162338752752715 LEVEL: 0 ITERATION: 7 METRIC: -0.48310909958720233 LEVEL: 0 ITERATION: 8 METRIC: -0.4837360926167621 LEVEL: 0 ITERATION: 9 METRIC: -0.4845070410365141 LEVEL: 0 ITERATION: 10 METRIC: -0.4853417438750474 LEVEL: 0 ITERATION: 11 METRIC: -0.4861468600349893 LEVEL: 0 ITERATION: 12 METRIC: -0.4869357008039241 LEVEL: 0 ITERATION: 13 METRIC: -0.48770793124780737 LEVEL: 0 ITERATION: 14 METRIC: -0.4884545858080386 LEVEL: 0 ITERATION: 15 METRIC: -0.48918206906538847 LEVEL: 0 ITERATION: 16 METRIC: -0.4899023849128928 LEVEL: 0 ITERATION: 17 METRIC: -0.4905931877795264 LEVEL: 0 ITERATION: 18 METRIC: -0.49127124891444063 LEVEL: 0 ITERATION: 19 METRIC: -0.4919326641657991 LEVEL: 0 ITERATION: 20 METRIC: -0.49257102820812987 LEVEL: 0 ITERATION: 21 METRIC: -0.4931839567895838 LEVEL: 0 ITERATION: 22 METRIC: -0.49378765987179096 LEVEL: 0 ITERATION: 23 METRIC: -0.49437435058566226 LEVEL: 0 ITERATION: 24 METRIC: -0.49494681942904467 LEVEL: 0 ITERATION: 25 METRIC: -0.4955013797582309 LEVEL: 0 ITERATION: 26 METRIC: -0.496047288487377 LEVEL: 0 ITERATION: 27 METRIC: -0.496568052816728 LEVEL: 0 ITERATION: 28 METRIC: -0.49706922275872795 LEVEL: 0 ITERATION: 29 METRIC: -0.4975690755774044 LEVEL: 0 ITERATION: 30 METRIC: -0.49806048889227034 LEVEL: 0 ITERATION: 31 METRIC: -0.4985324356268822 LEVEL: 0 ITERATION: 32 METRIC: -0.4989858475626363 LEVEL: 0 ITERATION: 33 METRIC: -0.4994252715046777 LEVEL: 0 ITERATION: 34 METRIC: -0.49986243995316687 LEVEL: 0 ITERATION: 35 METRIC: -0.5002695242474917 LEVEL: 0 ITERATION: 36 METRIC: -0.5006717598982761 LEVEL: 0 ITERATION: 37 METRIC: -0.5010587437426598 LEVEL: 0 ITERATION: 38 METRIC: -0.5014281548923906 LEVEL: 0 ITERATION: 39 METRIC: -0.5017930052510243 LEVEL: 0 ITERATION: 40 METRIC: -0.5021445185715844 LEVEL: 0 ITERATION: 41 METRIC: -0.5024947157532571 LEVEL: 0 ITERATION: 42 METRIC: -0.5028207059244102 LEVEL: 0 ITERATION: 43 METRIC: -0.5031372682895622 LEVEL: 0 ITERATION: 44 METRIC: -0.5034437579646844 LEVEL: 0 ITERATION: 45 METRIC: -0.503743909461261 LEVEL: 0 ITERATION: 46 METRIC: -0.5040461875086483 LEVEL: 0 ITERATION: 47 METRIC: -0.5043177381748524 LEVEL: 0 ITERATION: 48 METRIC: -0.5045853248171887 LEVEL: 0 ITERATION: 49 METRIC: -0.5048496952355569 LEVEL: 0 ITERATION: 50 METRIC: -0.5051156617514692 LEVEL: 0 ITERATION: 51 METRIC: -0.5053605919755874 LEVEL: 0 ITERATION: 52 METRIC: -0.5056060504656191 LEVEL: 0 ITERATION: 53 METRIC: -0.5058312688263044 LEVEL: 0 ITERATION: 54 METRIC: -0.5060474193135633 LEVEL: 0 ITERATION: 55 METRIC: -0.5062694684784214 LEVEL: 0 ITERATION: 56 METRIC: -0.5064751121519155 LEVEL: 0 ITERATION: 57 METRIC: -0.5066792230216223 LEVEL: 0 ITERATION: 58 METRIC: -0.5068751869693193 LEVEL: 0 ITERATION: 59 METRIC: -0.5070695888245187 LEVEL: 0 ITERATION: 60 METRIC: -0.5072388482968267 LEVEL: 0 ITERATION: 61 METRIC: -0.5074144847017321 LEVEL: 0 ITERATION: 62 METRIC: -0.5075198276353228 LEVEL: 0 ITERATION: 63 METRIC: -0.50768240549795 
LEVEL: 0 ITERATION: 64 METRIC: -0.5077227248678277 LEVEL: 0 ITERATION: 65 METRIC: -0.5078342072736535 LEVEL: 0 ITERATION: 66 METRIC: -0.5079034927879992 LEVEL: 0 ITERATION: 67 METRIC: -0.5079739747202662 LEVEL: 0 ITERATION: 68 METRIC: -0.5080430542376165 LEVEL: 0 ITERATION: 69 METRIC: -0.5081134203488384 LEVEL: 0 ITERATION: 70 METRIC: -0.5081774806039373 LEVEL: 0 ITERATION: 71 METRIC: -0.5082466558220115 LEVEL: 0 ITERATION: 72 METRIC: -0.5083135122591875 LEVEL: 0 ITERATION: 73 METRIC: -0.5083821598539106 LEVEL: 0 ITERATION: 74 METRIC: -0.5084394701495869 LEVEL: 0 ITERATION: 75 METRIC: -0.5084981685382055 LEVEL: 0 ITERATION: 76 METRIC: -0.5085490525356509 LEVEL: 0 ITERATION: 77 METRIC: -0.5086023035425143 LEVEL: 0 ITERATION: 78 METRIC: -0.5086574992694624 LEVEL: 0 ITERATION: 79 METRIC: -0.5087085839545603 LEVEL: 0 ITERATION: 80 METRIC: -0.5087576982524257 LEVEL: 0 ITERATION: 81 METRIC: -0.5088074869079925 LEVEL: 0 ITERATION: 82 METRIC: -0.5088526680721442 LEVEL: 0 ITERATION: 83 METRIC: -0.5089088951493157 LEVEL: 0 ITERATION: 84 METRIC: -0.5089517922795461 LEVEL: 0 ITERATION: 85 METRIC: -0.5090037517402162 LEVEL: 0 ITERATION: 86 METRIC: -0.5090403812749982 LEVEL: 0 ITERATION: 87 METRIC: -0.509084437510761 LEVEL: 0 ITERATION: 88 METRIC: -0.5091101356593934 LEVEL: 0 ITERATION: 89 METRIC: -0.5091576075825929 LEVEL: 0 ITERATION: 90 METRIC: -0.5091674538238428 LEVEL: 0 ITERATION: 91 METRIC: -0.5092142034516426 LEVEL: 0 ITERATION: 92 METRIC: -0.5092161895846403 LEVEL: 0 ITERATION: 93 METRIC: -0.5092692290968145 LEVEL: 0 ITERATION: 94 METRIC: -0.5092657332979338 LEVEL: 0 ITERATION: 95 METRIC: -0.5093147763734823 LEVEL: 0 ITERATION: 96 METRIC: -0.5093185071494335 LEVEL: 0 ITERATION: 97 METRIC: -0.5093294135759087 LEVEL: 0 ITERATION: 98 METRIC: -0.5093451798797414 LEVEL: 0 ITERATION: 99 METRIC: -0.5093668808403666 LEVEL: 0 ITERATION: 100 METRIC: -0.509383252021113 LEVEL: 0 ITERATION: 101 METRIC: -0.5093982431109587 LEVEL: 0 ITERATION: 102 METRIC: -0.5094087199070355 LEVEL: 0 ITERATION: 103 METRIC: -0.5094233868828695 LEVEL: 0 ITERATION: 104 METRIC: -0.5094362632815291 LEVEL: 0 ITERATION: 105 METRIC: -0.5094517777421605 LEVEL: 0 ITERATION: 106 METRIC: -0.5094609186328296 LEVEL: 0 ITERATION: 107 METRIC: -0.5094730087572727 LEVEL: 0 ITERATION: 108 METRIC: -0.5094845376039567 LEVEL: 0 ITERATION: 109 METRIC: -0.5094973190389112 LEVEL: 0 ITERATION: 110 METRIC: -0.5095113266061497 LEVEL: 0 ITERATION: 111 METRIC: -0.509524012397447 LEVEL: 0 ITERATION: 112 METRIC: -0.5095386844496351 LEVEL: 0 ITERATION: 113 METRIC: -0.5095482922903689 LEVEL: 0 ITERATION: 114 METRIC: -0.5095602501030922 LEVEL: 0 ITERATION: 115 METRIC: -0.5095674961472751 LEVEL: 0 ITERATION: 116 METRIC: -0.5095784583376415 LEVEL: 0 ITERATION: 117 METRIC: -0.5095833147552528 LEVEL: 0 ITERATION: 118 METRIC: -0.5095925119564692 LEVEL: 0 ITERATION: 119 METRIC: -0.5095976952619861 LEVEL: 0 ITERATION: 120 METRIC: -0.509606159238133 LEVEL: 0 ITERATION: 121 METRIC: -0.5096178264521586 LEVEL: 0 ITERATION: 122 METRIC: -0.5096245047607819 LEVEL: 0 ITERATION: 123 METRIC: -0.5096347203004734 LEVEL: 0 ITERATION: 124 METRIC: -0.5096431954787751 LEVEL: 0 ITERATION: 125 METRIC: -0.5096479335080613 LEVEL: 0 ITERATION: 126 METRIC: -0.5096586029186595 LEVEL: 0 ITERATION: 127 METRIC: -0.5096636256486163 LEVEL: 0 ITERATION: 128 METRIC: -0.5096711551089081 LEVEL: 0 ITERATION: 129 METRIC: -0.5096715998881329 LEVEL: 0 ITERATION: 130 METRIC: -0.5096807693991748 LEVEL: 0 ITERATION: 131 METRIC: -0.509680323123783 LEVEL: 0 ITERATION: 132 METRIC: -0.5096939514539075 LEVEL: 
0 ITERATION: 133 METRIC: -0.5096832004048439 LEVEL: 0 ITERATION: 134 METRIC: -0.5096930428913706 LEVEL: 0 ITERATION: 135 METRIC: -0.5096963343795077 LEVEL: 0 ITERATION: 136 METRIC: -0.5096985480554834 LEVEL: 0 ITERATION: 137 METRIC: -0.5097053285208764 LEVEL: 0 ITERATION: 138 METRIC: -0.5097062599052983 LEVEL: 0 ITERATION: 139 METRIC: -0.5097065341544831 LEVEL: 0 ITERATION: 140 METRIC: -0.5097091262876804 LEVEL: 0 ITERATION: 141 METRIC: -0.5097118115581087 LEVEL: 0 ITERATION: 142 METRIC: -0.5097138723508882 LEVEL: 0 ITERATION: 143 METRIC: -0.5097140698968139 LEVEL: 0 ITERATION: 144 METRIC: -0.5097168842510917 LEVEL: 0 ITERATION: 145 METRIC: -0.5097187543022861 LEVEL: 0 ITERATION: 146 METRIC: -0.5097178910183721 LEVEL: 0 ITERATION: 147 METRIC: -0.509717787232802 LEVEL: 0 ITERATION: 148 METRIC: -0.509721114706433 LEVEL: 0 ITERATION: 149 METRIC: -0.5097219704117902 LEVEL: 0 ITERATION: 150 METRIC: -0.5097228688115688 LEVEL: 0 ITERATION: 151 METRIC: -0.5097229399058182 LEVEL: 0 ITERATION: 152 METRIC: -0.5097223881093069 LEVEL: 0 ITERATION: 153 METRIC: -0.5097235058119421 LEVEL: 0 ITERATION: 154 METRIC: -0.5097244887067044 LEVEL: 0 ITERATION: 155 METRIC: -0.5097259018461415 LEVEL: 0 ITERATION: 156 METRIC: -0.5097241831451017 LEVEL: 0 ITERATION: 157 METRIC: -0.5097223707268944 LEVEL: 0 ITERATION: 158 METRIC: -0.5097222373359249 LEVEL: 0 ITERATION: 159 METRIC: -0.5097220499745916 LEVEL: 0 ITERATION: 160 METRIC: -0.509722159214666 LEVEL: 0 ITERATION: 161 METRIC: -0.5097215664996297 LEVEL: 0 ITERATION: 162 METRIC: -0.5097227202824104 LEVEL: 0 ITERATION: 163 METRIC: -0.5097227667023613 LEVEL: 0 ITERATION: 164 METRIC: -0.5097228943788019 LEVEL: 0 ITERATION: 165 METRIC: -0.5097234897438061 LEVEL: 0 ITERATION: 166 METRIC: -0.5097231687651836 LEVEL: 0 ITERATION: 167 METRIC: -0.5097243369825106 LEVEL: 0 ITERATION: 168 METRIC: -0.5097235911748609 LEVEL: 0 ITERATION: 169 METRIC: -0.5097204809735754 LEVEL: 0 ITERATION: 170 METRIC: -0.509722428591006 LEVEL: 0 ITERATION: 171 METRIC: -0.5097209717667393 LEVEL: 0 ITERATION: 172 METRIC: -0.5097226721919533 LEVEL: 0 ITERATION: 173 METRIC: -0.5097165765864129 LEVEL: 0 ITERATION: 174 METRIC: -0.509721221019394 LEVEL: 0 ITERATION: 175 METRIC: -0.5097211618658549 LEVEL: 0 ITERATION: 176 METRIC: -0.5097204024095183 LEVEL: 0 ITERATION: 177 METRIC: -0.5097211315653398 LEVEL: 0 ITERATION: 178 METRIC: -0.5097208009936721 LEVEL: 0 ITERATION: 179 METRIC: -0.5097187495403702 LEVEL: 0 ITERATION: 180 METRIC: -0.5097178978806756 LEVEL: 0 ITERATION: 181 METRIC: -0.5097182427254202 LEVEL: 0 ITERATION: 182 METRIC: -0.5097184599600053 LEVEL: 0 ITERATION: 183 METRIC: -0.5097191467443036 LEVEL: 0 ITERATION: 184 METRIC: -0.5097183101747074 LEVEL: 0 ITERATION: 185 METRIC: -0.5097175381184262 LEVEL: 0 ITERATION: 186 METRIC: -0.5097182576723988 LEVEL: 0 ITERATION: 187 METRIC: -0.5097178421670088 LEVEL: 0 ITERATION: 188 METRIC: -0.5097172999477873 LEVEL: 0 ITERATION: 189 METRIC: -0.5097183468666006 LEVEL: 0 ITERATION: 190 METRIC: -0.50971923724084 LEVEL: 0 ITERATION: 191 METRIC: -0.5097178872277622 LEVEL: 0 ITERATION: 192 METRIC: -0.5097164868539895 LEVEL: 0 ITERATION: 193 METRIC: -0.5097159221830665 LEVEL: 0 ITERATION: 194 METRIC: -0.5097154886871554 LEVEL: 0 ITERATION: 195 METRIC: -0.509713193716958 LEVEL: 0 ITERATION: 196 METRIC: -0.5097122259648657 LEVEL: 0 ITERATION: 197 METRIC: -0.5097114356051958 LEVEL: 0 ITERATION: 198 METRIC: -0.509713728837721 LEVEL: 0 ITERATION: 199 METRIC: -0.5097123956242398 LEVEL: 0 ITERATION: 200 METRIC: -0.5097128240723408 LEVEL: 0 ITERATION: 201 METRIC: 
-0.5097108193696624 LEVEL: 0 ITERATION: 202 METRIC: -0.509710368410682 LEVEL: 0 ITERATION: 203 METRIC: -0.5097091131155792 LEVEL: 0 ITERATION: 204 METRIC: -0.5097076479044138 LEVEL: 0 ITERATION: 205 METRIC: -0.5097093343482894 LEVEL: 0 ITERATION: 206 METRIC: -0.5097087002993401 LEVEL: 0 ITERATION: 207 METRIC: -0.5097083009614303 LEVEL: 0 ITERATION: 208 METRIC: -0.5097071918920788 LEVEL: 0 ITERATION: 209 METRIC: -0.5097067854194948 LEVEL: 0 ITERATION: 210 METRIC: -0.5097057318179653 LEVEL: 0 ITERATION: 211 METRIC: -0.5097081717519585 LEVEL: 0 ITERATION: 212 METRIC: -0.509707855748597 LEVEL: 0 ITERATION: 213 METRIC: -0.5097067545324598 LEVEL: 0 ITERATION: 214 METRIC: -0.5097062380112495 LEVEL: 0 ITERATION: 215 METRIC: -0.5097056528931018 LEVEL: 0 ITERATION: 216 METRIC: -0.5097065519091667 LEVEL: 0 ITERATION: 217 METRIC: -0.5097060156286227 LEVEL: 0 ITERATION: 218 METRIC: -0.5097055661613286 LEVEL: 0 ITERATION: 219 METRIC: -0.5097051993482833 Registration succeeded Block index: (0, 0, 0) Slices: (slice(0, 192, None), slice(0, 192, None), slice(0, 192, None)) Block index: (0, 0, 4) Slices: (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None)) Block index: (0, 0, 1) Slices: (slice(0, 192, None), slice(0, 192, None), slice(64, 320, None)) Block index: (0, 0, 5) Slices: Block index: (slice(0, 192, None), slice(0, 192, None), slice(576, 832, None)) (0, 0, 3) Slices: (slice(0, 192, None), slice(0, 192, None), slice(320, 576, None)) Block index: Block index: (0, 0, 2) Slices: (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None)) (0, 0, 6)Block index: (0, 0, 7) Slices: (slice(0, 192, None), slice(0, 192, None), slice(832, 913, None)) Slices: (slice(0, 192, None), slice(0, 192, None), slice(704, 913, None)) Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} ```

stderr

``` 2023-07-25 11:20:28,910 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:28,912 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:29,854 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 2.69 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:29,855 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.69 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:32,700 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:33033 (pid=130745) exceeded 95% memory budget. Restarting... 2023-07-25 11:20:32,827 - distributed.nanny - WARNING - Restarting worker 2023-07-25 11:20:33,250 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:32769 (pid=130743) exceeded 95% memory budget. Restarting... 2023-07-25 11:20:33,369 - distributed.nanny - WARNING - Restarting worker Block index: (0, 1, 0) Slices: (slice(0, 192, None), slice(64, 320, None), slice(0, 192, None)) Block index: (0, 0, 7) Slices: (slice(0, 192, None), slice(0, 192, None), slice(832, 913, None)) Block index: (0, 0, 4)Block index: (0, 0, 2) Slices: (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None)) Slices: (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None)) Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} 2023-07-25 11:20:47,535 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:47,536 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:20:49,502 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:37465 (pid=130985) exceeded 95% memory budget. Restarting... 2023-07-25 11:20:49,577 - distributed.nanny - WARNING - Restarting worker Block index: (0, 0, 4) Slices: (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None)) Block index: (0, 0, 2) Slices: (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None)) Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} 2023-07-25 11:21:03,030 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:21:03,031 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. 
-- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:21:05,000 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:44825 (pid=131141) exceeded 95% memory budget. Restarting... 2023-07-25 11:21:05,083 - distributed.nanny - WARNING - Restarting worker Block index: (0, 0, 4) Slices: (slice(0, 192, None), slice(0, 192, None), slice(448, 704, None)) Block index: (0, 0, 2) Slices: (slice(0, 192, None), slice(0, 192, None), slice(192, 448, None)) Run ransac {'blob_sizes': [6, 20]} Run ransac {'blob_sizes': [6, 20]} Fix spots: 396 Moving spots: 176 Fewer than 50 spots found in fixed image, returning default 46 Run deform {'smooth_sigmas': (0.25,), 'control_point_spacing': 50.0, 'control_point_levels': (1,), 'optimizer_args': {'learningRate': 0.25, 'minStep': 0.0, 'numberOfIterations': 25}} 2023-07-25 11:21:22,724 - distributed.worker.memory - WARNING - Worker is at 90% memory usage. Pausing worker. Process memory: 2.68 GiB -- Worker memory limit: 2.97 GiB 2023-07-25 11:21:22,727 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 2.68 GiB -- Worker memory limit: 2.97 GiB LEVEL: 0 ITERATION: 0 METRIC: -0.3962209122288713 2023-07-25 11:21:29,200 - distributed.nanny.memory - WARNING - Worker tcp://172.17.145.68:41689 (pid=131245) exceeded 95% memory budget. Restarting... 2023-07-25 11:21:29,470 - distributed.nanny - WARNING - Restarting worker 2023-07-25 11:21:32,792 - distributed.nanny - WARNING - Worker process still alive after 3.1999981689453127 seconds, killing 2023-07-25 11:21:32,793 - distributed.nanny - WARNING - Worker process still alive after 3.199998474121094 seconds, killing ```

error message

``` --------------------------------------------------------------------------- KilledWorker Traceback (most recent call last) Cell In[3], line 10 4 cluster_kwargs = { 5 'n_workers':4, 6 'memory_limit':0.2, 7 } 9 # run the pipeline ---> 10 affine, deform, aligned = easifish_registration_pipeline( 11 fix_lowres, fix_highres, mov_lowres, mov_highres, 12 fix_lowres_spacing, fix_highres_spacing, 13 mov_lowres_spacing, mov_highres_spacing, 14 blocksize=[128,]*3, 15 write_directory='[./](https://file+.vscode-resource.vscode-cdn.net/home/casimir/Documents/Uni/Bioinformatik_M.sc/4.Semester_SS23/HiWi/tmp/)', 16 cluster_kwargs=cluster_kwargs 17 ) 19 # the affine and deform are already saved to disk, but we also want to view the aligned 20 # result to make sure it worked. 21 # reformat the aligned data to open in fiji (or similar) - again this works for tutorial data 22 # but you would do this differently for actually larger-than-memory data 23 tifffile.imsave('[./aligned.tiff](https://file+.vscode-resource.vscode-cdn.net/home/casimir/Documents/Uni/Bioinformatik_M.sc/4.Semester_SS23/HiWi/tmp/aligned.tiff)', aligned[...]) File [~/Documents/Arbeit/2023_HiWi_QBiC/bigstream/bigstream/application_pipelines.py:232](https://file+.vscode-resource.vscode-cdn.net/home/casimir/Documents/Uni/Bioinformatik_M.sc/4.Semester_SS23/HiWi/tmp/~/Documents/Arbeit/2023_HiWi_QBiC/bigstream/bigstream/application_pipelines.py:232), in easifish_registration_pipeline(fix_lowres, fix_highres, mov_lowres, mov_highres, fix_lowres_spacing, fix_highres_spacing, mov_lowres_spacing, mov_highres_spacing, blocksize, write_directory, global_ransac_kwargs, global_affine_kwargs, local_ransac_kwargs, local_deform_kwargs, cluster_kwargs, cluster) 229 if cluster is None: 230 #with cluster_constructor(**cluster_kwargs) as cluster: #NOTE: removed **{**c, **cluster_kwargs} ... -> 2231 raise exception.with_traceback(traceback) 2232 raise exc 2233 if errors == "skip": KilledWorker: Attempted to run task align_single_block-c433c39107c7a38ec826bbb66882de03 on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://172.17.145.68:35047. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html. ```

Any idea in what direction I should proceed?

GFleishman commented 11 months ago

Yes you are correct, the nanny and the dask scheduler together are causing this job to fail.

First I notice that your machine has only 16GB of RAM. This should actually be fine for the test datasets once we get the configuration right, but it's honestly quite small for processing any real data. We'll need to really constrain the parallelism to get the work to fit into this machine, and once you scale to real big datasets you'll find they take a very long time. More RAM would really help you parallelize jobs better, so access to a bigger workstation or a cluster might be important depending on what kind of data you eventually intend to process.

So each worker is getting 0.2 * 16GB ~= 3GB of RAM. Note that dask reports everything in gibibytes (GiB), not gigabytes, so its numbers look a little different. The dask scheduler is also trying to submit more than one block to a worker at a time: it thinks that a single worker can solve multiple block alignments in parallel. Because of the small amount of RAM and the multiple tasks, the workers are exceeding their memory limits. The nanny notices this and pauses, then shuts down the workers. Once this happens three times the dask scheduler decides something is wrong and shuts the whole cluster down.
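A quick sketch of that arithmetic, which roughly matches the "Worker memory limit: 2.97 GiB" in your stderr:

total_ram = 16 * 10**9          # 16 GB in bytes (decimal gigabytes)
per_worker = 0.2 * total_ram    # memory_limit = 0.2 of the total
print(per_worker / 2**30)       # ~2.98 GiB per worker, as dask reports it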

So we need to make sure that each worker only tries to process one job at a time. My first suggestion is to add 'threads_per_worker':1 to your cluster_kwargs. Theoretically this should cause dask to give each worker only a single thread, and with only one thread it can only execute one task at a time. The tasks themselves (the alignment algorithms that will run) are multithreaded, so they will still use all the resources the worker can provide - you're not losing any parallelism here, you're just preventing dask from overloading the workers.
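As a sketch, that first suggestion combined with the settings you are already using would be:

cluster_kwargs = {
    'n_workers': 4,
    'memory_limit': 0.2,
    'threads_per_worker': 1,    # one dask thread, so one block-alignment task at a time
}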

However, I have noticed in the past, when doing similar things in a different computing environment, that telling dask to submit only one thread per worker does not always prevent it from trying to execute multiple tasks on that worker. So if the first suggestion above does not work, try this second suggestion. Add this to cluster_kwargs:

'threads_per_worker':1,
'config':{
        'distributed.worker.memory.target':0.9,
        'distributed.worker.memory.spill':0.9,
        'distributed.worker.memory.pause':0.9,
        'distributed.scheduler.worker-saturation':0.5,
}

Finally if this does not work I have one last suggestion, which will also require adding something to the source code. Add this to cluster_kwargs:

'threads_per_worker':1,
'config':{
        'distributed.worker.memory.target':0.9,
        'distributed.worker.memory.spill':0.9,
        'distributed.worker.memory.pause':0.9,
        'distributed.scheduler.worker-saturation':0.5,
        'distributed.worker.resources.concurrency':1,
}

Now to edit some source code. In the bigstream repository find the file: bigstream/piecewise_align.py and look at line 358. You should see:

futures = cluster.client.map(
    align_single_block, indices,
    static_transform_list=static_transform_list,
)

Change this to the following:

futures = cluster.client.map(
    align_single_block, indices,
    static_transform_list=static_transform_list,
    resources={'concurrency':1},
)

Save that change. If you previously installed bigstream using pip install -e ./ then this change should be available already. Be sure to restart the kernel in your Jupyter notebook after making this source code edit for it to take effect.

I haven't tried these tutorial datasets on a small machine like you are trying, so I hope you'll forgive the extra work required to get them working in that context. It's a good learning experience for me and will help the package be more accommodating for future users as well. Let me know how these changes affect things. Ideally the first suggestion will just fix it, but if not try the others and we'll figure out a way to make it work.

GFleishman commented 11 months ago

@HomoPolyethylen Just checking if you've had a chance to try out the suggestions above.

GFleishman commented 6 months ago

Since all of the solutions to "small machine memory issues" are presented in this thread I'm going to close for now - but I'm willing to reopen if OP needs more help in the future.