SpiNNakerManchester / PACMAN

Partition and Configuration Manager for SpiNNaker
Apache License 2.0

Partitioner taking a long time. #193

Closed Christian-B closed 5 years ago

Christian-B commented 5 years ago

May be related to: https://github.com/SpiNNakerManchester/PACMAN/issues/190

Using a very big (40,001 core) but simple network: Time 0:16:01.598417 taken by PartitionAndPlacePartitioner

Script:

import spynnaker8 as sim
from spynnaker8.utilities import neo_convertor

n_neurons = 10000
n_populations = 200
weights = 5
delays = 17.0
simtime = 50000

sim.setup(timestep=1.0, min_delay=1.0, max_delay=144.0)
# Limit each core to 100 neurons, so every population spans 100 cores.
sim.set_number_of_neurons_per_core(sim.IF_curr_exp, 100)

# A single spike at t=0 to start the chain.
spikeArray = {'spike_times': [[0]]}
stimulus = sim.Population(1, sim.SpikeSourceArray, spikeArray, label='stimulus')

chain_pops = [
    sim.Population(n_neurons, sim.IF_curr_exp, {}, label='chain{}'.format(i))
    for i in range(n_populations)
]
for pop in chain_pops:
    pop.record("all")

# Connect the populations one-to-one in a ring.
connector = sim.OneToOneConnector()
for i in range(n_populations):
    sim.Projection(
        chain_pops[i], chain_pops[(i + 1) % n_populations], connector,
        synapse_type=sim.StaticSynapse(weight=weights, delay=delays))

sim.Projection(stimulus, chain_pops[0], sim.AllToAllConnector(),
               synapse_type=sim.StaticSynapse(weight=5.0))

sim.run(simtime)

Christian-B commented 5 years ago

Note that the same script also dies later:

2018-12-19 13:02:15 ERROR: Error when calling pacman.operations.fixed_route_router.fixed_route_router.FixedRouteRouter.__call__ with inputs dict_keys(['board_version', 'placements', 'destination_class', 'machine'])
Traceback (most recent call last):
  File "/home/brenninc/spinnaker/IntroLab/synfire/synfire.py", line 39, in <module>
    sim.run(simtime)
  File "/home/brenninc/spinnaker/sPyNNaker8/spynnaker8/__init__.py", line 618, in run
    return pynn["run"](simtime, callbacks=callbacks)
  File "/home/brenninc/3.5_pynn0.9/lib/python3.5/site-packages/pyNN/common/control.py", line 111, in run
    return run_until(simulator.state.t + simtime, callbacks)
  File "/home/brenninc/3.5_pynn0.9/lib/python3.5/site-packages/pyNN/common/control.py", line 93, in run_until
    simulator.state.run_until(time_point)
  File "/home/brenninc/spinnaker/sPyNNaker8/spynnaker8/spinnaker.py", line 123, in run_until
    self._run_wait(tstop - self.t)
  File "/home/brenninc/spinnaker/sPyNNaker8/spynnaker8/spinnaker.py", line 166, in _run_wait
    super(SpiNNaker, self).run(duration_ms)
  File "/home/brenninc/spinnaker/sPyNNaker/spynnaker/pyNN/abstract_spinnaker_common.py", line 317, in run
    super(AbstractSpiNNakerCommon, self).run(run_time)
  File "/home/brenninc/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 817, in run
    self._run(run_time)
  File "/home/brenninc/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 938, in _run
    self._do_mapping(run_time, n_machine_time_steps, total_run_time)
  File "/home/brenninc/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1716, in _do_mapping
    optional_algorithms)
  File "/home/brenninc/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1212, in _run_algorithms
    reraise(*exc_info)
  File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise
    raise value
  File "/home/brenninc/spinnaker/SpiNNFrontEndCommon/spinn_front_end_common/interface/abstract_spinnaker_base.py", line 1197, in _run_algorithms
    executor.execute_mapping()
  File "/home/brenninc/spinnaker/PACMAN/pacman/executor/pacman_algorithm_executor.py", line 623, in execute_mapping
    self._execute_mapping()
  File "/home/brenninc/spinnaker/PACMAN/pacman/executor/pacman_algorithm_executor.py", line 639, in _execute_mapping
    results = algorithm.call(self._internal_type_mapping)
  File "/home/brenninc/spinnaker/PACMAN/pacman/executor/algorithm_classes/abstract_python_algorithm.py", line 45, in call
    results = self.call_python(method_inputs)
  File "/home/brenninc/spinnaker/PACMAN/pacman/executor/algorithm_classes/python_class_algorithm.py", line 56, in call_python
    return method(**inputs)
  File "/home/brenninc/spinnaker/PACMAN/pacman/operations/fixed_route_router/fixed_route_router.py", line 109, in __call__
    destination_class, machine, board_version)
  File "/home/brenninc/spinnaker/PACMAN/pacman/operations/fixed_route_router/fixed_route_router.py", line 202, in _do_dynamic_routing
    machine_graph=graph, use_progress_bar=False)
  File "/home/brenninc/spinnaker/PACMAN/pacman/operations/router_algorithms/basic_dijkstra_routing.py", line 98, in __call__
    nodes_info, tables)
  File "/home/brenninc/spinnaker/PACMAN/pacman/operations/router_algorithms/basic_dijkstra_routing.py", line 120, in _route
    tables[placement.x, placement.y].activated = True
KeyError: (213, 97)

Christian-B commented 5 years ago

Note there appears to be a growing cost in the partitioner, as the exact same script reduced by a factor of 10 (n_populations = 20) takes:

2018-12-19 13:07:44 INFO: Time 0:00:24.960147 taken by PartitionAndPlacePartitioner

rowleya commented 5 years ago

Doing some tests - the partitioning of edges seems to be taking a while (population partitioning was reasonable)...

rowleya commented 5 years ago

It is worth noting that this example results in over 40,000 cores being used. Each application vertex has 10,000 atoms and gets split into machine vertices of 100 neurons each, so there are 100 machine vertices for every application vertex. Additionally, each of these has a delay extension. During partitioning, we are not allowed to remove edges, so a single edge going from 10,000 atoms to 10,000 atoms becomes an edge from each of the 100 pre-vertices to each of the 100 post-vertices, so 10,000 edges per pair of populations.
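The counts above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope sketch: the helper name `machine_vertices_for` is made up for illustration, and it assumes one core per machine vertex and exactly one delay extension per machine vertex. Those assumptions match the 40,001 cores reported but are not taken from the PACMAN code.

```python
import math


def machine_vertices_for(n_atoms, atoms_per_core):
    """Number of machine vertices needed to hold n_atoms (hypothetical helper)."""
    return math.ceil(n_atoms / atoms_per_core)


# Values taken from the script in this issue.
n_neurons = 10000
neurons_per_core = 100
n_populations = 200

per_population = machine_vertices_for(n_neurons, neurons_per_core)  # 100
neuron_cores = n_populations * per_population                       # 20,000

# Assumption: one delay extension per neuron machine vertex.
delay_cores = neuron_cores                                          # 20,000
stimulus_cores = 1

total_cores = neuron_cores + delay_cores + stimulus_cores           # 40,001

# Every pre machine vertex keeps an edge to every post machine vertex.
edges_per_projection = per_population * per_population              # 10,000
# Chain projections only; delay-extension and stimulus edges are not counted.
chain_edges = n_populations * edges_per_projection                  # 2,000,000

print(total_cores, edges_per_projection, chain_edges)
```

Under these assumptions the 200 chain projections alone account for two million machine edges, even though the one-to-one connector only needs 100 of the 10,000 edges in each projection to carry spikes.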

I think that this is therefore not that surprising overall. This is one of the issues we have to deal with however; a very simple network description can explode into a very large SpiNNaker network with very little effort.

rowleya commented 5 years ago

Note also that your key error could be a sign of something that needs blacklisting in the machine. It is worth keeping an eye on which board you got when you got this error, and seeing if it reoccurs in future simulations at the same place.
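The KeyError in the traceback means the router looked up a chip coordinate that was not present in its `tables` mapping. A generic way to spot such chips ahead of routing is sketched below with plain Python containers. The function name and the sample data are hypothetical, and the sketch deliberately does not use the real PACMAN placement or machine classes.

```python
def chips_without_tables(placement_chips, tables):
    """Return sorted (x, y) chips that host a placement but have no table entry."""
    return sorted({chip for chip in placement_chips if chip not in tables})


# Hypothetical data: three placements, but only two chips known to the router.
placement_chips = [(0, 0), (0, 1), (213, 97)]
tables = {(0, 0): {}, (0, 1): {}}

missing = chips_without_tables(placement_chips, tables)
print(missing)
```

Logging the result together with the allocated board on each run would make it easier to see whether the same chip keeps failing.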

Christian-B commented 5 years ago

Yes, I fully agree that this is a LARGE job, using 40,001 cores excluding special ones.

The interesting factor here is that a 10-fold increase in cores results in a roughly 40-fold increase in time.

rowleya commented 5 years ago

This is true, though I would need to think about it a bit more. It may be that the connectivity is more than 10 times increased with 10 times as many cores... It isn't clear that this is the case here though...

rowleya commented 5 years ago

In my own test, I found that with 200 populations, it took 2 minutes 52 seconds to partition, or 172 seconds. With 20 populations, it took 22 seconds. So that looks like a less than 10 times increase to me.
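The two sets of timings in this thread can be compared directly. The sketch below converts the logged times to seconds and estimates an exponent k for time ~ size^k from two data points each, which is a crude fit at best. The helper names `to_seconds` and `scaling` are made up for illustration.

```python
import math


def to_seconds(elapsed):
    """Convert an 'H:MM:SS.ffffff' string, as printed in the logs, to seconds."""
    hours, minutes, seconds = elapsed.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def scaling(small_s, large_s, size_factor=10):
    """Return (slowdown, k) where k solves slowdown == size_factor ** k."""
    slowdown = large_s / small_s
    return slowdown, math.log(slowdown) / math.log(size_factor)


# Times reported in the issue for 20 and 200 populations.
reported = scaling(to_seconds("0:00:24.960147"), to_seconds("0:16:01.598417"))
# Times from the retest: 22 seconds and 172 seconds.
retest = scaling(22.0, 172.0)

print("reported: x{:.1f}, exponent {:.2f}".format(*reported))
print("retest:   x{:.1f}, exponent {:.2f}".format(*retest))
```

The reported run gives an exponent above 1 (worse than linear), while the retest gives an exponent below 1, so the two machines disagree about how the partitioner scales.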

rowleya commented 5 years ago

Partitioning does result in 5.5GB of RAM being used by the python process once I run it however. I think your system has at least 16GB RAM so I would have expected it to cope with this without caching, but it could be some sort of system thrashing that causes your run to take 16 minutes.
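To check whether memory pressure explains the difference between the two machines, peak memory could be logged around the slow step. The sketch below uses the standard `resource` module, which is Unix only. `peak_rss_mb` is a hypothetical helper, and the list comprehension merely stands in for a memory-heavy step such as the partitioning triggered by `sim.run`.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Stand-in for a memory-heavy step.
data = [list(range(1000)) for _ in range(1000)]
after = peak_rss_mb()

print("peak RSS: {:.1f} MB -> {:.1f} MB".format(before, after))
```

If the peak approaches the physical RAM of the machine, swapping becomes a plausible explanation for a much longer run time.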

Christian-B commented 5 years ago

Rowley does not consider this an issue so closing.