camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Null pointers in Whare-Map and Octopus #35

Closed ghost closed 8 years ago

ghost commented 8 years ago

I am getting null pointer assertion failures when trying to use two of the cost models: whare-map and octopus. The errors look like this:

$ build/engine/coordinator --task_lib_dir=$(pwd)/build/engine/ --listen_uri=tcp:10.0.1.101:55556 --scheduler flow --flow_scheduling_cost_model 6
rm: cannot remove ‘/tmp/firmament-debug/*’: No such file or directory
F1129 15:38:31.345584  1278 octopus_cost_model.cc:191] Check failed: 'rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fe3a3c9fdaa  (unknown)
    @     0x7fe3a3c9fce4  (unknown)
    @     0x7fe3a3c9f6e6  (unknown)
    @     0x7fe3a3ca2687  (unknown)
    @           0x613a9f  firmament::OctopusCostModel::GatherStats()
    @           0x643bd3  firmament::FlowGraphManager::ComputeTopologyStatistics()
    @           0x611a19  firmament::scheduler::FlowScheduler::UpdateCostModelResourceStats()
    @           0x612769  firmament::scheduler::FlowScheduler::RegisterResource()
    @           0x590f71  firmament::Coordinator::AddResource()
    @           0x6a4b05  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5909d1  firmament::Coordinator::DetectLocalResources()
    @           0x5912d2  firmament::Coordinator::Run()
    @           0x551b95  main
    @     0x7fe3a0b15ec5  (unknown)
    @           0x55179d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

and

$ build/engine/coordinator --task_lib_dir=$(pwd)/build/engine/ --listen_uri=tcp:10.0.1.101:55556 --scheduler flow --flow_scheduling_cost_model 4
rm: cannot remove ‘/tmp/firmament-debug/*’: No such file or directory
F1129 15:40:05.337487  1299 wharemap_cost_model.cc:760] Check failed: 'rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fd8f6eeadaa  (unknown)
    @     0x7fd8f6eeace4  (unknown)
    @     0x7fd8f6eea6e6  (unknown)
    @     0x7fd8f6eed687  (unknown)
    @           0x62ada9  firmament::WhareMapCostModel::GatherStats()
    @           0x643bd3  firmament::FlowGraphManager::ComputeTopologyStatistics()
    @           0x611a19  firmament::scheduler::FlowScheduler::UpdateCostModelResourceStats()
    @           0x612769  firmament::scheduler::FlowScheduler::RegisterResource()
    @           0x590f71  firmament::Coordinator::AddResource()
    @           0x6a4b05  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5909d1  firmament::Coordinator::DetectLocalResources()
    @           0x5912d2  firmament::Coordinator::Run()
    @           0x551b95  main
    @     0x7fd8f3d60ec5  (unknown)
    @           0x55179d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

The other cost models (that are marked as complete in README.md) work fine.

ghost commented 8 years ago

Here is more verbose output of the error.

I added instrumentation to FindPtrOrNull(), and the problem is that the key accumulator->resource_id_ is not present in the map (rather than the key being present with a null value). But I don't yet know enough about how the map gets populated to fix this myself.
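For readers following along, the distinction between "key absent" and "key present with a null value" can be sketched roughly as below. This is only an illustration: `ResourceStatus`, `Lookup`, `ClassifyLookup` and `MustFind` are hypothetical names invented here, and the code is not the contents of map-util.h or of either cost model.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the per-resource record a cost model looks up.
struct ResourceStatus {
  int placeholder = 0;
};

// The three outcomes a pointer-valued map lookup can have.
enum class Lookup { kKeyMissing, kValueNull, kFound };

// Illustrative helper (an assumption, not the real FindPtrOrNull): reports
// why a lookup would hand back NULL instead of collapsing both cases.
template <typename Map>
Lookup ClassifyLookup(const Map& m, const typename Map::key_type& key) {
  auto it = m.find(key);
  if (it == m.end()) return Lookup::kKeyMissing;    // never inserted
  if (it->second == nullptr) return Lookup::kValueNull;  // inserted as NULL
  return Lookup::kFound;
}

// Roughly what a "must be non NULL" check enforces: abort unless the
// entry exists and is usable.
template <typename Map>
typename Map::mapped_type MustFind(const Map& m,
                                   const typename Map::key_type& key) {
  assert(ClassifyLookup(m, key) == Lookup::kFound &&
         "key missing or value null");
  return m.find(key)->second;
}
```

In the verbose log below, the last lookup before the check failure reports "key not found!", which corresponds to the `kKeyMissing` case in this sketch.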

 (master *%=) mike@docker1:~/firmament.io/firmament$ build/engine/coordinator --task_lib_dir=$(pwd)/build/engine/ --listen_uri=tcp:10.0.1.101:55556 --scheduler flow --flow_scheduling_cost_model 6 -v 100 -stderrthreshold -1 -logbuflevel -1
I1129 15:44:08.355933  1719 coordinator_main.cc:28] Firmament coordinator starting (Platform: 1) ...
I1129 15:44:08.356878  1719 utils.cc:96] Seeding resource ID RNG with 4438399909416742461 from seed docker1/tcp:10.0.1.101:55556
I1129 15:44:08.357102  1719 node.cc:47] Node's adapter is at 0x61400000f840
I1129 15:44:08.357251  1719 signal_handler.cc:12] Signal handler set up, ready to add signals.
I1129 15:44:08.359493  1719 topology_manager.cc:20] Topology manager initialized.
I1129 15:44:08.359665  1719 topology_manager.cc:72] Analyzing machine topology...
I1129 15:44:08.371872  1719 simple_object_store.cc:23] Constructing simple object store
I1129 15:44:08.373126  1719 coordinator.cc:89] Using Quincy-style min cost flow-based scheduler.
I1129 15:44:08.373502  1719 event_driven_scheduler.cc:57] EventDrivenScheduler initiated.
I1129 15:44:08.373844  1719 flow_scheduler.cc:63] Set cost model to use in flow graph to "6"
I1129 15:44:08.374166  1719 octopus_cost_model.cc:24] Cluster aggregator EC is 401933894075308583
I1129 15:44:08.374503  1719 flow_scheduler.cc:97] Using the octopus cost model
I1129 15:44:08.375030  1719 pb_utils.cc:87] BFSTraversal of resource topology, reached c2a50435-1b12-4ed0-946b-e8f907901452, invoking callback [1]
I1129 15:44:08.375404  1719 flow_graph_manager.cc:1146] Considering resource c2a50435-1b12-4ed0-946b-e8f907901452, which is 0
I1129 15:44:08.375720  1719 flow_graph_manager.cc:1185] Adding new resource c2a50435-1b12-4ed0-946b-e8f907901452 to flow graph.
I1129 15:44:08.376168  1719 flow_graph_manager.cc:385] Adding node for resource c2a50435-1b12-4ed0-946b-e8f907901452 (Coordinator on docker1), type RESOURCE_COORDINATOR
I1129 15:44:08.377370  1719 flow_graph_manager.cc:1200] Updated resource topology in flow scheduler.
I1129 15:44:08.377668  1719 flow_scheduler.cc:449] Updating resource statistics in flow graph
rm: cannot remove ‘/tmp/firmament-debug/*’: No such file or directory
I1129 15:44:08.383509  1719 coordinator.cc:101] Coordinator starting on host tcp:10.0.1.101:55556, platform 1, uuid c2a50435-1b12-4ed0-946b-e8f907901452
I1129 15:44:08.384131  1719 coordinator.cc:210] Detecting resource topology:
I1129 15:44:08.384269  1719 topology_manager.cc:185] *** LEVEL: 0
I1129 15:44:08.384438  1719 topology_manager.cc:190] Index: 0: Machine#0(2002MB)
I1129 15:44:08.384541  1719 topology_manager.cc:185] *** LEVEL: 1
I1129 15:44:08.384666  1719 topology_manager.cc:190] Index: 0: Socket#0
I1129 15:44:08.384899  1719 topology_manager.cc:185] *** LEVEL: 2
I1129 15:44:08.385016  1719 topology_manager.cc:190] Index: 0: L2d(6144KB)
I1129 15:44:08.385092  1719 topology_manager.cc:185] *** LEVEL: 3
I1129 15:44:08.385180  1719 topology_manager.cc:190] Index: 0: L1d(32KB)
I1129 15:44:08.385242  1719 topology_manager.cc:190] Index: 1: L1d(32KB)
I1129 15:44:08.385329  1719 topology_manager.cc:185] *** LEVEL: 4
I1129 15:44:08.385390  1719 topology_manager.cc:190] Index: 0: Core#0
I1129 15:44:08.385476  1719 topology_manager.cc:190] Index: 1: Core#1
I1129 15:44:08.385535  1719 topology_manager.cc:185] *** LEVEL: 5
I1129 15:44:08.385627  1719 topology_manager.cc:190] Index: 0: PU#0
I1129 15:44:08.385689  1719 topology_manager.cc:190] Index: 1: PU#1
I1129 15:44:08.385800  1719 coordinator.cc:152] Found 2 local PUs.
I1129 15:44:08.385895  1719 coordinator.cc:153] Resource URI is tcp:10.0.1.101:55556
I1129 15:44:08.386685  1719 topology_manager.cc:30] resource_desc {
  uuid: "a9605cf6-f45c-4b50-95d0-318637033ffc"
  friendly_name: "Machine #0(2002MB)"
  type: RESOURCE_MACHINE
}
children {
  resource_desc {
    uuid: "181d78d7-0c42-4bcb-8c2b-36698d856c35"
    friendly_name: "Socket #0"
    type: RESOURCE_SOCKET
  }
  children {
    resource_desc {
      uuid: "47a7027e-ae17-4524-a920-947046e3389b"
      friendly_name: "L2d(6144KB)"
      type: RESOURCE_CACHE
    }
    children {
      resource_desc {
        uuid: "0dc390a0-d269-4350-acd2-bda6ed380193"
        friendly_name: "L1d(32KB)"
        type: RESOURCE_CACHE
      }
      children {
        resource_desc {
          uuid: "14671e39-238a-4e68-98c0-4c60184be6e0"
          friendly_name: "Core #0"
          type: RESOURCE_CORE
        }
        children {
          resource_desc {
            uuid: "142edfe7-ff48-4a4d-9e78-2980130f56c0"
            friendly_name: "PU #0"
            type: RESOURCE_PU
          }
          parent_id: "14671e39-238a-4e68-98c0-4c60184be6e0"
        }
        parent_id: "0dc390a0-d269-4350-acd2-bda6ed380193"
      }
      parent_id: "47a7027e-ae17-4524-a920-947046e3389b"
    }
    children {
      resource_desc {
        uuid: "47d9fbee-a671-4d2d-8ea5-b5e452972773"
        friendly_name: "L1d(32KB)"
        type: RESOURCE_CACHE
      }
      children {
        resource_desc {
          uuid: "0369cffa-83ad-4438-802e-b39e704521ff"
          friendly_name: "Core #1"
          type: RESOURCE_CORE
        }
        children {
          resource_desc {
            uuid: "f2746005-abee-4363-958a-d7ed49cb836a"
            friendly_name: "PU #1"
            type: RESOURCE_PU
          }
          parent_id: "0369cffa-83ad-4438-802e-b39e704521ff"
        }
        parent_id: "47d9fbee-a671-4d2d-8ea5-b5e452972773"
      }
      parent_id: "47a7027e-ae17-4524-a920-947046e3389b"
    }
    parent_id: "181d78d7-0c42-4bcb-8c2b-36698d856c35"
  }
  parent_id: "a9605cf6-f45c-4b50-95d0-318637033ffc"
}
I1129 15:44:08.389194  1719 pb_utils.cc:87] BFSTraversal of resource topology, reached c2a50435-1b12-4ed0-946b-e8f907901452, invoking callback [1]
I1129 15:44:08.389389  1719 coordinator.cc:176] Adding resource c2a50435-1b12-4ed0-946b-e8f907901452 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.389544  1719 coordinator.cc:176] Adding resource a9605cf6-f45c-4b50-95d0-318637033ffc to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390146  1719 coordinator.cc:176] Adding resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390228  1719 coordinator.cc:176] Adding resource 47a7027e-ae17-4524-a920-947046e3389b to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390328  1719 coordinator.cc:176] Adding resource 0dc390a0-d269-4350-acd2-bda6ed380193 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390409  1719 coordinator.cc:176] Adding resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390498  1719 coordinator.cc:176] Adding resource 14671e39-238a-4e68-98c0-4c60184be6e0 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390568  1719 coordinator.cc:176] Adding resource 0369cffa-83ad-4438-802e-b39e704521ff to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390658  1719 coordinator.cc:176] Adding resource 142edfe7-ff48-4a4d-9e78-2980130f56c0 to resource map; endpoint URI is tcp:10.0.1.101:55556
I1129 15:44:08.390732  1719 pb_utils.cc:87] BFSTraversal of resource topology, reached c2a50435-1b12-4ed0-946b-e8f907901452, invoking callback [1]
I1129 15:44:08.390825  1719 flow_graph_manager.cc:1146] Considering resource c2a50435-1b12-4ed0-946b-e8f907901452, which is 2
I1129 15:44:08.390902  1719 pb_utils.cc:87] BFSTraversal of resource topology, reached a9605cf6-f45c-4b50-95d0-318637033ffc, invoking callback [1]
I1129 15:44:08.390992  1719 flow_graph_manager.cc:385] Adding node for resource a9605cf6-f45c-4b50-95d0-318637033ffc (Machine #0(2002MB)), type RESOURCE_MACHINE
I1129 15:44:08.391134  1719 flow_graph_manager.cc:951] Resource c2a50435-1b12-4ed0-946b-e8f907901452 is represented by node 2
I1129 15:44:08.391297  1719 flow_graph_manager.cc:423] Adding missing arc from parent c2a50435-1b12-4ed0-946b-e8f907901452(2) to a9605cf6-f45c-4b50-95d0-318637033ffc(3).
I1129 15:44:08.391397  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.391504  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.391644  1719 flow_graph_manager.cc:951] Resource c2a50435-1b12-4ed0-946b-e8f907901452 is represented by node 2
I1129 15:44:08.391757  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.392446  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.392613  1719 flow_graph_manager.cc:385] Adding node for resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 (Socket #0), type RESOURCE_SOCKET
I1129 15:44:08.392709  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.392804  1719 flow_graph_manager.cc:423] Adding missing arc from parent a9605cf6-f45c-4b50-95d0-318637033ffc(3) to 181d78d7-0c42-4bcb-8c2b-36698d856c35(4).
I1129 15:44:08.392875  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.392966  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.393090  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.393195  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.393280  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.393391  1719 flow_graph_manager.cc:385] Adding node for resource 47a7027e-ae17-4524-a920-947046e3389b (L2d(6144KB)), type RESOURCE_CACHE
I1129 15:44:08.393488  1719 flow_graph_manager.cc:951] Resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 is represented by node 4
I1129 15:44:08.393591  1719 flow_graph_manager.cc:423] Adding missing arc from parent 181d78d7-0c42-4bcb-8c2b-36698d856c35(4) to 47a7027e-ae17-4524-a920-947046e3389b(5).
I1129 15:44:08.393661  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.393749  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.393841  1719 flow_graph_manager.cc:951] Resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 is represented by node 4
I1129 15:44:08.393934  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394023  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394155  1719 flow_graph_manager.cc:385] Adding node for resource 0dc390a0-d269-4350-acd2-bda6ed380193 (L1d(32KB)), type RESOURCE_CACHE
I1129 15:44:08.394233  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.394330  1719 flow_graph_manager.cc:423] Adding missing arc from parent 47a7027e-ae17-4524-a920-947046e3389b(5) to 0dc390a0-d269-4350-acd2-bda6ed380193(6).
I1129 15:44:08.394399  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394487  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394573  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.394672  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394760  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.394867  1719 flow_graph_manager.cc:385] Adding node for resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 (L1d(32KB)), type RESOURCE_CACHE
I1129 15:44:08.394945  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.395050  1719 flow_graph_manager.cc:423] Adding missing arc from parent 47a7027e-ae17-4524-a920-947046e3389b(5) to 47d9fbee-a671-4d2d-8ea5-b5e452972773(7).
I1129 15:44:08.395126  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.396215  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.396350  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.396466  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.396569  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.396700  1719 flow_graph_manager.cc:385] Adding node for resource 14671e39-238a-4e68-98c0-4c60184be6e0 (Core #0), type RESOURCE_CORE
I1129 15:44:08.396786  1719 flow_graph_manager.cc:951] Resource 0dc390a0-d269-4350-acd2-bda6ed380193 is represented by node 6
I1129 15:44:08.396908  1719 flow_graph_manager.cc:423] Adding missing arc from parent 0dc390a0-d269-4350-acd2-bda6ed380193(6) to 14671e39-238a-4e68-98c0-4c60184be6e0(8).
I1129 15:44:08.397078  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.397824  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.397922  1719 flow_graph_manager.cc:951] Resource 0dc390a0-d269-4350-acd2-bda6ed380193 is represented by node 6
I1129 15:44:08.398015  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398098  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398206  1719 flow_graph_manager.cc:385] Adding node for resource 0369cffa-83ad-4438-802e-b39e704521ff (Core #1), type RESOURCE_CORE
I1129 15:44:08.398283  1719 flow_graph_manager.cc:951] Resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 is represented by node 7
I1129 15:44:08.398389  1719 flow_graph_manager.cc:423] Adding missing arc from parent 47d9fbee-a671-4d2d-8ea5-b5e452972773(7) to 0369cffa-83ad-4438-802e-b39e704521ff(9).
I1129 15:44:08.398458  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398548  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398641  1719 flow_graph_manager.cc:951] Resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 is represented by node 7
I1129 15:44:08.398735  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398819  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.398936  1719 flow_graph_manager.cc:385] Adding node for resource 142edfe7-ff48-4a4d-9e78-2980130f56c0 (PU #0), type RESOURCE_PU
I1129 15:44:08.399025  1719 flow_graph_manager.cc:951] Resource 14671e39-238a-4e68-98c0-4c60184be6e0 is represented by node 8
I1129 15:44:08.399132  1719 flow_graph_manager.cc:423] Adding missing arc from parent 14671e39-238a-4e68-98c0-4c60184be6e0(8) to 142edfe7-ff48-4a4d-9e78-2980130f56c0(10).
I1129 15:44:08.399207  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.399302  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.399449  1719 flow_graph_manager.cc:724] Considering node 142edfe7-ff48-4a4d-9e78-2980130f56c0, which has parent 14671e39-238a-4e68-98c0-4c60184be6e0
I1129 15:44:08.399590  1719 flow_graph_manager.cc:727] Adding arc from leaf resource 142edfe7-ff48-4a4d-9e78-2980130f56c0 to sink node.
I1129 15:44:08.399716  1719 flow_graph_manager.cc:951] Resource 142edfe7-ff48-4a4d-9e78-2980130f56c0 is represented by node 10
I1129 15:44:08.399806  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.399874  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.400159  1719 flow_graph_manager.cc:951] Resource 14671e39-238a-4e68-98c0-4c60184be6e0 is represented by node 8
I1129 15:44:08.400236  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.400346  1719 flow_graph_manager.cc:757] Adding capacity on edge from 14671e39-238a-4e68-98c0-4c60184be6e0 (8) to 10 (0 -> 1)
I1129 15:44:08.400441  1719 flow_graph_manager.cc:951] Resource 0dc390a0-d269-4350-acd2-bda6ed380193 is represented by node 6
I1129 15:44:08.400532  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.400596  1719 flow_graph_manager.cc:757] Adding capacity on edge from 0dc390a0-d269-4350-acd2-bda6ed380193 (6) to 8 (0 -> 1)
I1129 15:44:08.400710  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.400779  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.400867  1719 flow_graph_manager.cc:757] Adding capacity on edge from 47a7027e-ae17-4524-a920-947046e3389b (5) to 6 (0 -> 1)
I1129 15:44:08.400939  1719 flow_graph_manager.cc:951] Resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 is represented by node 4
I1129 15:44:08.401036  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.401103  1719 flow_graph_manager.cc:757] Adding capacity on edge from 181d78d7-0c42-4bcb-8c2b-36698d856c35 (4) to 5 (0 -> 1)
I1129 15:44:08.401196  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.401267  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.401666  1719 flow_graph_manager.cc:757] Adding capacity on edge from a9605cf6-f45c-4b50-95d0-318637033ffc (3) to 4 (0 -> 1)
I1129 15:44:08.401818  1719 flow_graph_manager.cc:951] Resource c2a50435-1b12-4ed0-946b-e8f907901452 is represented by node 2
I1129 15:44:08.401944  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.402052  1719 flow_graph_manager.cc:757] Adding capacity on edge from c2a50435-1b12-4ed0-946b-e8f907901452 (2) to 3 (0 -> 1)
I1129 15:44:08.402160  1719 flow_graph_manager.cc:385] Adding node for resource f2746005-abee-4363-958a-d7ed49cb836a (PU #1), type RESOURCE_PU
I1129 15:44:08.402251  1719 flow_graph_manager.cc:951] Resource 0369cffa-83ad-4438-802e-b39e704521ff is represented by node 9
I1129 15:44:08.402354  1719 flow_graph_manager.cc:423] Adding missing arc from parent 0369cffa-83ad-4438-802e-b39e704521ff(9) to f2746005-abee-4363-958a-d7ed49cb836a(11).
I1129 15:44:08.402451  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.402550  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.402640  1719 flow_graph_manager.cc:724] Considering node f2746005-abee-4363-958a-d7ed49cb836a, which has parent 0369cffa-83ad-4438-802e-b39e704521ff
I1129 15:44:08.402739  1719 flow_graph_manager.cc:727] Adding arc from leaf resource f2746005-abee-4363-958a-d7ed49cb836a to sink node.
I1129 15:44:08.402811  1719 flow_graph_manager.cc:951] Resource f2746005-abee-4363-958a-d7ed49cb836a is represented by node 11
I1129 15:44:08.402897  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.402961  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.403053  1719 flow_graph_manager.cc:951] Resource 0369cffa-83ad-4438-802e-b39e704521ff is represented by node 9
I1129 15:44:08.403118  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.403204  1719 flow_graph_manager.cc:757] Adding capacity on edge from 0369cffa-83ad-4438-802e-b39e704521ff (9) to 11 (0 -> 1)
I1129 15:44:08.403271  1719 flow_graph_manager.cc:951] Resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 is represented by node 7
I1129 15:44:08.403367  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.403439  1719 flow_graph_manager.cc:757] Adding capacity on edge from 47d9fbee-a671-4d2d-8ea5-b5e452972773 (7) to 9 (0 -> 1)
I1129 15:44:08.403533  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.403599  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.403684  1719 flow_graph_manager.cc:757] Adding capacity on edge from 47a7027e-ae17-4524-a920-947046e3389b (5) to 7 (0 -> 1)
I1129 15:44:08.403751  1719 flow_graph_manager.cc:951] Resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 is represented by node 4
I1129 15:44:08.403837  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.403900  1719 flow_graph_manager.cc:757] Adding capacity on edge from 181d78d7-0c42-4bcb-8c2b-36698d856c35 (4) to 5 (1 -> 2)
I1129 15:44:08.404158  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.404320  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.404417  1719 flow_graph_manager.cc:757] Adding capacity on edge from a9605cf6-f45c-4b50-95d0-318637033ffc (3) to 4 (1 -> 2)
I1129 15:44:08.404489  1719 flow_graph_manager.cc:951] Resource c2a50435-1b12-4ed0-946b-e8f907901452 is represented by node 2
I1129 15:44:08.404580  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.404644  1719 flow_graph_manager.cc:757] Adding capacity on edge from c2a50435-1b12-4ed0-946b-e8f907901452 (2) to 3 (1 -> 2)
I1129 15:44:08.404737  1719 pb_utils.cc:87] BFSTraversal of resource topology, reached a9605cf6-f45c-4b50-95d0-318637033ffc, invoking callback [1]
I1129 15:44:08.404821  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.404954  1719 flow_graph_manager.cc:325] Adding resource equiv classes for node 3
I1129 15:44:08.405284  1719 map-util.h:84] FindPtrOrNull:  key not found!
I1129 15:44:08.405441  1719 flow_graph_manager.cc:328]    EC: 401933894075308583
I1129 15:44:08.405544  1719 flow_graph_manager.cc:331]     Adding node EC node for 401933894075308583
I1129 15:44:08.405675  1719 flow_graph_manager.cc:144] Add equiv class 401933894075308583
I1129 15:44:08.405798  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.405917  1719 flow_graph_manager.cc:951] Resource a9605cf6-f45c-4b50-95d0-318637033ffc is represented by node 3
I1129 15:44:08.405995  1719 map-util.h:84] FindPtrOrNull:  key not found!
I1129 15:44:08.406225  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.406291  1719 map-util.h:88] FindPtrOrNull:  found key, not null
I1129 15:44:08.406442  1719 flow_graph_manager.cc:536]     adding arc from EC node 12 to 3 at cap 2, cost 0!
I1129 15:44:08.406581  1719 flow_graph_manager.cc:159] Adding equivalence class node, with change c AddEquivClassNode
n 12 0 0
a 12 3 0 2 0
I1129 15:44:08.406800  1719 flow_graph_manager.cc:951] Resource 181d78d7-0c42-4bcb-8c2b-36698d856c35 is represented by node 4
I1129 15:44:08.406903  1719 flow_graph_manager.cc:951] Resource 47a7027e-ae17-4524-a920-947046e3389b is represented by node 5
I1129 15:44:08.407006  1719 flow_graph_manager.cc:951] Resource 0dc390a0-d269-4350-acd2-bda6ed380193 is represented by node 6
I1129 15:44:08.407073  1719 flow_graph_manager.cc:951] Resource 47d9fbee-a671-4d2d-8ea5-b5e452972773 is represented by node 7
I1129 15:44:08.407160  1719 flow_graph_manager.cc:951] Resource 14671e39-238a-4e68-98c0-4c60184be6e0 is represented by node 8
I1129 15:44:08.407227  1719 flow_graph_manager.cc:951] Resource 0369cffa-83ad-4438-802e-b39e704521ff is represented by node 9
I1129 15:44:08.407318  1719 flow_graph_manager.cc:951] Resource 142edfe7-ff48-4a4d-9e78-2980130f56c0 is represented by node 10
I1129 15:44:08.407385  1719 flow_graph_manager.cc:951] Resource f2746005-abee-4363-958a-d7ed49cb836a is represented by node 11
I1129 15:44:08.407475  1719 flow_graph_manager.cc:1146] Considering resource a9605cf6-f45c-4b50-95d0-318637033ffc, which is 3
I1129 15:44:08.407546  1719 flow_graph_manager.cc:1146] Considering resource 181d78d7-0c42-4bcb-8c2b-36698d856c35, which is 4
I1129 15:44:08.407644  1719 flow_graph_manager.cc:1146] Considering resource 47a7027e-ae17-4524-a920-947046e3389b, which is 5
I1129 15:44:08.407711  1719 flow_graph_manager.cc:1146] Considering resource 0dc390a0-d269-4350-acd2-bda6ed380193, which is 6
I1129 15:44:08.407799  1719 flow_graph_manager.cc:1146] Considering resource 47d9fbee-a671-4d2d-8ea5-b5e452972773, which is 7
I1129 15:44:08.407866  1719 flow_graph_manager.cc:1146] Considering resource 14671e39-238a-4e68-98c0-4c60184be6e0, which is 8
I1129 15:44:08.408102  1719 flow_graph_manager.cc:1146] Considering resource 0369cffa-83ad-4438-802e-b39e704521ff, which is 9
I1129 15:44:08.408197  1719 flow_graph_manager.cc:1146] Considering resource 142edfe7-ff48-4a4d-9e78-2980130f56c0, which is 10
I1129 15:44:08.408288  1719 flow_graph_manager.cc:1146] Considering resource f2746005-abee-4363-958a-d7ed49cb836a, which is 11
I1129 15:44:08.408357  1719 flow_graph_manager.cc:1200] Updated resource topology in flow scheduler.
I1129 15:44:08.408454  1719 flow_scheduler.cc:449] Updating resource statistics in flow graph
I1129 15:44:08.408550  1719 map-util.h:84] FindPtrOrNull:  key not found!
F1129 15:44:08.408639  1719 octopus_cost_model.cc:191] Check failed: 'rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fc364fe5daa  (unknown)
    @     0x7fc364fe5ce4  (unknown)
    @     0x7fc364fe56e6  (unknown)
    @     0x7fc364fe8687  (unknown)
    @           0x613a9f  firmament::OctopusCostModel::GatherStats()
    @           0x643bd3  firmament::FlowGraphManager::ComputeTopologyStatistics()
    @           0x611a19  firmament::scheduler::FlowScheduler::UpdateCostModelResourceStats()
    @           0x612769  firmament::scheduler::FlowScheduler::RegisterResource()
    @           0x590f71  firmament::Coordinator::AddResource()
    @           0x6a4b05  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x5909d1  firmament::Coordinator::DetectLocalResources()
    @           0x5912d2  firmament::Coordinator::Run()
    @           0x551b95  main
    @     0x7fc361e5bec5  (unknown)
    @           0x55179d  (unknown)
    @              (nil)  (unknown)
Aborted (core dumped)

EDIT: The lines starting with "FindPtrOrNull:" were added by me, but the problem itself happens to me even with unmodified code (git sha 7566d50).

ms705 commented 8 years ago

Thanks for the report, verified to be an issue on HEAD.

I believe I have a fix for this somewhere on one of our test clusters; I will dig it out and push it upstream. The issue has cropped up before: the resource topology part of the flow graph isn't fully initialised when the scheduler makes its first pass of cost model statistics updates over it (see the call chain of Coordinator::AddResource() --> scheduler::FlowScheduler::RegisterResource() --> scheduler::FlowScheduler::UpdateCostModelResourceStats()).
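As a rough illustration of that ordering problem (and explicitly not the fix that was later committed, whose contents are not shown in this thread), the sketch below walks a list of topology resource IDs and gathers statistics only for those already present in a resource map, recording the rest instead of aborting. All names here (`ResourceStatus`, `ResourceMap`, `GatherStatsTolerant`, the `num_running_tasks` field, and the short IDs used to exercise it) are assumptions made up for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-resource record; the real structure is not shown here.
struct ResourceStatus {
  int num_running_tasks = 0;
};

using ResourceMap = std::unordered_map<std::string, ResourceStatus*>;

// Visits every resource ID in the topology. Resources that have not been
// registered in the map yet are appended to `skipped` rather than tripping
// a fatal non-NULL check. Returns how many resources were visited
// successfully.
int GatherStatsTolerant(const ResourceMap& resource_map,
                        const std::vector<std::string>& topology_ids,
                        std::vector<std::string>* skipped) {
  assert(skipped != nullptr);
  int gathered = 0;
  for (const std::string& id : topology_ids) {
    auto it = resource_map.find(id);
    if (it == resource_map.end() || it->second == nullptr) {
      skipped->push_back(id);  // topology knows it, the map does not (yet)
      continue;
    }
    ++gathered;  // a real cost model would accumulate statistics here
  }
  return gathered;
}
```

Whether skipping, deferring the first statistics pass, or populating the map earlier is the right remedy depends on the scheduler's invariants; the sketch only shows why a hard check fails when the two structures are filled at different times.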

ms705 commented 8 years ago

Fix now under review in CL 254197.

ms705 commented 8 years ago

Fixed in d5f9dc3e63521a192b46ff9e77cabadba9bc262d.