camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0
415 stars 79 forks source link

Failed to run the simulator #66

Open marceloamaral opened 6 years ago

marceloamaral commented 6 years ago

Hi, I am trying to run the simulation with synthetic trace, but it is failing. I have tried the solvers (cs2 and flowlessly), and the cost models (0 and 6). The general configuration that I have used is as shown below:

build-release/src/simulator \ --simulation=synthetic \ --synthetic_num_jobs=100 \ --synthetic_num_machines=10 \ --synthetic_machine_failure_duration=0 \ --synthetic_task_duration=2 \ --synthetic_tasks_per_job=2 \ --runtime=100000000000 \ --scheduler=flow \ --flow_scheduling_cost_model=6 \ --preemption \ --simulated_dfs_type=bounded \ --simulated_block_size=1073741824 \ --max_sample_queue_size=10 \ --solver=cs2 \ --log_solver_stderr \ --max_solver_runtime=100000000000 \ --machine_tmpl_file=../../tests/testdata/mach_16pus.pbin \ --generate_trace \ --generated_trace_path=firmament/results/simu-release/trace-path \ --generate_quincy_cost_model_trace \ --log_dir=firmament/results/simu-release/log \ --quincy_no_scheduling_delay \ --online_factor 1 -v 10


For the cost model 0, its is failing with the following error: F0425 18:59:30.177265 24372 trivial_cost_model.cc:139] Check failed: leaf_resids->size() >= FLAGS_num_pref_arcs_task_to_res (0 vs. 1) The traces: results/simu-release/trace-path/task_events/part-00000-of-00500.csv 1000000,,1,1,,0,,,,,,, results/simu-release/trace-path/machine_events/part-00000-of-00001.csv 0,1,0,,, 0,2,0,,, 0,3,0,,, 0,4,0,,, 0,5,0,,, 0,6,0,,, 0,7,0,,, 0,8,0,,, 0,9,0,,, 0,10,0,,,


For the cost model 6, its is failing with the following error: Error in `firmament/build-release/src/simulator': corrupted double-linked list: 0x000000000133bdb0 The log is saying: W0425 18:59:58.946012 24390 trace_generator.cc:264] 100% of tasks are unscheduled results/simu-release/trace-path/scheduler_events/scheduler_events.csv 1000000,388,0,930,2,0,0,2,2,10,16,0,25,1,0,2,11,1,1,1,2,0,10,0,10,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0 1000388,358,0,934,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1000746,459,0,1085,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 1001205,460,0,1067,2,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 etc... results/simu-release/trace-path/task_events/part-00000-of-00500.csv 1000000,,1,1,,0,,,,,,, 1000000,,1,2,,0,,,,,,, 2000000,,2,1,,0,,,,,,, 2000000,,2,2,,0,,,,,,, 3000000,,3,1,,0,,,,,,, 3000000,,3,2,,0,,,,,,,

results/simu-release/trace-path/machine_events/part-00000-of-00001.csv 0,1,0,,, 0,2,0,,, 0,3,0,,, 0,4,0,,, 0,5,0,,, 0,6,0,,, 0,7,0,,, 0,8,0,,, 0,9,0,,, 0,10,0,,, 1012484,10,1,,, 1012484,10,0,,,


For the COCO model, it is generating at least some traces, but for the TRIVIAL, it has almost no trace generated. I am most probably missing some configuration. Please, could you help me?

Thanks!

ms705 commented 6 years ago

Hi @marceloamaral,

This is definitely not expected behavior. @ICGog and I are currently quite busy with a major deadline, but we can dig into this after May 3rd.

I saw that you posted a bunch of other comments debugging this (which strangely GitHub doesn't show for me). One reason the machine_tmpl->ParseFromFileDescriptor(fd) call would fail is if it cannot find the machine template .pbin file. Check the path passed in the command line argument -- IIRC, the default path is relative to the Firmament root directory, and it may be incorrect if you run the simulator binary from a different directory.

marceloamaral commented 6 years ago

Hi @ms705, thank you for replying!

Unfortunately, I was using an older source code version, the one from the branch https://github.com/Huawei-PaaS/firmament. After I realized that, I have updated to the master version from this repository.

The good news is that It solved the previous problem of machine_tmpl->ParseFromFileDescriptor(fd). As you mentioned, it was most probably some paths that were wrong. However, I am still getting some problems... The problem now is:

F0427 17:13:55.563627 7817 flow_graph_node.cc:43] Check failed: InsertIfNotPresent(&outgoing_arcmap, arc->dst_, arc)

This error is from the call CHECK(InsertIfNotPresent(&outgoing_arcmap, arc->dst_, arc)); Debugging this, I found that the call InsertIfNotPresent returns false if the arch was updated and true only if a new one is inserted. So, my question here is, is it expected that the arcs will never be updated in this call? If yes, why/which situations it might fail and update the arc.

In a quick "try and error approach", I have disabled this CHECK just in case. However, it just "transferred" the problem to the flowlessly module as shown in the log below. The log shows that the flowlessly module has errors, but the simulation itself keep executing, but the log stop to show new data, maybe it reaches a deadlock situation... I'm not sure... Anyway, i believe that there is a strong constraint in update the arcs in this situation.

Might it be related to the simulation parameters? For which parameters can you run it successfully? Also, typically, with the default parameters from the synthetic trace, that is 100 jobs and 10 machines, how long does it take to execute?

Please find below the used parameters that I am using and the logs of the execution:


--simulation=synthetic
--synthetic_num_jobs=100
--synthetic_num_machines=10
--synthetic_machine_failure_duration=0
--synthetic_machine_failure_rate=0
--synthetic_task_duration=10000000
--synthetic_tasks_per_job=2
--synthetic_task_runtime=1000000
--runtime=1000000000000
--scheduler=flow
--flow_scheduling_cost_model=0
--preemption
--simulated_dfs_type=bounded
--simulated_block_size=1073741824
--max_sample_queue_size=10
--online_factor=0.1
--solver=flowlessly
--log_solver_stderr
--max_solver_runtime=100000000
--machine_tmpl_file=../../tests/testdata/mach_16pus.pbin
--generate_trace
--generated_trace_path=/home/mamaral/workspace/firmament/results/simu-debug/trace-path
--generate_quincy_cost_model_trace
--log_dir=/home/mamaral/workspace/firmament/results/simu-debug/log
--quincy_no_scheduling_delay
--incremental_flow=false
--flowlessly_algorithm=cycle_cancelling
--trace_path=/home/mamaral/workspace/firmament/results/simu-debug/trace-path
--debug_tasks
--debug_flow_graph=true
--log_solver_stderr=true
-v=6


simulator.WARNING W0427 17:43:52.968793 21438 solver_dispatcher.cc:143] STDERR from solver: I0427 17:43:52.968731 21437 flow_scheduler.cc:186] ----- Cycle cancelling min cost flow ----- W0427 17:43:53.005383 21441 solver_dispatcher.cc:143] STDERR from solver: I0427 17:43:53.005316 21440 flow_scheduler.cc:186] ----- Cycle cancelling min cost flow ----- W0427 17:43:53.031674 21441 solver_dispatcher.cc:143] STDERR from solver: F0427 17:43:53.031594 21440 adjacency_map_graph.cc:111] The arc (174, 55) already exists W0427 17:43:53.031760 21441 solver_dispatcher.cc:143] STDERR from solver: Check failure stack trace: W0427 17:43:53.031956 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x7f6d1db255cd google::LogMessage::Fail() W0427 17:43:53.032065 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x7f6d1db27433 google::LogMessage::SendToLog() W0427 17:43:53.032189 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x7f6d1db2515b google::LogMessage::Flush() W0427 17:43:53.032292 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x7f6d1db27e1e google::LogMessageFatal::~LogMessageFatal() W0427 17:43:53.032336 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x42156e flowlessly::AdjacencyMapGraph::AddArc() W0427 17:43:53.032364 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x42451c flowlessly::Graph::ReadGraph() W0427 17:43:53.032426 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x40f040 ReadGraph() W0427 17:43:53.032452 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x41017a RunSolver() W0427 17:43:53.032517 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x40def4 main W0427 17:43:53.032644 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x7f6d1cdcf830 __libc_start_main W0427 17:43:53.032707 21441 solver_dispatcher.cc:143] STDERR from solver: @ 0x40edc9 _start W0427 17:43:53.032794 21441 solver_dispatcher.cc:143] STDERR from solver: @ (nil) (unknown)


log/simulator.INFO

[...] I0427 17:43:52.999543 21435 event_driven_scheduler.cc:589] Next active task considered is 573367269254525211 I0427 17:43:53.000797 21435 event_driven_scheduler.cc:539] Performing lazy graph reduction I0427 17:43:53.000808 21435 event_driven_scheduler.cc:544] for a job with 0 outputs I0427 17:43:53.000823 21435 event_driven_scheduler.cc:589] Next active task considered is 573367269254525212 I0427 17:43:53.000828 21435 event_driven_scheduler.cc:589] Next active task considered is 573367269254525211 I0427 17:43:53.000854 21435 event_driven_scheduler.cc:539] Performing lazy graph reduction I0427 17:43:53.000859 21435 event_driven_scheduler.cc:544] for a job with 0 outputs I0427 17:43:53.000872 21435 event_driven_scheduler.cc:589] Next active task considered is 15563142114659772486 I0427 17:43:53.000893 21435 event_driven_scheduler.cc:589] Next active task considered is 15563142114659772487 I0427 17:43:53.000924 21435 event_driven_scheduler.cc:539] Performing lazy graph reduction I0427 17:43:53.000932 21435 event_driven_scheduler.cc:544] for a job with 0 outputs I0427 17:43:53.000944 21435 event_driven_scheduler.cc:589] Next active task considered is 15563142114659772486 I0427 17:43:53.000950 21435 event_driven_scheduler.cc:589] Next active task considered is 15563142114659772487 I0427 17:43:53.000972 21435 flow_scheduler.cc:625] Updating resource statistics in flow graph I0427 17:43:53.001760 21435 flow_scheduler.cc:495] Flow scheduler updating time-dependent costs. I0427 17:43:53.002234 21435 solver_dispatcher.cc:182] Writing flow graph debug info into /tmp/firmament-debug/debug_1.dm I0427 17:43:53.003440 21435 utils.cc:307] Executing externally: /home/mamaral/workspace/firmament/build-debug/third_party/flowlessly/src/flowlessly-build/flow_scheduler I0427 17:43:53.003482 21435 utils.cc:341] External execution of command: /home/mamaral/workspace/firmament/build-debug/third_party/flowlessly/src/flowlessly-build/flow_scheduler --graph_has_node_types=true --algorithm=cycle_cancelling --print_assignments=true --daemon=false caling_factor= I0427 17:43:53.003834 21435 utils.cc:388] Subprocess with PID 21440 created. I0427 17:43:53.003870 21435 solver_dispatcher.cc:217] Solver running (PID: 21440), CHILD_READ: 16, CHILD_WRITE_STD: 19, CHILD_WRITE_ERR: 21, PARENT_WRITE: 17, PARENT_READ_STD: 18, PARENT_READ_ERR: 20 W0427 17:43:53.005383 21441 solver_dispatcher.cc:143] STDERR from solver: I0427 17:43:53.005316 21440 flow_scheduler.cc:186] ----- Cycle cancelling min cost flow ----- W0427 17:43:53.031674 21441 solver_dispatcher.cc:143] STDERR from solver: F0427 17:43:53.031594 21440 adjacency_map_graph.cc:111] The arc (174, 55) already exists W0427 17:43:53.031760 21441 solver_dispatcher.cc:143] STDERR from solver: Check failure stack trace: W0427 17:43:53.031956 21441 [it stops in this line...]

marceloamaral commented 6 years ago

To debug a little bit more; I started to print all the AddArc src dst and type, and it turns out that it is failing when "adding" a scheduled task. Therefore, since the updating a scheduled task is typically related to preemption, I have disabled the preemption from the parameters. As expected, It stopped to fail! So it seems to be working without preemption.

Additionally, is it written somewhere how to enable the statistics? I mean, after running the simulation, the some generated trace are empty, such as the "task_usage_stat"...

What do the columns mean in the traces? For example, the "scheduler_events.csv" has many columns.


results/simu-debug/trace-path/scheduler_events/scheduler_events.csv 1000000,32977,29,38547,0,0,0,2,2,2,41,0,78,4,7,2,35,2,1,1,4,2,2,0,34,2,1,1,32,0,0,0,0,0,0,0,0,0,7,0,0,0,0,3,0,0,0,0,1,0,0 2000000,28278,41,33081,0,0,0,4,2,0,4,0,11,9,8,2,0,1,1,0,4,2,0,0,0,2,1,2,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,3,0,2,0,2,0,2,0

results/simu-debug/trace-path/machines_to_racks/machines_to_racks.csv

results/simu-debug/trace-path/quincy_tasks/quincy_tasks.csv

results/simu-debug/trace-path/task_usage_stat/task_usage_stat.csv

results/simu-debug/trace-path/jobs_num_tasks/jobs_num_tasks.csv 1,2 2,2

results/simu-debug/trace-path/task_events/part-00000-of-00500.csv 1000000,,1,1,,0,,,,,,, 1000000,,1,2,,0,,,,,,, 1000000,,1,1,1,1,,,,,,, 1000000,,1,2,2,1,,,,,,, 2000000,,2,1,,0,,,,,,, 2000000,,2,2,,0,,,,,,, 2000041,,2,1,1,1,,,,,,, 2000041,,2,2,1,1,,,,,,, 11000000,,1,1,1,4,,,,,,, 11000000,,1,2,2,4,,,,,,, 12000041,,2,1,1,4,,,,,,, 12000041,,2,2,1,4,,,,,,,

results/simu-debug/trace-path/machine_events/part-00000-of-00001.csv 0,1,0,,, 0,2,0,,,

results/simu-debug/trace-path/tasks_to_blocks/tasks_to_blocks.csv

results/simu-debug/trace-path/task_runtime_events/task_runtime_events.csv 1,1,1,1000000,10000000,10000000,1 1,2,1,1000000,10000000,10000000,1 2,1,2,2000000,10000000,10000000,1 2,2,2,2000000,10000000,10000000,1

results/simu-debug/trace-path/dfs_events/dfs_events.csv

ms705 commented 5 years ago

@marceloamaral The schema of the CSV files generated is generally the same as the schema of the Google cluster trace. We added some additional files, and for those, you'll find the schema in the code that generates the file (or you can ask us).

To get data in the task_usage_stats.csv file, you want to pass the argument --aggregate_task_usage.