Shiv's Talk

Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs. Konstantinos Kanellis, Ramnatthan Alagappan, and Shivaram Venkataraman. HotStorage '20. https://www.usenix.org/system/files/hotstorage20_paper_kanellis.pdf

ML for DB configuration tuning:

Random forest => predict the best config.
When there are many knobs, we may first need to run some experiments to find the important ones and shrink the parameter space before tuning.
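A minimal sketch of the "pre-select important knobs, then tune only those" idea. It swaps the random-forest importance model for a simpler one-at-a-time sensitivity score (sweep one knob while the others stay at their defaults), and `benchmark` is a hypothetical synthetic function standing in for a real database run, not data from the paper.

```python
import random

# HYPOTHETICAL stand-in for a DB benchmark: throughput as a function of six
# knobs normalised to [0, 1]. Synthetic; only knobs 0 and 3 matter much.
def benchmark(config):
    return 100 * config[0] + 60 * (1 - abs(config[3] - 0.5)) + 2 * config[1]

NUM_KNOBS = 6
DEFAULT = [0.5] * NUM_KNOBS

def knob_sensitivity(knob, samples=5):
    """One-at-a-time sensitivity: sweep one knob with the others held at their
    defaults and return the spread (max - min) of measured performance."""
    scores = []
    for i in range(samples):
        config = list(DEFAULT)
        config[knob] = i / (samples - 1)
        scores.append(benchmark(config))
    return max(scores) - min(scores)

def preselect(k):
    """Keep the k knobs whose sweeps moved performance the most."""
    ranked = sorted(range(NUM_KNOBS), key=knob_sensitivity, reverse=True)
    return ranked[:k]

def tune(important, trials=200, seed=0):
    """Random search over the pre-selected knobs; all others stay at defaults."""
    rng = random.Random(seed)
    best_config, best_score = list(DEFAULT), benchmark(DEFAULT)
    for _ in range(trials):
        config = list(DEFAULT)
        for knob in important:
            config[knob] = rng.random()
        score = benchmark(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

important = preselect(k=2)
print(important)  # -> [0, 3]
best_config, best_score = tune(important)
print(round(best_score, 1))
```

The pre-selection step costs a few runs per knob, but the search afterwards only explores 2 dimensions instead of 6.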

ML systems: DL workload scheduling

Themis: fair scheduling of DL workloads on GPU clusters.

The key features of DL workloads:

Gang scheduling: all the resources a job needs must be allocated at the same time (valid for both data- and model-parallel training).
Monolithic resources: one GPU runs one workload. (Not strictly true: there are a number of techniques for GPU sharing/virtualization. But from the view of virtualized devices, it holds.)
Locality: with collective communication, placement matters; devices that are closer together communicate faster.
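The three properties above can be seen together in a toy placement function: GPUs are counted as whole units (monolithic), a job gets all of its GPUs or none (gang), and a single machine is preferred (locality). This is a hypothetical policy for illustration, not Themis's actual placement logic; `place_gang` and the machine names are made up.

```python
from typing import Dict, Optional

def place_gang(free_gpus: Dict[str, int], demand: int) -> Optional[Dict[str, int]]:
    """All-or-nothing placement of `demand` whole GPUs (hypothetical policy).

    Returns None when the whole gang cannot fit, since a gang-scheduled job
    cannot start with only part of its GPUs."""
    if sum(free_gpus.values()) < demand:
        return None
    # Best locality: the tightest single machine that holds the whole job.
    fitting = [m for m, n in free_gpus.items() if n >= demand]
    if fitting:
        machine = min(fitting, key=lambda m: free_gpus[m])
        return {machine: demand}
    # Otherwise take GPUs from the machines with the most free GPUs first,
    # which keeps the number of machines the job spans (and cross-machine
    # traffic) small.
    placement: Dict[str, int] = {}
    remaining = demand
    for machine in sorted(free_gpus, key=lambda m: free_gpus[m], reverse=True):
        take = min(free_gpus[machine], remaining)
        if take > 0:
            placement[machine] = take
            remaining -= take
        if remaining == 0:
            break
    return placement

free = {"m1": 2, "m2": 4, "m3": 1}
print(place_gang(free, 3))  # -> {'m2': 3}
print(place_gang(free, 6))  # -> {'m2': 4, 'm1': 2}
print(place_gang(free, 8))  # -> None
```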

Sharing incentive (SI):

When N apps share one cluster, each app should perform no worse than it would if it owned a private 1/N share of the cluster.

Interface: the scheduler gets ρ estimates from each app via an Agent.

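Metric: finish-time fairness ρ = T_sh / T_id, where T_sh is the app's finish time in the shared cluster and T_id its finish time on an independent 1/N share. ρ ≤ 1 means SI holds.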
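A small worked example of the metric, assuming made-up finish times for three apps. In Themis these estimates come from each app's Agent; here they are plain numbers, and `most_unfairly_treated` is a hypothetical helper showing one natural use of ρ (find the app furthest from its fair finish time).

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    t_shared: float  # T_sh: finish time in the shared cluster
    t_ideal: float   # T_id: finish time on a dedicated 1/N share

    @property
    def rho(self) -> float:
        """Finish-time fairness: rho = T_sh / T_id."""
        return self.t_shared / self.t_ideal

def satisfies_si(app: App) -> bool:
    # Sharing incentive: no worse than running alone on a private 1/N share.
    return app.rho <= 1.0

def most_unfairly_treated(apps):
    # The app with the largest rho is the furthest from its fair finish time.
    return max(apps, key=lambda a: a.rho)

apps = [App("a", 8, 10), App("b", 15, 10), App("c", 12, 12)]
for app in apps:
    print(app.name, app.rho, satisfies_si(app))
print(most_unfairly_treated(apps).name)  # -> b
```

App "a" finishes earlier than on its private share (ρ = 0.8), "c" exactly matches it (ρ = 1.0), and "b" violates SI (ρ = 1.5).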

Strawman mechanism: I didn't quite understand how it actually operates in this step. I should look at the paper.

Observations: the average job duration is 3.7 hours, yet most apps run either about 5× longer or 5× shorter than that, so durations vary widely.

Other systems:

DRF: when a task completes, allocate the freed resources to the app with the minimum metric (its dominant share); no preemption. Short jobs may therefore wait a long time behind long-running jobs.
Tiresias: metric = GPUs allocated × time; allocate resources to the jobs with the minimum metric. Drawback: ignores locality.
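A simplified sketch of both allocation rules on toy inputs. The DRF part applies the "minimum dominant share next" rule as static progressive filling (no task completions or arrivals), on a hypothetical cluster of 9 CPUs and 18 GB. The Tiresias part uses a continuous least-attained-service order with service = GPUs × hours; the real system is more involved (for example, it discretizes priorities). All job names and numbers are made up.

```python
def drf(capacity, demands):
    """DRF via progressive filling (static sketch, no task completions).

    capacity: total amount of each resource, e.g. (CPUs, GB of memory).
    demands:  user -> per-task demand vector.
    Repeatedly launches one task for the user with the lowest dominant share;
    a user drops out once its next task no longer fits."""
    used = [0.0] * len(capacity)
    tasks = {u: 0 for u in demands}
    dominant = {u: 0.0 for u in demands}
    active = set(demands)
    while active:
        user = min(active, key=lambda u: (dominant[u], u))  # ties broken by name
        d = demands[user]
        if any(used[r] + d[r] > capacity[r] for r in range(len(capacity))):
            active.discard(user)
            continue
        for r in range(len(capacity)):
            used[r] += d[r]
        tasks[user] += 1
        dominant[user] = max(tasks[user] * d[r] / capacity[r] for r in range(len(capacity)))
    return tasks

def tiresias_allocate(free_gpus, jobs):
    """Least attained service first, with attained service = GPUs x hours.

    jobs: name -> (gpus requested, hours run so far). Each job gets all of its
    GPUs or none (gang), and machine placement is ignored, mirroring the
    locality drawback noted above."""
    granted = []
    for name in sorted(jobs, key=lambda j: (jobs[j][0] * jobs[j][1], j)):
        gpus = jobs[name][0]
        if gpus <= free_gpus:
            free_gpus -= gpus
            granted.append(name)
    return granted

print(drf((9, 18), {"A": (1, 4), "B": (3, 1)}))  # -> {'A': 3, 'B': 2}
jobs = {"long": (4, 10), "short": (2, 1), "new": (4, 0), "mid": (2, 5)}
print(tiresias_allocate(8, jobs))  # -> ['new', 'short', 'mid']
```

In the Tiresias example the job with the most attained service ("long", 40 GPU-hours) is the one left waiting, while a brand-new job goes first.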


[MLOS Seminar] Systems and ML: Opportunities and Challenges for Symbiotic Research #33
