facebookresearch / CompilerGym

Reinforcement learning environments for compiler and program optimization tasks
https://compilergym.ai/
MIT License

Extend RPC interface to support multi-file benchmarks #325

Open ChrisCummins opened 3 years ago

ChrisCummins commented 3 years ago

🚀 Feature

Add support for benchmarks which comprise multiple files.

Motivation

Presently the Benchmark protobuf requires that each benchmark is a single file (called the "program"):

// Representation of the input to a compiler.
message Benchmark {
  // The name of the benchmark to add. In case of conflict with an existing
  // benchmark, this new benchmark replaces the existing one.
  string uri = 1;
  // The description of the program that is being compiled. It is up to the
  // service to determine how to interpret this file, and it is the
  // responsibility of the client to ensure that it provides the correct format.
  // For example, the service could expect that this file contains serialized
  // IR data, or an input source file.
  File program = 2;
  // An optional configuration option that details how to build and run the
  // benchmark program.
  BenchmarkDynamicConfig dynamic_config = 3;
}

In the LLVM environment, we presently support multi-file benchmarks by linking all of the source files into a single LLVM module to construct a benchmark. This means that we can't support benchmarks for which the module that we are optimizing is only a single component of a larger program.

Pitch

Replace the single-file "program" field with a new "files" field that maps file names to file contents, and a "main" field that tells the compiler which input is the "main" one (if applicable):

// Representation of the inputs to a compiler.
message Benchmark {
  // The name of the benchmark to add. In case of conflict with an existing
  // benchmark, this new benchmark replaces the existing one.
  string uri = 1;
  // A mapping from file name to file contents for each of the components that
  // make up this benchmark. It is up to the service to determine how to
  // interpret these files, and it is the responsibility of the client to ensure
  // that it provides the correct format. For example, the service could expect
  // that each file contains serialized IR data, or an input source file.
  map<string, bytes> files = 4;
  // A key into the "files" mapping that identifies the "main" file for a
  // benchmark, if applicable.
  string main = 5;
  // An optional configuration option that details how to build and run the
  // benchmark program.
  BenchmarkDynamicConfig dynamic_config = 3;
  // Deprecated fields:
  reserved 2;
  reserved "program";
}

For example:

Benchmark {
  uri = "benchmark://gcc-v0/my-app"
  files {
    "src/main.c": "..."
    "include/header.h": "..."
    "src/util.cc": "..."
  }
  main = "src/main.c"
}
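To make the service side of this concrete, here is a minimal Python sketch of how a service might write such a "files" mapping to a working directory and locate the "main" file. It uses a plain dict in place of the generated protobuf class, and the function name materialize_benchmark and its signature are hypothetical, not part of CompilerGym.

```python
import tempfile
from pathlib import Path, PurePosixPath
from typing import Dict, Optional


def materialize_benchmark(
    files: Dict[str, bytes], main: str, root: Path
) -> Optional[Path]:
    """Write each entry of the (hypothetical) "files" mapping beneath `root`.

    Returns the on-disk path of the "main" file, or None if `main` is empty.
    """
    if main and main not in files:
        raise ValueError(f"main file {main!r} is not a key of files")
    for name, contents in files.items():
        rel = PurePosixPath(name)
        # Reject empty names, absolute paths, and ".." components so that a
        # benchmark cannot write outside of `root`.
        if not rel.parts or rel.is_absolute() or ".." in rel.parts:
            raise ValueError(f"unsafe file name: {name!r}")
        dest = root.joinpath(*rel.parts)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(contents)
    if not main:
        return None
    return root.joinpath(*PurePosixPath(main).parts)


# Usage, mirroring the example above with placeholder file contents.
example_files = {
    "src/main.c": b"int main(void) { return 0; }\n",
    "include/header.h": b"#pragma once\n",
    "src/util.cc": b"// helpers\n",
}
with tempfile.TemporaryDirectory() as workdir:
    main_path = materialize_benchmark(example_files, "src/main.c", Path(workdir))
    assert main_path is not None and main_path.is_file()
```

Treating the map keys as relative POSIX paths is an assumption of this sketch; the proposal leaves the interpretation of file names to each client/service pair.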

As in the current version, it would still be up to each client/service implementation to agree on the expected file formats for these program files, and how to interpret multi-file inputs.

Alternatives

Alternative 1: do nothing

The proposed new functionality can already be achieved using the existing Benchmark schema:

  1. The client zips up all of the multi-file inputs into a single archive. If there is a main file, the client could create a file like main.txt containing its relative path.
  2. The service receives the zip file from the client, unpacks it, and reads the main.txt file, if present.

The downside of this is the extra implementation complexity for the client/service, and the runtime overhead of packing/unpacking the archive.
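For a sense of how much glue code this alternative needs, the two steps above could be sketched in Python roughly as follows. The helper names pack_benchmark and unpack_benchmark are hypothetical; only the main.txt convention comes from the description above. The resulting bytes would travel in the existing single "program" field.

```python
import io
import zipfile
from typing import Dict, Optional, Tuple

# Marker file holding the relative path of the main input. A benchmark that
# itself contains a file with this name would collide with the marker.
MAIN_MARKER = "main.txt"


def pack_benchmark(files: Dict[str, bytes], main: Optional[str] = None) -> bytes:
    """Client side: zip all inputs into a single payload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, contents in files.items():
            archive.writestr(name, contents)
        if main is not None:
            archive.writestr(MAIN_MARKER, main)
    return buf.getvalue()


def unpack_benchmark(payload: bytes) -> Tuple[Dict[str, bytes], Optional[str]]:
    """Service side: recover the inputs and the main file path, if recorded."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        files = {name: archive.read(name) for name in archive.namelist()}
    marker = files.pop(MAIN_MARKER, None)
    main = marker.decode("utf-8") if marker is not None else None
    return files, main
```

Both sides have to agree on the archive format and the marker file name out of band, which is the implementation complexity noted above.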

ChrisCummins commented 3 years ago

Thanks @hughleat for the discussion on this.

ChrisCummins commented 2 years ago

@KyleHerndon, are you still interested in implementing this as per the discussions in #584?

Cheers, Chris