bazelbuild / starlark

Starlark Language
Apache License 2.0
2.44k stars 163 forks source link

proposal: add a `set` data type #264

Open dieortin opened 1 year ago

dieortin commented 1 year ago

A set data structure, similar to the one in Python, would be a useful addition. For example, it is sometimes required to ensure no duplicate values exist in a list, and a set is the most ergonomic solution.

names = ["Thomas", "David", "Michelle", "David"] # contains duplicate names
unique_names = set(names) # building a set from the list removes duplicates

Other common use cases are:

Thoughts? Should I make an attempt at writing a proposal?

Use case example

As an example, I have run into this need while adding automatic dependency exploration in my fork of kklochkov/rules_qt, a ruleset for building Qt applications with Bazel. When generating a rule instantiation, the deps attribute cannot contain duplicate values. Because my feature analyzes Qt shared libraries in the system, and they can be duplicated with different names (for example, libQt6Quick.so and libQt6Quick.so.6.4.3) This introduces the need to enforce each dependency is only added once. Of course, there are ways to work around this problem, but it would be easily solved with a set.

ndmitchell commented 1 year ago

The standard recommendation thus far has been to use a dict with None as the values. I've used that in practice and while it's not quite as nice, it's perfectly workable.

suganoi commented 1 year ago

@dieortin @ndmitchell It's already half-done in Go: https://github.com/google/starlark-go/issues/492

dieortin commented 1 year ago

@suganoi that's nice! I have also just seen https://github.com/bazelbuild/starlark/issues/20, but it doesn't seem to include the existence of set.

IMO the set API selected for the Go version makes sense and should be added to the general spec. @laurentlb what do you think?

laurentlb commented 9 months ago

For the record, the example above can be written:

names = ["Thomas", "David", "Michelle", "David"] # contains duplicate names
unique_names = {n: None for n in names}.keys()

In many cases, people want sets just to remove duplicates; this workaround may be sufficient for that use-case.

In Bazel BUILD files, there's a tradition of using lists in most places. For example, deps and srcs are lists, even when the order is not relevant and duplicates are forbidden. In Bazel rules, sets should usually be avoided, and depsets should be used most of the time. So there were concerns that adding a native type set could cause more problems to Bazel. When a set is really needed in Bazel, a set implementation is provided in skylib (https://github.com/bazelbuild/bazel-skylib/blob/main/lib/new_sets.bzl).

As mentioned previously, Starlark-Go has a implementation for sets (https://github.com/google/starlark-go/blob/master/doc/spec.md#set - https://github.com/google/starlark-go/blob/90ade8b19d09805d1b91a9687198869add6dfaa1/starlark/library.go#L990). It can be used using a flag.

tetromino commented 5 months ago

I'm not opposed to a set data type, although if we do add it, I think we ought to also support the literal {foo, bar, ...} syntax. I regularly field questions from new macro and rule authors asking for where is Starlark's set data type :) In Bazel, it would entail a bit of work to chase down the native methods that currently require a starlark list as argument but which could also take a set as argument where it makes sense. And of course we'd need clear documentation for when to use sets vs. depsets.

stepancheg commented 5 months ago

I think we ought to also support the literal {foo, bar, ...} syntax

This is not as easy decision: people constantly make mistakes by writing {} assuming it is an empty set (and in many cases it behaves like empty set, e.g. in in operator or iterator, making it harder to debug). And writing set([foo, bar, ...]) is OK-ish, and has no perf difference if optimized properly.

I suggest to have separate discussions/decisions for set() builtin, and {...} syntax for sets.

tomaspipes commented 4 months ago

Given the discussions ... is there any status on the set data type feature?

tetromino commented 4 months ago

@tomaspipes - given the positive reaction from the community here and elsewhere, I want to take a few days this summer to add a set data type to the Java implementation, and have it enabled in Bazel 8. The Go implementation already has set behind a flag, we may want to flip the flag to enabled.