Open FroMage opened 10 years ago
collections?
I suppose so.
What do people typically use BitSet
s for, BTW?
As an untypesafe alternative to Set<EnumType>
?
I think I only used them once myself when I needed a really large number of Booleans, but BitSet in Java allows for boolean operations and such, so it's like a kind of dynamic length binary "number" (in quotes because you can't do anything numeric).
BitSet
s are very useful for filtering. You can generate a bit set from a list of objects or whatever; every bit represents an element from the original list, and the set bits represent elements that meet a certain condition. You can then have several bit sets from the same list, each one representing different properties from those elements, and do bitwise operations to find which elements have certain properties.
Before you say "oh but you could just iterate once over the elements with a function that filters out all the conditions you need", well yeah but sometimes with large unchanging datasets you just generate a bit set once and store it, and then when a new filter is needed you generate a new bit set that you can use along with the ones you already have.
A possible use-case from personal experience could have been for the parsing of message packets sent by radio (like AIS (http://en.wikipedia.org/wiki/Automatic_Identification_System). They came in variable length packets (I think the largest was slightly over 300 bits long) filled with highly packed data like: bits 0-6 vessel id, bits 7-9 vessel type, bits 10-14 length in feet * 10 (etc etc) I didn't use this class because it's too focused on single bits while I had to deal in groups of bits so I wrote something myself, but things like this might be a possible use case-case.
In a project, I wanted to use BitSet to do such filtering on a large dataset. It was abandoned because of poor performance. Later, I found out there is an alternative to this data type: EWAH. There is a Java implementation claiming high performance, using a non-patented algorithm, with Apache license... https://code.google.com/p/javaewah/
The JDK name BitSet is highly misleading. It's a set of integers (actually natural numbers), usually small ones since memory is proportional to the max integer used. 'Bit' just refers to the private implementation. So better names would be: IntegerSet? NaturalSet? SmallIntSet?
It should implement MutableSet<Integer> and SortedSet<Integer>. Methods from in Java's BitSet: void set(int bitIndex) should become Boolean add(Integer); boolean get(int bitIndex) should become Boolean contains(Integer); void or(BitSet that) should be Boolean addAll({Integer*} that) and so on. You get the idea...
@JHFinn It depends if the proposal is to reimplement Java's BitSet or if we only want a real "bit set", or large list of booleans. I think the latter is the scope of this proposal, and I see it as the most useful use case.
@PhiLhoSoft I see what you mean: Java's BitSet could also be interpreted as List<Boolean>. I haven't given this much thought, but a SmallIntSet and a BooleanList are different concepts with different APIs (e.g. Boolean add(Integer) versus Boolean add(Boolean)). Maybe we should have both. Neither is a 'BitSet' which, if it means anything, would be one of four possible sets: the empty set {}, {0}, {1}, or {0, 1}.
Well, BitSets are usually useful in place of using bitmasks and bitshifts, which are limited in readability and the storage size of the int
or long
you store them into. A BitSet is a set of "named" (or indexed) bits that are set or not, not a set of a single 0/1.
For example, a BitSet is typically used in place of the following:
class Operations{
final static int ADD = 1 << 0;
final static int MUL = 1 << 1;
final static int SUB = 1 << 2;
final static int DIV = 1 << 3;
int ops; // this is a BitSet of 32 length max
public Operations(int ops){ this.ops = ops; }
public boolean isAdd(){
return (ops & ADD) != 0; // checks if we have the "add" bit set
}
}
Is that clearer?
I suppose your confusion is that you view "bit" as a value of 0/1, while for people using BitSets they view a bit as the compound storage units of a sequence of bytes. For example, a byte is a sequence of 8 different bits. We could just as well call it a BitList or BitMap, but that's like trying to rename Map into something else.
My main objection is to the name BitSet, because that strongly implies it's a Set<Bit> (meaning Set<Boolean> ??), which it is not. Your example is logically a Set<Integer> where ADD is 0, MUL is 1, SUB is 2, DIV is 3. That's a perfectly reasonable way to use this class, and a name like IntegerSet / NaturalSet / SmallIntSet would be appropriate. To check if ADD is present, you'd use the normal set operation contains(ADD).
Well, no, if anything, it's a Map<Integer,Boolean>
. But it's been called bit-set since the dawn of computing. It can logically be viewed as a set too, since it can either contain the ADD
bit, or not, but not multiple times.
In my main project, I use Java's BitSet to track ID numbers of objects I read from XML; I print a warning if an object has an ID number that has already been used, and if the user asks the software to do something that creates an object, it sets that object's ID number to the lowest unused number; with about 280,000 IDs in practice, and about 12% above 1,000,000, I find the ability to find the first unused "bit" very valuable. In my port of this project to Ceylon I've just continued to use the BitSet from the JDK, but I would welcome a pure-Ceylon implementation.
An interesting story about a real world usage of BitSet: https://plumbr.eu/blog/performance-blog/squeezing-data-into-the-data-structure
I thought it was worth sharing... :smile:
A couple thoughts:
BitSet
seems like a good name to me. No one's going to really think it's a Set<Bit>
, and calling it IntegerSet
hides its special purpose nature. For example, you might think it's ok to include maxIntegerValue
in an IntegerSet
, but not a BitSet
.I created a basic BitSet satisfies Set<Integer>
implementation in terms of Whole
here: https://github.com/jvasileff/ceylon-bitset
It would be more efficient to implement this with an array of Integer64
, or even better, Integer32
, but we don't have either of those in the SDK.
Now, despite offering an implementation, I'm not completely sold on its value, so I'm not going to create a PR unless someone asks. And, I'm not sure where this class would belong. You wouldn't want to import ceylon.collection
if all you need is BitSet
, and further, adding ceylon.whole
to ceylon.colleciton
doesn't sound great.
Bitsets are often used as an implementation detail of bloom filters.
Just a thought.
Something similar to what's in the JDK, not sure which module it would belong to, collections?