eclipse-archived / ceylon-sdk

Standard platform modules belonging to the Ceylon SDK
http://www.ceylon-lang.org
Apache License 2.0
72 stars 60 forks source link

Add BitSet #177

Open FroMage opened 10 years ago

FroMage commented 10 years ago

Something similar to what's in the JDK, not sure which module it would belong to, collections?

gavinking commented 10 years ago

collections?

I suppose so.

What do people typically use BitSets for, BTW?

gavinking commented 10 years ago

As an untypesafe alternative to Set<EnumType>?

quintesse commented 10 years ago

I think I only used them once myself when I needed a really large number of Booleans, but BitSet in Java allows for boolean operations and such, so it's like a kind of dynamic length binary "number" (in quotes because you can't do anything numeric).

chochos commented 10 years ago

BitSets are very useful for filtering. You can generate a bit set from a list of objects or whatever; every bit represents an element from the original list, and the set bits represent elements that meet a certain condition. You can then have several bit sets from the same list, each one representing different properties from those elements, and do bitwise operations to find which elements have certain properties.

Before you say "oh but you could just iterate once over the elements with a function that filters out all the conditions you need", well yeah but sometimes with large unchanging datasets you just generate a bit set once and store it, and then when a new filter is needed you generate a new bit set that you can use along with the ones you already have.

quintesse commented 10 years ago

A possible use-case from personal experience could have been for the parsing of message packets sent by radio (like AIS (http://en.wikipedia.org/wiki/Automatic_Identification_System). They came in variable length packets (I think the largest was slightly over 300 bits long) filled with highly packed data like: bits 0-6 vessel id, bits 7-9 vessel type, bits 10-14 length in feet * 10 (etc etc) I didn't use this class because it's too focused on single bits while I had to deal in groups of bits so I wrote something myself, but things like this might be a possible use case-case.

PhiLhoSoft commented 10 years ago

In a project, I wanted to use BitSet to do such filtering on a large dataset. It was abandoned because of poor performance. Later, I found out there is an alternative to this data type: EWAH. There is a Java implementation claiming high performance, using a non-patented algorithm, with Apache license... https://code.google.com/p/javaewah/

JHFinn commented 9 years ago

The JDK name BitSet is highly misleading. It's a set of integers (actually natural numbers), usually small ones since memory is proportional to the max integer used. 'Bit' just refers to the private implementation. So better names would be: IntegerSet? NaturalSet? SmallIntSet?

JHFinn commented 9 years ago

It should implement MutableSet<Integer> and SortedSet<Integer>. Methods from in Java's BitSet: void set(int bitIndex) should become Boolean add(Integer); boolean get(int bitIndex) should become Boolean contains(Integer); void or(BitSet that) should be Boolean addAll({Integer*} that) and so on. You get the idea...

PhiLhoSoft commented 9 years ago

@JHFinn It depends if the proposal is to reimplement Java's BitSet or if we only want a real "bit set", or large list of booleans. I think the latter is the scope of this proposal, and I see it as the most useful use case.

JHFinn commented 9 years ago

@PhiLhoSoft I see what you mean: Java's BitSet could also be interpreted as List<Boolean>. I haven't given this much thought, but a SmallIntSet and a BooleanList are different concepts with different APIs (e.g. Boolean add(Integer) versus Boolean add(Boolean)). Maybe we should have both. Neither is a 'BitSet' which, if it means anything, would be one of four possible sets: the empty set {}, {0}, {1}, or {0, 1}.

FroMage commented 9 years ago

Well, BitSets are usually useful in place of using bitmasks and bitshifts, which are limited in readability and the storage size of the int or long you store them into. A BitSet is a set of "named" (or indexed) bits that are set or not, not a set of a single 0/1.

For example, a BitSet is typically used in place of the following:

class Operations{
 final static int ADD = 1 << 0;
 final static int MUL = 1 << 1;
 final static int SUB = 1 << 2;
 final static int DIV = 1 << 3;

 int ops; // this is a BitSet of 32 length max

 public Operations(int ops){ this.ops = ops; }

 public boolean isAdd(){
  return (ops & ADD) != 0; // checks if we have the "add" bit set
 }
}

Is that clearer?

FroMage commented 9 years ago

I suppose your confusion is that you view "bit" as a value of 0/1, while for people using BitSets they view a bit as the compound storage units of a sequence of bytes. For example, a byte is a sequence of 8 different bits. We could just as well call it a BitList or BitMap, but that's like trying to rename Map into something else.

JHFinn commented 9 years ago

My main objection is to the name BitSet, because that strongly implies it's a Set<Bit> (meaning Set<Boolean> ??), which it is not. Your example is logically a Set<Integer> where ADD is 0, MUL is 1, SUB is 2, DIV is 3. That's a perfectly reasonable way to use this class, and a name like IntegerSet / NaturalSet / SmallIntSet would be appropriate. To check if ADD is present, you'd use the normal set operation contains(ADD).

FroMage commented 9 years ago

Well, no, if anything, it's a Map<Integer,Boolean>. But it's been called bit-set since the dawn of computing. It can logically be viewed as a set too, since it can either contain the ADD bit, or not, but not multiple times.

kingjon3377 commented 9 years ago

In my main project, I use Java's BitSet to track ID numbers of objects I read from XML; I print a warning if an object has an ID number that has already been used, and if the user asks the software to do something that creates an object, it sets that object's ID number to the lowest unused number; with about 280,000 IDs in practice, and about 12% above 1,000,000, I find the ability to find the first unused "bit" very valuable. In my port of this project to Ceylon I've just continued to use the BitSet from the JDK, but I would welcome a pure-Ceylon implementation.

PhiLhoSoft commented 9 years ago

An interesting story about a real world usage of BitSet: https://plumbr.eu/blog/performance-blog/squeezing-data-into-the-data-structure

I thought it was worth sharing... :smile:

jvasileff commented 8 years ago

A couple thoughts:

I created a basic BitSet satisfies Set<Integer> implementation in terms of Whole here: https://github.com/jvasileff/ceylon-bitset

It would be more efficient to implement this with an array of Integer64, or even better, Integer32, but we don't have either of those in the SDK.

Now, despite offering an implementation, I'm not completely sold on its value, so I'm not going to create a PR unless someone asks. And, I'm not sure where this class would belong. You wouldn't want to import ceylon.collection if all you need is BitSet, and further, adding ceylon.whole to ceylon.colleciton doesn't sound great.

luolong commented 8 years ago

Bitsets are often used as an implementation detail of bloom filters.

Just a thought.