Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0

Problem of SizeOf with large data #193

Closed Aunsiels closed 3 years ago

Aunsiels commented 4 years ago

Hello there,

I have an issue concerning the class https://github.com/Waikato/moa/blob/master/moa/src/main/java/moa/core/SizeOf.java . When asking for the fullSizeOf, you currently delegate the task to sizeof.agent.SizeOfAgent, and this seems to create a problem.

The first (minor) problem is that the link to the source code of this package is dead (http://www.jroller.com/maxim/entry/again_about_determining_size_of). I was able to find the source code on this page https://jar-download.com/artifacts/com.github.fracpete/sizeofag/1.0.2/source-code/sizeof/agent/SizeOfAgent.java and I am assuming this is the version you are using.

The second problem appears when asking for the fullSizeOf of a very big object. Internally, sizeof.agent.SizeOfAgent uses an IdentityHashMap (for reasons I do not understand), but this data structure has a maximum capacity of 2^29 (source: https://www.cc.gatech.edu/computing/pag/tmp/html_dir/java/util/IdentityHashMap.java.html), which creates a problem with big objects. As far as I can tell, the IdentityHashMap is used as a Set, so I guess replacing it with a Set would solve the problem. However, this would require integrating and modifying the external package sizeof.agent.SizeOfAgent.

fracpete commented 4 years ago

Interesting that you didn't use the Maven coordinates from the pom.xml to look up the sizeofag artifact on Maven Central, which would have pointed you to the source code repository. I'm not the author of the library, but the way I understand the use of the IdentityHashMap is that it ensures unique objects are inspected for their size. Classes can override the hashCode() and equals() methods, rendering the detection of whether an object has already been inspected useless (you would count far fewer objects than there are).

Aunsiels commented 4 years ago

True for the pom.xml. I did not see that the original code was on your GitHub. Still, the link to the original reference is dead, even in your repository.

The problem remains in this code: https://github.com/fracpete/sizeofag/blob/master/src/main/java/sizeof/agent/SizeOfAgent.java. I understand why the IdentityHashMap can be used here, but I do not understand why it must be a Map. An IdentityHashMap is a Map, but the Map aspect is completely ignored: the only methods used are visited.put(obj, null) (a trivial mapping), containsKey and clear. The IdentityHashMap is used as a Set. Why not use HashSet instead? You can replace visited.put(obj, null) with visited.add(obj), use contains instead of containsKey, and clear remains the same.
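To make the proposed substitution concrete, here is a minimal sketch of the API mapping (the variable name visited and the object being tracked are hypothetical stand-ins, not the actual sizeofag code):

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedSketch {
  public static void main(String[] args) {
    // instead of: IdentityHashMap visited = new IdentityHashMap();
    Set<Object> visited = new HashSet<>();

    Object obj = new Object();  // hypothetical object being traversed

    // instead of: if (!visited.containsKey(obj)) visited.put(obj, null);
    if (!visited.contains(obj)) {
      visited.add(obj);
    }
    System.out.println("visited: " + visited.size());

    visited.clear();  // unchanged
  }
}
```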

The gain might not be as large as it first appeared to me: HashSet uses HashMap, which has a maximum capacity of 2^30 (because capacities are int-based, but still twice that of IdentityHashMap), so the problem will remain once there are more than about 2^30 (roughly one billion) objects.

I do not know Java well enough to say what the correct data structure to use here would be. Most of the collection classes in the standard library seem to be constrained by the integer size. I cannot tell how often the problem appears in practice: one of my colleagues ran into it, so I thought it would be good to report.

fracpete commented 4 years ago

That the original author no longer has their website up and running is unfortunate, but there is nothing I can do about that.

Maybe linking to the IdentityHashMap's Javadoc wasn't enough. Let me illustrate the uniqueness problem with an example (including your suggested alternatives of sets and other maps):

import java.util.HashMap;
import java.util.HashSet;
import java.util.IdentityHashMap;

public class Identity {

  public static class A {

    private String name;

    public A(String name) {
      this.name = name;
    }

    public String getName() {
      return name;
    }

    @Override
    public boolean equals(Object other) {
      return (other instanceof A) && getName().equals(((A) other).getName());
    }

    @Override
    public int hashCode() {
      return name.hashCode();
    }
  }

  public static void main(String[] args) {
    // a1 and a2 are distinct objects, but equal according to equals()/hashCode()
    A a1 = new A("a");
    A a2 = new A("a");
    A a3 = new A("b");

    // keyed on reference identity (==): all three objects are kept
    IdentityHashMap<A, Object> im = new IdentityHashMap<>();
    im.put(a1, null);
    im.put(a2, null);
    im.put(a3, null);
    System.out.println("im: " + im.size());

    // keyed on equals()/hashCode(): a2 collapses into a1
    HashMap<A, Object> hm = new HashMap<>();
    hm.put(a1, null);
    hm.put(a2, null);
    hm.put(a3, null);
    System.out.println("hm: " + hm.size());

    // same equals()-based behaviour as HashMap
    HashSet<A> hs = new HashSet<>();
    hs.add(a1);
    hs.add(a2);
    hs.add(a3);
    System.out.println("hs: " + hs.size());
  }
}

The main method creates three objects and places them in an IdentityHashMap, a HashMap and a HashSet. Here is the output:

im: 3
hm: 2
hs: 2

As you can see, only the IdentityHashMap outputs the correct count.

IdentityHashMap is used by sizeofag because it was (and maybe still is?) the only data structure in the JDK for keeping track of objects by reference identity. It is a sort-of Map implementation that intentionally violates the general Map contract of comparing keys via (k1==null ? k2==null : k1.equals(k2)) by using (k1==k2) instead.
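As an aside, the JDK does offer a way to combine a Set interface with reference-identity semantics: Collections.newSetFromMap backed by an IdentityHashMap. This is just a sketch of an alternative, not what sizeofag actually does, and since the view is still backed by an IdentityHashMap it keeps the same 2^29 capacity limit:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class IdentitySetDemo {
  public static void main(String[] args) {
    // two distinct String objects that are equal according to equals()
    String s1 = new String("a");
    String s2 = new String("a");

    // Set view over an IdentityHashMap: membership is decided by ==, not equals()
    Set<Object> identitySet = Collections.newSetFromMap(new IdentityHashMap<>());
    identitySet.add(s1);
    identitySet.add(s2);
    System.out.println("identity set: " + identitySet.size()); // 2

    // an ordinary HashSet collapses the two equal objects into one entry
    Set<Object> plainSet = new HashSet<>();
    plainSet.add(s1);
    plainSet.add(s2);
    System.out.println("plain set: " + plainSet.size()); // 1
  }
}
```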

Aunsiels commented 4 years ago

Yes, but the problem remains: this method of computing the size does not scale to a large number of objects. You are limited to 2^29 objects. As I said, I do not know the solution.