Exploring the possibility of Venue, Author and Title classes

Zhiyuan-Amos commented 6 years ago

@IanTeo, just fyi

Just an additional thought on whether we should have Venue, Author and Title classes:

The reason why we need to merge the keys is that we are mapping them as Strings i.e. Venue ICSE != ICSE@ICSE. However, if we map them as TreeMap<Venue, Collection<Paper>> instead, whereby Venue, Author & Title implements Comparable accordingly, then there's no need to merge them already, since when we can use the Collectors#toMap() to resolve the collisions by joining the Collection<Paper>. This shifts a large chunk of the logic from the Commands to the Model, whereby the Model simply needs to implement Comparable. That way, we won't face the issue of having to verify whether there are filters before deciding to merge because it will automatically be merged. The runtime will be O(nlgn) since insertion in a TreeMap is O(lgn), and we are inserting n elements. So, we can merge it (it's 17x slower, taking n = 200,000. Way faster than O(n^2) though) regardless of whether there are filters.

Currently we have this code in TopCommand:

if (category == Category.VENUE) {
papers = CollectionUtility.removeFromCollection(papers,
        paper -> !paper.getVenue().equals("<venue unspecified>"));
}

This couples the Command with the Model; the Command shouldn't know that if a Paper doesn't have a venue, the default value is <venue unspecified>. As such, if we have a Venue class, we can call something like Paper#hasVenue(), which then calls Venue#isExists(). So the resulting code becomes something like:

// In TopCommand
if (category == Category.VENUE) {
    papers = CollectionUtility.removeFromCollection(papers, paper -> !paper.hasVenue());
}

// In Paper class
public boolean hasVenue() {
    return venue.isExists();
}

// In Venue
public boolean isExists() {
    return venue.equals(""); // the internal representation can be either "" or "<venue unspecified>", but I chose "" here
}

This way, you have a higher level of abstraction. For this, you can still do it without a Venue class since hasVenue() can call return venue.equals("") as well, but that's cos in this case only the venue needs to be verified (i.e. we don't have hasTitle() etc) so it doesn't bloat up the Paper class so much.

IanTeo commented 6 years ago

So you are saying that we constructor a TreeMap during run time and add the categories like what we are doing now, and Author, Venue and Title are comparable so that when we insert into the TreeMap they will automatically collide and merge? What is the difference between this and using a HashMap, but with the appropriate comparable Author, Venue and Title?

If we implement compareTo() and equals(), the HashMap will also detect collisions for us right?

Wow, it actually sounds like it might work. I am shocked that I never considered this. haha

IanTeo commented 6 years ago

Actually maybe you are right, since HashMap uses hash code lol. My mistake..

Zhiyuan-Amos commented 6 years ago

Yup, HashMap requires us to override both equals(Object) and hashCode(). The contract between equals(Object) and hashCode() is such that if both objects are equal, they must generate the same hash code. But the problem is this:

public boolean equals(Object other) {
    return StringUtil.compare(this, other); // a simplified version.
}

public int hashCode() {
    return this.venue.hashCode();
}

There's no way to ensure that ICSE and ICSE@ICSE returns the same hash code. We can use the StringUtil method to verify equality, but this violates the equals(Object) and hashCode() contract. That's why we cannot use HashMap.

Whereas for TreeMap, we just have to implement Comparable (compareTo(T)), don't need hashCode(), so the above problem won't be relevant here.

IanTeo commented 6 years ago

We can actually create Comparator objects instead of encapsulating the Strings in objects. The pro is that we can easily change the method of comparison during run time (If needed) The con is that more classes..

This may solve the speed issue, but will it slow down other areas of the code? From O(1) -> O(lgn) Also, this does not solve the problem of false positives. For example, I recently found out (while testing the application) that neuro is a venue, and so is neuro report. But they are likely different venues.

Zhiyuan-Amos commented 6 years ago

We can actually create Comparator objects instead of encapsulating the Strings in objects. The pro is that we can easily change the method of comparison during run time (If needed) The con is that more classes.

Yup Comparators will do as well. If you are doing both case-sensitive and case-insensitive, using Comparators will do the job cos you can only have 1 compareTo(T), but you can have multiple Comparators. We can define Comparators as constants public static final in a constants class, so you don't need more classes for that. Yup, probably this way you don't need Author, Venue and Title as well.

This may solve the speed issue, but will it slow down other areas of the code? From O(1) -> O(lgn)

Yup, it slows down at insertion from O(n) to O(nlgn) (about 17x slower where n = 200,000), but I think it doesn't affect other operations, because we usually iterate through the entire Map anyway (i.e. we iterate through all the values). E.g. in TrendCommand, other operations such as filtering and remapping will iterate through entire Map, so there's no difference since we don't call HashMap#get(Object). Also, if we reduce O(n^2) to O(nlgn) at the expense of increasing insertion from O(n) to O(nlgn), it seems like a good trade-off.

Also, this does not solve the problem of false positives. For example, I recently found out (while testing the application) that neuro is a venue, and so is neuro report.

This one bobian haha.

IanTeo commented 6 years ago

This one bobian haha.

Ya lo, in addition, we can get very strange behavior with the algo I thought of previously (Split and compare) because stuff like neurology lovers and neurology haters will merge. Or even stuff like The Foo and bar the bar will merge also..

So actually, not just speed, I think the algo I came up with is wrong. I was thinking if we could use fuzzy matching or even suffix trees

IanTeo / cs3219-Project

Exploring the possibility of Venue, Author and Title classes #48