Open Zhiyuan-Amos opened 6 years ago
So you are saying that we constructor a TreeMap
during run time and add the categories like what we are doing now, and Author
, Venue
and Title
are comparable so that when we insert into the TreeMap
they will automatically collide and merge? What is the difference between this and using a HashMap
, but with the appropriate comparable Author
, Venue
and Title
?
If we implement compareTo() and equals(), the HashMap
will also detect collisions for us right?
Wow, it actually sounds like it might work. I am shocked that I never considered this. haha
Actually maybe you are right, since HashMap uses hash code lol. My mistake..
Yup, HashMap requires us to override both equals(Object)
and hashCode()
. The contract between equals(Object)
and hashCode()
is such that if both objects are equal, they must generate the same hash code. But the problem is this:
public boolean equals(Object other) {
return StringUtil.compare(this, other); // a simplified version.
}
public int hashCode() {
return this.venue.hashCode();
}
There's no way to ensure that ICSE
and ICSE@ICSE
returns the same hash code. We can use the StringUtil
method to verify equality, but this violates the equals(Object)
and hashCode()
contract. That's why we cannot use HashMap.
Whereas for TreeMap, we just have to implement Comparable
(compareTo(T)
), don't need hashCode()
, so the above problem won't be relevant here.
We can actually create Comparator objects instead of encapsulating the Strings in objects. The pro is that we can easily change the method of comparison during run time (If needed) The con is that more classes..
This may solve the speed issue, but will it slow down other areas of the code? From O(1) -> O(lgn)
Also, this does not solve the problem of false positives. For example, I recently found out (while testing the application) that neuro
is a venue, and so is neuro report
. But they are likely different venues.
We can actually create Comparator objects instead of encapsulating the Strings in objects. The pro is that we can easily change the method of comparison during run time (If needed) The con is that more classes.
Yup Comparators will do as well. If you are doing both case-sensitive and case-insensitive, using Comparators will do the job cos you can only have 1 compareTo(T)
, but you can have multiple Comparators. We can define Comparators as constants public static final
in a constants class, so you don't need more classes for that. Yup, probably this way you don't need Author
, Venue
and Title
as well.
This may solve the speed issue, but will it slow down other areas of the code? From O(1) -> O(lgn)
Yup, it slows down at insertion from O(n) to O(nlgn) (about 17x slower where n = 200,000), but I think it doesn't affect other operations, because we usually iterate through the entire Map anyway (i.e. we iterate through all the values). E.g. in TrendCommand
, other operations such as filtering and remapping will iterate through entire Map, so there's no difference since we don't call HashMap#get(Object)
. Also, if we reduce O(n^2) to O(nlgn) at the expense of increasing insertion from O(n) to O(nlgn), it seems like a good trade-off.
Also, this does not solve the problem of false positives. For example, I recently found out (while testing the application) that neuro is a venue, and so is neuro report.
This one bobian haha.
This one bobian haha.
Ya lo, in addition, we can get very strange behavior with the algo I thought of previously (Split and compare) because stuff like neurology lovers
and neurology haters
will merge. Or even stuff like The Foo
and bar the bar
will merge also..
So actually, not just speed, I think the algo I came up with is wrong. I was thinking if we could use fuzzy matching or even suffix trees
@IanTeo, just fyi
Just an additional thought on whether we should have
Venue
,Author
andTitle
classes:The reason why we need to merge the keys is that we are mapping them as
String
s i.e.Venue
ICSE != ICSE@ICSE. However, if we map them asTreeMap<Venue, Collection<Paper>>
instead, wherebyVenue
,Author
&Title
implementsComparable
accordingly, then there's no need to merge them already, since when we can use theCollectors#toMap()
to resolve the collisions by joining theCollection<Paper>
. This shifts a large chunk of the logic from theCommand
s to theModel
, whereby theModel
simply needs to implementComparable
. That way, we won't face the issue of having to verify whether there are filters before deciding to merge because it will automatically be merged. The runtime will be O(nlgn) since insertion in a TreeMap is O(lgn), and we are inserting n elements. So, we can merge it (it's 17x slower, taking n = 200,000. Way faster than O(n^2) though) regardless of whether there are filters.Currently we have this code in
TopCommand
:This couples the
Command
with theModel
; theCommand
shouldn't know that if aPaper
doesn't have a venue, the default value is<venue unspecified>
. As such, if we have aVenue
class, we can call something likePaper#hasVenue()
, which then callsVenue#isExists()
. So the resulting code becomes something like:This way, you have a higher level of abstraction. For this, you can still do it without a
Venue
class sincehasVenue()
can callreturn venue.equals("")
as well, but that's cos in this case only the venue needs to be verified (i.e. we don't havehasTitle()
etc) so it doesn't bloat up thePaper
class so much.