vt-dictionary Code Review Request

GoogleCodeExporter commented 8 years ago

Review changes to support alternative dictionary implementations and provide 
feedback.

Original issue reported on code.google.com by marvin.addison@gmail.com on 9 Aug 2010 at 7:44

GoogleCodeExporter commented 8 years ago

Both the name and class comments of MutableWordList suggest that it should 
provide mutator methods like add/remove; indeed the concrete subclasses provide 
such methods.  The MutableWordList interface should explicitly declare these 
operations.

Original comment by marvin.addison@gmail.com on 9 Aug 2010 at 7:48

GoogleCodeExporter commented 8 years ago

Original comment by marvin.addison@gmail.com on 9 Aug 2010 at 7:48

Changed state: Started

GoogleCodeExporter commented 8 years ago

I believe the WordList abstraction is clean enough to support its use directly 
in the Dictionary implementations.  I believe it's asking for trouble to allow 
WordList instances to be mutable, which the current implementation requires.  
Additionally, copying the contents into a List<String> for StringListDictionary 
seems redundant; why not simply operate on the WordList directly?  For example, 
there could be a Dictionary implementation that takes simply a WordList, and 
the WordList implementation determines the data storage and search semantics.  
The TernaryTreeDictionary could then be free to use the former implementation.  
I'm fairly certain that the current ternary tree implementation affords little 
to no benefits of mix-and-match, since some combinations (e.g. FileWordList) 
would be unnatural if not dangerous.

Original comment by marvin.addison@gmail.com on 9 Aug 2010 at 8:11

GoogleCodeExporter commented 8 years ago

> Both the name and class comments of MutableWordList suggest that it should 
provide mutator methods like add/remove; indeed the concrete subclasses provide 
such methods.  The MutableWordList interface should explicitly declare these 
operations.

Since the mutator methods are inherited from the Collection interface I'm not 
sure adding them specifically to this interface will make it's use any more 
clear.
Particularly since it says nothing about it's iterator, which can also choose 
to modify the collection.

Original comment by dfis...@gmail.com on 10 Aug 2010 at 3:28

GoogleCodeExporter commented 8 years ago

> I believe it's asking for trouble to allow WordList instances to be mutable, 
which the current implementation requires.  

WordLists need to be mutable for two reasons. Sorting and memory management.
(I think the current impls need some work to take full advantage of the 
latter...)

> Additionally, copying the contents into a List<String> for 
StringListDictionary seems redundant; why not simply operate on the WordList 
directly?

Because the WordList implementations will not be as performant as using an 
ArrayList directly.
However, the WordListDictionary does provide the ability to operate directly on 
a WordList.

> I'm fairly certain that the current ternary tree implementation affords 
little to no benefits of mix-and-match, since some combinations (e.g. 
FileWordList) would be unnatural if not dangerous.

Since FileWordList isn't mutable, that combo isn't allowed.
The main goal here is to allow the developer to select the combo that provides 
the desired performance and memory characteristics.
I don't think there are currently any dangerous combinations, just potentially 
slow combinations.

Original comment by dfis...@gmail.com on 10 Aug 2010 at 4:11

GoogleCodeExporter commented 8 years ago

In order to fully explore and validate some of my design suggestions, I have 
implemented them in the following branch for review, 
http://code.google.com/p/vt-middleware/source/browse/vt-dictionary/branches/issu
e-88.  Some highlights:

 - WordList interface now describes immutable random-access list of words.
 - Case sensitivity and sorting semantics are now tightly coupled.
 - StringListDictionary has been removed because its functionality and performance characteristics can be replaced with WordListDictionary+ArrayWordList.
 - Significant reduction in source size.

The refactoring revealed some interesting obstacles with the 
TernaryTreeDictionary.  It was my intention to provide case-insensitive 
searching without normalizing the case of inserted words.  The new 
implementation works well for exact searches, but for the near and partial 
searches the unit tests fail in an interesting way.  Since the new 
implementation does not normalize the case of the inserted words for 
case-insensitive behavior, search results contain the case of the character 
that was first inserted into the dictionary.  A concrete example helps 
illustrate this behavior:

Word List: case-insensitive web2
Expected near search results: [Jicaque, Jicaquean, jocoque, macaque, Xicaque]
Actual near search results: [Jicaque, Jicaquean, Jocoque, Macaque, Xicaque]

Note the strange mixed case of Jocoquoe and Macaque which are not proper nouns. 
 Again, this results from the insertion order of capital "J" words preceding 
lowercase "j" words while the two are treated equivalent for the sake of 
comparison.  In general, near and partial search results will contain the case 
of a character that was inserted first in the tree.

I believe the best implementation is one that returns the word from the source 
word list, so the search results for the above test case would ideally be the 
following:

[Jicaque, Jicaquean, jocoque, macaque, Xicaque]

Initial study of an implementation that supports the above behavior indicates 
that it is not straightforward.

Original comment by marvin.addison@gmail.com on 25 Aug 2010 at 2:40

GoogleCodeExporter commented 8 years ago

Committed fix for mixed case problem in TernaryTreeDictionary in r1519 of 
issue-88 branch.

Original comment by marvin.addison@gmail.com on 25 Aug 2010 at 5:38

GoogleCodeExporter commented 8 years ago

Initial comments:
  - I like the simplicity of WordList, but making them immutable means they cannot be sorted. That's a bummer, but perhaps it adds too much complexity.
  - FilePointerWordList should be removed as it's main value was using and sorting multiple files
  - WordListUtils should provide a method that accepts multiple file readers to replace the functionality lost in FilePointerWordList
  - I would rename WordListUtils to WordLists to conform with java naming
  - It's a shame to have to implement a binary search, but I don't see any other alternative
  - The inner class iterators on TernaryTreeDictionary should be moved to AbstractWordList

Integrating the notion of case sensitivity from the WordList to the Dictionary 
has always been ugly.
It may be cleaner to provide a custom implementation of Comparator 
(WordListComparator) to make that data exchange easier.
I'm thinking it would provide comparator functionality for both Strings and 
Characters and also provide an extension point for users.
This would allow TernaryTreeDictionary to use the Comparator from the WordList 
and drop the internal boolean setting, plus the code that is used to derive it.

Original comment by dfis...@gmail.com on 27 Aug 2010 at 6:47

GoogleCodeExporter commented 8 years ago

Testing with yourkit demonstrates that the new TernaryTree impl uses ~62MB 
while the old impl uses ~36MB for the websters dictionary.

Original comment by dfis...@gmail.com on 1 Sep 2010 at 2:23

GoogleCodeExporter commented 8 years ago

I think it's worth testing whether storing a char[] instead of String in 
TernaryNode makes any difference.  I'm fairly certain object overhead is 
non-trivial in this case.  I think you'd still want to hide the internal 
storage details, and keep get/setWord such that they deal with Strings.

Original comment by marvin.addison@gmail.com on 1 Sep 2010 at 2:59

GoogleCodeExporter commented 8 years ago

Changes merged into trunk.
Notable changes include:
  - WordList impls throw IllegalArgumentException if the underlying words are not sorted according to their defined comparator
  - Simplified binary search algorithm
  - TernaryTree uses Comparator<Character> rather than a private method
  - TernaryTree throws UnsupportedOperatonException for both partial and near searches when caseSensitive is false
  - Added WordList#iterator() and WordList#medianIterator()

Original comment by dfis...@gmail.com on 2 Sep 2010 at 3:23

GoogleCodeExporter commented 8 years ago

Merge looks nice.  I reviewed all the notable changes you mentioned, and I 
would agree they're all improvements.  The only thing I noted was that you 
didn't merge my unit tests.  That's understandable since they didn't conform, 
but I believe the test strategy is an improvement we might consider for the 
future.

Original comment by marvin.addison@gmail.com on 3 Sep 2010 at 6:33

GoogleCodeExporter commented 8 years ago

Original comment by dfis...@gmail.com on 17 Sep 2010 at 7:03

Changed state: Done

DanElbert / vt-middleware

vt-dictionary Code Review Request #88