Closed GoogleCodeExporter closed 8 years ago
Both the name and class comments of MutableWordList suggest that it should
provide mutator methods like add/remove; indeed the concrete subclasses provide
such methods. The MutableWordList interface should explicitly declare these
operations.
Original comment by marvin.addison@gmail.com
on 9 Aug 2010 at 7:48
Original comment by marvin.addison@gmail.com
on 9 Aug 2010 at 7:48
I believe the WordList abstraction is clean enough to support its use directly
in the Dictionary implementations. I believe it's asking for trouble to allow
WordList instances to be mutable, which the current implementation requires.
Additionally, copying the contents into a List<String> for StringListDictionary
seems redundant; why not simply operate on the WordList directly? For example,
there could be a Dictionary implementation that takes simply a WordList, and
the WordList implementation determines the data storage and search semantics.
The TernaryTreeDictionary could then be free to use the former implementation.
I'm fairly certain that the current ternary tree implementation affords little
to no benefits of mix-and-match, since some combinations (e.g. FileWordList)
would be unnatural if not dangerous.
Original comment by marvin.addison@gmail.com
on 9 Aug 2010 at 8:11
> Both the name and class comments of MutableWordList suggest that it should
provide mutator methods like add/remove; indeed the concrete subclasses provide
such methods. The MutableWordList interface should explicitly declare these
operations.
Since the mutator methods are inherited from the Collection interface I'm not
sure adding them specifically to this interface will make it's use any more
clear.
Particularly since it says nothing about it's iterator, which can also choose
to modify the collection.
Original comment by dfis...@gmail.com
on 10 Aug 2010 at 3:28
> I believe it's asking for trouble to allow WordList instances to be mutable,
which the current implementation requires.
WordLists need to be mutable for two reasons. Sorting and memory management.
(I think the current impls need some work to take full advantage of the
latter...)
> Additionally, copying the contents into a List<String> for
StringListDictionary seems redundant; why not simply operate on the WordList
directly?
Because the WordList implementations will not be as performant as using an
ArrayList directly.
However, the WordListDictionary does provide the ability to operate directly on
a WordList.
> I'm fairly certain that the current ternary tree implementation affords
little to no benefits of mix-and-match, since some combinations (e.g.
FileWordList) would be unnatural if not dangerous.
Since FileWordList isn't mutable, that combo isn't allowed.
The main goal here is to allow the developer to select the combo that provides
the desired performance and memory characteristics.
I don't think there are currently any dangerous combinations, just potentially
slow combinations.
Original comment by dfis...@gmail.com
on 10 Aug 2010 at 4:11
In order to fully explore and validate some of my design suggestions, I have
implemented them in the following branch for review,
http://code.google.com/p/vt-middleware/source/browse/vt-dictionary/branches/issu
e-88. Some highlights:
- WordList interface now describes immutable random-access list of words.
- Case sensitivity and sorting semantics are now tightly coupled.
- StringListDictionary has been removed because its functionality and performance characteristics can be replaced with WordListDictionary+ArrayWordList.
- Significant reduction in source size.
The refactoring revealed some interesting obstacles with the
TernaryTreeDictionary. It was my intention to provide case-insensitive
searching without normalizing the case of inserted words. The new
implementation works well for exact searches, but for the near and partial
searches the unit tests fail in an interesting way. Since the new
implementation does not normalize the case of the inserted words for
case-insensitive behavior, search results contain the case of the character
that was first inserted into the dictionary. A concrete example helps
illustrate this behavior:
Word List: case-insensitive web2
Expected near search results: [Jicaque, Jicaquean, jocoque, macaque, Xicaque]
Actual near search results: [Jicaque, Jicaquean, Jocoque, Macaque, Xicaque]
Note the strange mixed case of Jocoquoe and Macaque which are not proper nouns.
Again, this results from the insertion order of capital "J" words preceding
lowercase "j" words while the two are treated equivalent for the sake of
comparison. In general, near and partial search results will contain the case
of a character that was inserted first in the tree.
I believe the best implementation is one that returns the word from the source
word list, so the search results for the above test case would ideally be the
following:
[Jicaque, Jicaquean, jocoque, macaque, Xicaque]
Initial study of an implementation that supports the above behavior indicates
that it is not straightforward.
Original comment by marvin.addison@gmail.com
on 25 Aug 2010 at 2:40
Committed fix for mixed case problem in TernaryTreeDictionary in r1519 of
issue-88 branch.
Original comment by marvin.addison@gmail.com
on 25 Aug 2010 at 5:38
Initial comments:
- I like the simplicity of WordList, but making them immutable means they cannot be sorted. That's a bummer, but perhaps it adds too much complexity.
- FilePointerWordList should be removed as it's main value was using and sorting multiple files
- WordListUtils should provide a method that accepts multiple file readers to replace the functionality lost in FilePointerWordList
- I would rename WordListUtils to WordLists to conform with java naming
- It's a shame to have to implement a binary search, but I don't see any other alternative
- The inner class iterators on TernaryTreeDictionary should be moved to AbstractWordList
Integrating the notion of case sensitivity from the WordList to the Dictionary
has always been ugly.
It may be cleaner to provide a custom implementation of Comparator
(WordListComparator) to make that data exchange easier.
I'm thinking it would provide comparator functionality for both Strings and
Characters and also provide an extension point for users.
This would allow TernaryTreeDictionary to use the Comparator from the WordList
and drop the internal boolean setting, plus the code that is used to derive it.
Original comment by dfis...@gmail.com
on 27 Aug 2010 at 6:47
Testing with yourkit demonstrates that the new TernaryTree impl uses ~62MB
while the old impl uses ~36MB for the websters dictionary.
Original comment by dfis...@gmail.com
on 1 Sep 2010 at 2:23
I think it's worth testing whether storing a char[] instead of String in
TernaryNode makes any difference. I'm fairly certain object overhead is
non-trivial in this case. I think you'd still want to hide the internal
storage details, and keep get/setWord such that they deal with Strings.
Original comment by marvin.addison@gmail.com
on 1 Sep 2010 at 2:59
Changes merged into trunk.
Notable changes include:
- WordList impls throw IllegalArgumentException if the underlying words are not sorted according to their defined comparator
- Simplified binary search algorithm
- TernaryTree uses Comparator<Character> rather than a private method
- TernaryTree throws UnsupportedOperatonException for both partial and near searches when caseSensitive is false
- Added WordList#iterator() and WordList#medianIterator()
Original comment by dfis...@gmail.com
on 2 Sep 2010 at 3:23
Merge looks nice. I reviewed all the notable changes you mentioned, and I
would agree they're all improvements. The only thing I noted was that you
didn't merge my unit tests. That's understandable since they didn't conform,
but I believe the test strategy is an improvement we might consider for the
future.
Original comment by marvin.addison@gmail.com
on 3 Sep 2010 at 6:33
Original comment by dfis...@gmail.com
on 17 Sep 2010 at 7:03
Original issue reported on code.google.com by
marvin.addison@gmail.com
on 9 Aug 2010 at 7:44