fire-eggs closed 6 years ago
GEDCOM.Net has a good concept: Tags, Names, Surnames, PlaceNames and idents are stored in a local utility class called IndexedKeyCollection. When a tag is encountered, the Tag collection is searched; if the tag is found, a reference to the string in the collection is returned. If it is not found, the new tag is added to the collection.
This should mean that every use of a given tag refers to the single string instance in the collection instead of a separate copy. I still need to verify this.
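For YAGP, I think a Dictionary<char[], string> would work, provided it is given a custom IEqualityComparer<char[]>: by default arrays are compared by reference, so two char[] with the same contents would not match. At the GEDSplitter level, the components of the input line (tag and ident) are char[]. It would also be possible to initialize the tag collection with the standard/most-common tags.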
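A minimal sketch of that idea, assuming each line component arrives as its own char[]. The names CharArrayComparer and StringCache are hypothetical and are not taken from YAGP or GEDCOM.Net. The comparer matters because a Dictionary keyed on char[] compares keys by reference unless told otherwise.

```csharp
using System;
using System.Collections.Generic;

// Compares char[] keys by content; arrays otherwise compare by reference.
public sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    public bool Equals(char[] a, char[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null || a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i]) return false;
        return true;
    }

    public int GetHashCode(char[] key)
    {
        // Simple multiplicative hash over the characters.
        unchecked
        {
            int h = 17;
            foreach (char c in key) h = h * 31 + c;
            return h;
        }
    }
}

// Hypothetical cache (an assumption for illustration, not an existing class).
public sealed class StringCache
{
    private readonly Dictionary<char[], string> _cache =
        new Dictionary<char[], string>(new CharArrayComparer());

    // Optionally pre-load the standard/most-common tags.
    public StringCache(params string[] commonTags)
    {
        foreach (string tag in commonTags)
            _cache[tag.ToCharArray()] = tag;
    }

    // Returns the single shared string for these characters, adding it on first sight.
    public string GetOrAdd(char[] chars)
    {
        if (_cache.TryGetValue(chars, out string existing))
            return existing;
        string created = new string(chars);
        // Store a copy of the key so later changes to the caller's buffer cannot corrupt the table.
        _cache[(char[])chars.Clone()] = created;
        return created;
    }
}
```

Two calls to GetOrAdd with equal contents should then return the same string instance, which can be checked with object.ReferenceEquals.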
One opportunity which GEDCOM.Net appears to miss is tracking duplicate Soundex strings. The same name/surname/placename always has the same Soundex, so those strings could be shared as well. As YAGP doesn't provide Soundex (yet), this is not an immediate item.
Concern: how thread-safe is the standard Dictionary?
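On that concern: the .NET documentation describes Dictionary<TKey,TValue> as able to support multiple concurrent readers only as long as the collection is not modified, so a cache that adds entries during parsing needs a lock or a concurrent collection. A possible sketch using ConcurrentDictionary follows. The class name ConcurrentStringCache is hypothetical, and string keys are used here only to keep the example short.

```csharp
using System.Collections.Concurrent;

// Hypothetical thread-safe variant of the string cache (an assumption, not existing code).
public sealed class ConcurrentStringCache
{
    private readonly ConcurrentDictionary<string, string> _cache =
        new ConcurrentDictionary<string, string>();

    // Returns the first instance stored for this value; safe to call from several threads.
    public string Intern(string value) => _cache.GetOrAdd(value, value);
}
```

If parsing is single-threaded, or each thread keeps its own cache, a plain Dictionary would be enough.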
Addressed via commit 9f3236d28227b2c9f79e825163f0dfa8ab1cdccf
dotMemory profiling found repeated tag strings consuming memory.
For one file (TomsTreePriv.ged?): 15.15M wasted in 683,131 objects: SURN, _UID, GIVN, FAMS, FAMC, BIRT, DEAT
Investigate replacing all tags with a single array of strings, with each class that contains a tag storing only an index into that array. Would an int (or a word or byte) take less memory in a class instance than a string reference?
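A rough sketch of the index idea, with hypothetical names (TagTable, Record). In a 64-bit process a string reference is 8 bytes, an int 4 and a byte 1, but field alignment and padding can cancel out part of the saving, so this would need to be measured again with dotMemory. A byte index also caps the table at 256 distinct tags, which custom tags such as _UID could exceed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical table: each distinct tag is stored once, records keep only its index.
public static class TagTable
{
    private static readonly List<string> Tags = new List<string>();
    private static readonly Dictionary<string, byte> Index = new Dictionary<string, byte>();

    // Returns the index for a tag, assigning the next free index if the tag is new.
    public static byte GetIndex(string tag)
    {
        if (Index.TryGetValue(tag, out byte i)) return i;
        if (Tags.Count > byte.MaxValue)
            throw new InvalidOperationException("More than 256 distinct tags; widen the index type.");
        i = (byte)Tags.Count;
        Tags.Add(tag);
        Index.Add(tag, i);
        return i;
    }

    public static string GetTag(byte index) => Tags[index];
}

// Hypothetical record: one byte per tag instead of a string reference.
public class Record
{
    public byte TagIndex;
    public string Tag => TagTable.GetTag(TagIndex);
}
```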
NOTE: GEDSplitter.Tag should be reworked to return a TagIndex from a char[]. This needs to avoid converting the existing char[] to a string and then doing a dictionary lookup. Is this a candidate for a Trie?
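One way the Trie could look, as a sketch only. TagTrie and its members are hypothetical, and the sketch assumes the tag is available as a start/length range inside the line's char[]. The lookup walks the characters directly, so a string is allocated only the first time a tag is seen.

```csharp
using System.Collections.Generic;

// Hypothetical trie mapping tag characters to an index, without creating a string per lookup.
public sealed class TagTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public int TagIndex = -1; // -1 means no tag ends at this node
    }

    private readonly Node _root = new Node();
    private readonly List<string> _tags = new List<string>();

    // Looks up buffer[start .. start + length), adding the tag if it is new.
    public int GetIndex(char[] buffer, int start, int length)
    {
        Node node = _root;
        for (int i = start; i < start + length; i++)
        {
            if (!node.Children.TryGetValue(buffer[i], out Node next))
            {
                next = new Node();
                node.Children.Add(buffer[i], next);
            }
            node = next;
        }
        if (node.TagIndex < 0)
        {
            // A string is allocated only the first time a tag is seen.
            node.TagIndex = _tags.Count;
            _tags.Add(new string(buffer, start, length));
        }
        return node.TagIndex;
    }

    public string GetTag(int index) => _tags[index];
}
```

For a line such as "1 BIRT" held in a char[], GetIndex(line, 2, 4) would return the index for BIRT. Whether this beats a Dictionary with a content-based comparer is something to profile, since each trie node carries its own small dictionary.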