fire-eggs / YAGP

"Yet Another GEDCOM Parser" - newer/faster/complete GEDCOM parser in C#
Apache License 2.0

Memory pig: repeated tag strings #38

Closed: fire-eggs closed this issue 6 years ago

fire-eggs commented 6 years ago

dotMemory profiling found repeated tag strings consuming memory.

For one file (TomsTreePriv.ged?): 15.15M wasted across 683,131 duplicate string objects for the tags SURN, _UID, GIVN, FAMS, FAMC, BIRT, DEAT.

Investigate replacing all tags with a single array of strings, so that each class containing a tag stores only an index into that array. Will an int (word? byte?) take less memory in a class instance?
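A minimal sketch of the index idea, for discussion only. `TagTable` and `IndexedRecord` are hypothetical names, not existing YAGP classes. On the memory question: on 64-bit .NET a string reference takes 8 bytes, an `int` 4 and a `byte` 1, but field alignment and padding can eat part of that difference. The larger saving probably comes from keeping one string object per distinct tag rather than from the narrower field.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: one shared table of tag strings, addressed by a byte index.
public sealed class TagTable
{
    private readonly List<string> _tags = new List<string>();
    private readonly Dictionary<string, byte> _index =
        new Dictionary<string, byte>(StringComparer.Ordinal);

    // Returns the index of a tag, adding the tag the first time it is seen.
    public byte IndexOf(string tag)
    {
        byte idx;
        if (_index.TryGetValue(tag, out idx))
            return idx;

        // A byte can address 256 entries; a wider type is needed beyond that.
        if (_tags.Count > byte.MaxValue)
            throw new InvalidOperationException("More than 256 distinct tags; widen the index type.");

        idx = (byte)_tags.Count;
        _tags.Add(tag);
        _index.Add(tag, idx);
        return idx;
    }

    public string TagAt(byte index) { return _tags[index]; }

    public int Count { get { return _tags.Count; } }
}

// Hypothetical record type: holds a 1-byte index instead of a string reference.
public sealed class IndexedRecord
{
    public readonly byte TagIndex;

    public IndexedRecord(byte tagIndex) { TagIndex = tagIndex; }
}
```

A record would be created with `new IndexedRecord(table.IndexOf("SURN"))`, and the tag text recovered with `table.TagAt(record.TagIndex)`. A byte index only works while the file uses at most 256 distinct tags, custom tags included, so a `ushort` may be the safer choice.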

NOTE: GEDSplitter.Tag should be reworked to return a TagIndex from a char[]. The goal is to avoid converting the existing char[] to a string and then doing a dictionary lookup. Is this a candidate for a trie?
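A rough sketch of what a trie lookup over a `char[]` slice could look like. `TagTrie` and its members are hypothetical, and the `(line, start, length)` signature is only an assumption about how GEDSplitter exposes the tag position. The point is that `Find` walks the characters in place and never allocates a string.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: maps a slice of a char[] to a tag index without building a string.
public sealed class TagTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public int TagIndex = -1; // -1 means no tag ends at this node
    }

    private readonly Node _root = new Node();
    private readonly List<string> _tags = new List<string>();

    // Registers a tag and returns its index; re-adding a tag returns the existing index.
    public int Add(string tag)
    {
        Node node = _root;
        foreach (char ch in tag)
        {
            Node next;
            if (!node.Children.TryGetValue(ch, out next))
            {
                next = new Node();
                node.Children.Add(ch, next);
            }
            node = next;
        }
        if (node.TagIndex < 0)
        {
            node.TagIndex = _tags.Count;
            _tags.Add(tag);
        }
        return node.TagIndex;
    }

    // Looks up the characters line[start .. start + length). Returns -1 for an unknown tag.
    public int Find(char[] line, int start, int length)
    {
        Node node = _root;
        for (int i = start; i < start + length; i++)
        {
            Node next;
            if (!node.Children.TryGetValue(line[i], out next))
                return -1;
            node = next;
        }
        return node.TagIndex;
    }

    public string TagAt(int index) { return _tags[index]; }
}
```

For the line `1 SURN Smith`, a call such as `trie.Find(line, 2, 4)` would return the index registered for `SURN`. Whether this beats a dictionary with a custom `char[]` comparer would need profiling; tags are short, so either approach may be fast enough.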

fire-eggs commented 6 years ago

GEDCOM.Net has a good concept. Tags, Names, Surnames, PlaceNames and idents are stored in a local utility class called IndexedKeyCollection. I.e. when a tag is encountered, the Tag collection is searched; if found, a reference to the string in the collection is returned. If not found, the 'new' tag is added to the collection.

This should mean that every usage of a given tag is a reference to the single string instance in the collection, instead of a separate copy. I need to verify this.
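One way to check the "single instance" claim is `ReferenceEquals`. The sketch below is not GEDCOM.Net's `IndexedKeyCollection` API; `StringPool` and `Intern` are made-up names that only imitate the find-or-add behaviour described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the find-or-add collection described above.
public sealed class StringPool
{
    private readonly Dictionary<string, string> _pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    // Returns the pooled instance equal to value, adding value if it is new.
    public string Intern(string value)
    {
        string existing;
        if (_pool.TryGetValue(value, out existing))
            return existing;

        _pool.Add(value, value);
        return value;
    }

    public int Count { get { return _pool.Count; } }
}

public static class StringPoolCheck
{
    // Returns true when two separately allocated copies collapse to one pooled instance.
    public static bool SharesInstance(string text)
    {
        var pool = new StringPool();

        // new string(char[]) allocates a fresh object each time, like a parser would.
        string first = new string(text.ToCharArray());
        string second = new string(text.ToCharArray());

        string pooledFirst = pool.Intern(first);
        string pooledSecond = pool.Intern(second);

        return !ReferenceEquals(first, second)
            && ReferenceEquals(pooledFirst, pooledSecond);
    }
}
```

If `StringPoolCheck.SharesInstance("SURN")` returns true, the two parsed copies were distinct objects and the pool handed back the same instance for both. The second copy still gets allocated before the lookup, so this removes the long-lived duplicates but not the temporary allocation.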

For YAGP, I think a Dictionary<char[], string> would work. At the GEDSplitter level, the components of the input line (tag and ident) are char[]. It would also be possible to initialize the tag collection with the standard/most-common tags.
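One caveat with `Dictionary<char[], string>`: arrays in .NET compare by reference by default, so two `char[]` with the same contents would count as different keys unless a custom `IEqualityComparer<char[]>` is supplied. A possible sketch follows; `CharArrayComparer` and `TagCache` are hypothetical names, and the list of pre-seeded tags is left to the caller.

```csharp
using System.Collections.Generic;

// Hypothetical comparer: treats two char[] as equal when their contents match.
public sealed class CharArrayComparer : IEqualityComparer<char[]>
{
    public bool Equals(char[] x, char[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;

        for (int i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i]) return false;
        }
        return true;
    }

    public int GetHashCode(char[] obj)
    {
        if (obj == null) return 0;

        // Simple multiplicative hash over the characters; overflow is intended.
        unchecked
        {
            int hash = 17;
            foreach (char c in obj)
                hash = hash * 31 + c;
            return hash;
        }
    }
}

public static class TagCache
{
    // Builds a content-keyed tag collection, pre-seeded with the given tags.
    public static Dictionary<char[], string> Create(params string[] commonTags)
    {
        var map = new Dictionary<char[], string>(new CharArrayComparer());
        foreach (string tag in commonTags)
            map[tag.ToCharArray()] = tag;
        return map;
    }
}
```

With `TagCache.Create("SURN", "GIVN", "FAMS")`, a lookup using a freshly parsed `char[]` finds the seeded string. The key must be a whole array of exactly the tag's length, so if the splitter only has an offset and length into a larger line buffer, a copy would still be needed for each lookup.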

One opportunity which GEDCOM.Net appears to miss is tracking duplicate Soundex strings. The same name/surname/placename would have the same Soundex. As YAGP doesn't provide Soundex (yet), this is not an immediate item.

fire-eggs commented 6 years ago

Concern: how thread-safe is the standard Dictionary?
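For reference: `Dictionary<TKey, TValue>` tolerates multiple concurrent readers as long as nothing modifies it, but it is not safe when any thread writes. If parsing ever runs on several threads, `ConcurrentDictionary<TKey, TValue>` with `GetOrAdd` is one option. The sketch below uses a hypothetical `ConcurrentTagPool` class and is not what the fix commit necessarily does.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: a find-or-add tag pool that is safe to call from several threads.
public sealed class ConcurrentTagPool
{
    private readonly ConcurrentDictionary<string, string> _pool =
        new ConcurrentDictionary<string, string>(StringComparer.Ordinal);

    // Returns the pooled instance for tag; the first instance stored wins.
    public string Intern(string tag)
    {
        return _pool.GetOrAdd(tag, tag);
    }

    public int Count { get { return _pool.Count; } }
}
```

`GetOrAdd` returns whichever instance ended up stored, so two threads racing on the same new tag still receive the same reference. If the file is parsed on a single thread, a plain `Dictionary` avoids the extra overhead.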

fire-eggs commented 6 years ago

Addressed via commit 9f3236d28227b2c9f79e825163f0dfa8ab1cdccf