NeowayLabs / neosearch

Full Text Search Library

Document format on storage [document.db] #1

Open i4ki opened 9 years ago

i4ki commented 9 years ago

The format in which the original document is stored has a major performance issue. At the moment we don't optimize anything: the document value is stored as a []byte JSON string. As a result, we need to call json.Unmarshal every time we need the document content, and this operation can be very expensive.

https://github.com/NeowayLabs/neosearch/blob/master/index/index.go#L168

Some ideas that we can try:

  1. Use the gob package to store the document in Go's native binary encoding. Much like write(fd, &myStruct, sizeof(myStruct)) in C.
  2. Store the fields separately.
    • key=1.id, value=1
    • key=1.name, value=Plan9 Operating System
    • key=1.authors.0, value=Ken Thompson
    • key=1.authors.1, value=Dennis Ritchie
    • and so on for 2.id, 2.name, etc...
  3. Others?
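
To make the first idea concrete, here is a minimal sketch of a gob round trip. The Document struct and the encodeGob/decodeGob helpers are hypothetical names for illustration only and are not part of neosearch. Real documents come from arbitrary JSON, so a fixed struct like this is an assumption, and dynamically typed values stored behind interfaces may need gob.Register.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Document is a hypothetical document shape used only for this sketch.
type Document struct {
	ID      int
	Name    string
	Authors []string
}

// encodeGob serializes doc into the gob binary format.
func encodeGob(doc Document) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(doc); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeGob restores a Document from bytes produced by encodeGob.
func decodeGob(data []byte) (Document, error) {
	var doc Document
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&doc)
	return doc, err
}

func main() {
	doc := Document{
		ID:      1,
		Name:    "Plan9 Operating System",
		Authors: []string{"Ken Thompson", "Dennis Ritchie"},
	}
	data, err := encodeGob(doc)
	if err != nil {
		panic(err)
	}
	got, err := decodeGob(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, name=%q\n", len(data), got.Name)
}
```

Note that every new gob encoder stream also transmits type information, so encoding each document independently carries some per-document overhead. Whether this beats JSON for our documents would have to be measured.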

I really like the second option because only in rare cases will the user ask for all of the document's fields. If we require the user to request only the fields they want in the API, then (maybe) we can gain a lot of performance. If the document is big, one seek on disk for the whole document can be much slower than N seeks for specific fields. Conversely, for small documents we could lose some performance too...
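
A rough sketch of how the second option could look, assuming the document has already been decoded from JSON once at index time. The flatten and fetchFields helpers are hypothetical, and a plain Go map stands in for the real key-value storage.

```go
package main

import (
	"fmt"
	"strconv"
)

// flatten walks a decoded JSON value and writes one entry per leaf into out,
// producing keys such as "1.name" and "1.authors.0".
func flatten(prefix string, v interface{}, out map[string]string) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			flatten(prefix+"."+k, child, out)
		}
	case []interface{}:
		for i, child := range t {
			flatten(prefix+"."+strconv.Itoa(i), child, out)
		}
	default:
		// Leaf value: store its string form (an assumption of this sketch).
		out[prefix] = fmt.Sprint(t)
	}
}

// fetchFields reads only the requested field paths of one document,
// skipping paths that are not present in the store.
func fetchFields(store map[string]string, docID string, fields []string) map[string]string {
	res := make(map[string]string, len(fields))
	for _, f := range fields {
		if val, ok := store[docID+"."+f]; ok {
			res[f] = val
		}
	}
	return res
}

func main() {
	doc := map[string]interface{}{
		"id":      1,
		"name":    "Plan9 Operating System",
		"authors": []interface{}{"Ken Thompson", "Dennis Ritchie"},
	}
	store := map[string]string{} // stand-in for the key-value storage
	flatten("1", doc, store)
	fmt.Println(fetchFields(store, "1", []string{"name", "authors.1"}))
}
```

With this layout a request for two fields costs two lookups and no JSON decoding, while rebuilding the whole document needs one lookup per leaf, which is the trade-off for small documents mentioned above.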