Stefan4472 / simple-search-engine

GNU General Public License v3.0

Investigate duplicate slug handling #15

Closed Stefan4472 closed 2 years ago

Stefan4472 commented 2 years ago

Something I noticed while running migrations on Stefans-Blog: re-indexing a file under the same slug appears to produce duplicate search results. Indexing the same slug a second time should overwrite the existing document instead.

Stefan4472 commented 2 years ago

Ideal behavior: index_file() and index_string() accept an overwrite flag. When the flag is set and the document is already in the database, the existing document is replaced. This, however, requires a mechanism for removing or modifying entries in the inverted indexes. That's an issue for a later day.

For now the workaround will probably be to simply not re-index the file if it is already in the index.

Stefan4472 commented 2 years ago

*For now the workaround will probably be for the caller to simply not re-index the file if it is already in the index.
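A minimal sketch of that caller-side guard, under stated assumptions: `FakeEngine` is a hypothetical stand-in for the real engine, and the `index_string(text, slug)` argument order is a guess, not the library's confirmed signature. The caller keeps its own set of slugs it has already indexed and skips anything it has seen.

```python
class FakeEngine:
    """Hypothetical stand-in for the search engine; it only records calls."""

    def __init__(self):
        self.indexed = []

    def index_string(self, text, slug):
        # Assumed signature, for illustration only.
        self.indexed.append(slug)


def index_once(engine, seen_slugs, slug, text):
    """Index `text` under `slug` unless the caller has already indexed it.

    Returns True if the document was indexed, False if it was skipped.
    """
    if slug in seen_slugs:
        return False
    engine.index_string(text, slug)
    seen_slugs.add(slug)
    return True


engine = FakeEngine()
seen = set()
first = index_once(engine, seen, "my-post", "hello world")
second = index_once(engine, seen, "my-post", "hello world, edited")
```

The second call is skipped, so the edited text never reaches the index. That is the limitation of the workaround: it avoids duplicates but cannot update a document.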

Also: see Python's bisect module, which we could use for a better implementation of the doc_id search in an InvertedList: https://www.tutorialspoint.com/python-inserting-item-in-sorted-list-maintaining-order
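Roughly what the bisect idea could look like, assuming an InvertedList keeps its doc_ids in a plain list sorted in ascending order (the real class layout may differ). The helper names `find_doc_index` and `insert_doc_id` are made up for this sketch.

```python
import bisect


def find_doc_index(doc_ids, doc_id):
    """Return the position of doc_id in the sorted list, or -1 if absent."""
    i = bisect.bisect_left(doc_ids, doc_id)
    if i < len(doc_ids) and doc_ids[i] == doc_id:
        return i
    return -1


def insert_doc_id(doc_ids, doc_id):
    """Insert doc_id at its sorted position, skipping it if already present."""
    i = bisect.bisect_left(doc_ids, doc_id)
    if i == len(doc_ids) or doc_ids[i] != doc_id:
        doc_ids.insert(i, doc_id)


postings = [2, 5, 9]
found = find_doc_index(postings, 5)
missing = find_doc_index(postings, 6)
insert_doc_id(postings, 6)
insert_doc_id(postings, 5)  # already present, so the list is unchanged
```

The lookup is a binary search, so it takes O(log n) comparisons instead of a linear scan. Insertion still shifts list elements, so it remains O(n) in the worst case.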

The probable next step is to support removing a document from the index. Then, when a duplicate is indexed with override=True, the SearchEngine can delete the existing document from the index and re-add it.
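A toy sketch of the delete-then-re-add flow, not the actual SearchEngine. `ToySearchEngine`, its `remove()` method, and the whitespace tokenizer are all assumptions made for illustration. The one design point it relies on is a per-document record of terms, so removal knows which postings to touch.

```python
from collections import defaultdict


class ToySearchEngine:
    """Hypothetical in-memory engine illustrating override-by-removal."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of slugs
        self.doc_terms = {}               # slug -> set of terms in that doc

    def remove(self, slug):
        """Drop the document from every posting set it appears in."""
        for term in self.doc_terms.pop(slug, set()):
            self.postings[term].discard(slug)
            if not self.postings[term]:
                del self.postings[term]

    def index_string(self, text, slug, override=False):
        """Index text under slug; replace an existing doc only if override."""
        if slug in self.doc_terms:
            if not override:
                raise ValueError(f"slug already indexed: {slug}")
            self.remove(slug)
        terms = set(text.lower().split())
        self.doc_terms[slug] = terms
        for term in terms:
            self.postings[term].add(slug)

    def search(self, term):
        """Return the slugs containing the term, sorted for stable output."""
        return sorted(self.postings.get(term.lower(), set()))


engine = ToySearchEngine()
engine.index_string("hello world", "post-1")
engine.index_string("goodbye world", "post-1", override=True)
```

After the second call, "hello" no longer matches post-1, while "world" and "goodbye" each match it exactly once.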

Stefan4472 commented 2 years ago

Opened #21 to address this

Stefan4472 commented 2 years ago

And created #22 with the bisect idea