collective / collective.nitf

A Dexterity-based content type inspired on the News Industry Text Format specification
8 stars 3 forks source link

Unnecessary SearchableText metadata #232

Closed idgserpro closed 3 years ago

idgserpro commented 4 years ago

There is an unnecessary SearchableText metadata:

https://github.com/collective/collective.nitf/blob/787d02bc15d3fc42105b43e488314ac94bc839af/src/collective/nitf/profiles/default/catalog.xml#L10

This probably makes the catalog grow unnecessarily.

idgserpro commented 4 years ago

@mauritsvanrees if an upgrade step is make to remove this metadata, is it necessary to do a clear and rebuild in the catalog?

mauritsvanrees commented 4 years ago

Yes.

Or you can do something like this (not tried):

for brain in catalog.unrestrictedSearchResults():
    obj = brain.getObject()
    catalog.catalog_object(obj, idxs=["id"]

This recatalogs the item, but only reindexes a single, cheap index: the id index (assuming this exists).

Depending on which Products.ZCatalog version this is, you may be able to use catalog.getAllBrains(), which would be better. See https://github.com/collective/collective.catalogcleanup/pull/24 for some code.

idgserpro commented 4 years ago

Thanks!

hvelarde commented 4 years ago

I don't think that's unnecessary.

idgserpro commented 4 years ago

@hvelarde for body text:

<p>Content of nitf</p>
<p>test content</p>
<p><strong>text strong</strong></p>

we have the SearchableText metadata:

title-of-nitf Title of nitf Description of nitf Content of nitf test content text strong

In what situation can we use this metadata? Doesn't collective.nitf use it anywhere. And for files, it can be really big, making the catalog big, and making indexing take longer.

idgserpro commented 4 years ago

@hvelarde sorry. In comment above I had put the SearchableText index. Now I fix it for metadata. Anyway, still find it unnecessary.

hvelarde commented 4 years ago

IIRC, this is necessary to include all metadata of an instance of this content type in the index. check the code and you'll find the subtitle, the byline, the location and many other attributes.

idgserpro commented 4 years ago

@hvelarde I think you're mixed. I'm talking about the SearchableText metadata (column) and not of SearchableText index. The SearchableText index:

https://github.com/collective/collective.nitf/blob/91095c633b7c3228654051fc95231e022e654136/src/collective/nitf/setuphandlers.py#L61

Is important and must be maintained. It really serves to group several fields in a single index.

But the SearchableText metadata (column):

https://github.com/collective/collective.nitf/blob/787d02bc15d3fc42105b43e488314ac94bc839af/src/collective/nitf/profiles/default/catalog.xml#L10

Is not necessary.

For exemple:

>>> from plone import api
>>> results = api.content.find(SearchableText='Text') # This uses the SearchableText index and is important
>>> brain = results[0]
>>> brain.SearchableText # This is the SearchableText metadata and is unnecessary. Where will we use the string below?
u'title-of-nitf Title of nitf  Description of nitf   Content of nitf \r\n test content \r\n text strong   

It is possible to have the SearchableText index without having the SearchableText metadata. I removed the SearchableText metadata and run the tests. All tests passed.