PILLUTLAAVINASH / google-enterprise-connector-manager

Automatically exported from code.google.com/p/google-enterprise-connector-manager

Add Ignored set to MimeTypeMap #143

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
TraversalContext MimeTypeMap currently has classes 'preferred',
'supported', 'unsupported', and 'unknown'.  Even with 'unsupported' and
'unknown' mime-types, the connectors are permitted (actually encouraged) to
submit meta-data about the document, but not submit the document content
itself.

Several customers have expressed a desire to skip certain classes of
document completely.  They do not wish these documents to be fed, even if
only meta-data would be indexed.

The primary reasons for this are:

1) Business model restricting searchable content to specific document types.

2) Documents fed with meta-data but no content still consume a document
license.

I suggest adding an additional class to the MimeTypeMap, perhaps
'ignoredMimeTypes'.  Any content types listed here should be ignored
(skipped) by the Connector during traversal.  These items should either not
be added to a DocumentList under construction, or not returned from
DocumentList.nextDocument().  The (TraversalContextAware) Connector may
choose which approach is suitable for the ECM.
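
To make the proposal concrete, here is a minimal sketch of a mime-type map with an extra 'ignored' set. It is not the Connector Manager's MimeTypeMap: the class name SketchMimeTypeMap, the method supportLevel, and the numeric levels for the existing classes are all hypothetical. The only point illustrated is that the 'ignored' set is consulted first and yields a negative level.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for a mime-type map; not the real MimeTypeMap. */
final class SketchMimeTypeMap {
    private final Set<String> preferred;
    private final Set<String> supported;
    private final Set<String> unsupported;
    private final Set<String> ignored;

    SketchMimeTypeMap(Collection<String> preferred, Collection<String> supported,
                      Collection<String> unsupported, Collection<String> ignored) {
        this.preferred = normalizeAll(preferred);
        this.supported = normalizeAll(supported);
        this.unsupported = normalizeAll(unsupported);
        this.ignored = normalizeAll(ignored);
    }

    /**
     * Negative: skip the document entirely. Zero: feed meta-data only.
     * Positive: content may be fed. The values 0, 1 and 2 are assumptions
     * made for this sketch.
     */
    int supportLevel(String mimeType) {
        String key = normalize(mimeType);
        if (ignored.contains(key)) {
            return -1; // 'ignored' wins over every other class
        }
        if (unsupported.contains(key)) {
            return 0;
        }
        if (preferred.contains(key)) {
            return 2;
        }
        if (supported.contains(key)) {
            return 1;
        }
        return 0; // 'unknown' is treated like 'unsupported' in this sketch
    }

    private static Set<String> normalizeAll(Collection<String> types) {
        Set<String> out = new HashSet<>();
        for (String type : types) {
            out.add(normalize(type));
        }
        return out;
    }

    private static String normalize(String type) {
        return type == null ? "" : type.trim().toLowerCase(Locale.ROOT);
    }
}
```

Checking the 'ignored' set before the others means a type listed in two sets is still skipped, which seems the safer reading of what the customers asked for.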

Original issue reported on code.google.com by Brett.Mi...@gmail.com on 15 Apr 2009 at 10:31

GoogleCodeExporter commented 8 years ago

Original comment by mgron...@gmail.com on 6 May 2009 at 11:03

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 19 Sep 2009 at 3:57

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 25 Sep 2009 at 1:29

GoogleCodeExporter commented 8 years ago

Original comment by Brett.Mi...@gmail.com on 10 Oct 2009 at 12:33

GoogleCodeExporter commented 8 years ago
Fixed 20 October 2009 in Connector Manager revision r2281

Original comment by Brett.Mi...@gmail.com on 20 Oct 2009 at 10:36

GoogleCodeExporter commented 8 years ago

Original comment by jl1615@gmail.com on 27 Oct 2009 at 11:05

GoogleCodeExporter commented 8 years ago
Revision r2281 adds a new MimeTypeMap set for documents
that should be 'ignored' (skipped).  Basically the use is:

If TraversalContext.mimeTypeSupportLevel(String mimetype)
returns -1, the document should be skipped.  Do not index
metadata. Do not index content.
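
A rough sketch of that check on the connector side, assuming a hypothetical Doc record and a support-level function passed in as a plain ToIntFunction rather than the real TraversalContext. The comment above mentions -1; this sketch treats any negative value as 'ignored', which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical helper; none of these names come from the Connector Manager. */
final class IgnoredFilterSketch {

    /** Minimal stand-in for a repository document. */
    static final class Doc {
        final String id;
        final String mimeType;

        Doc(String id, String mimeType) {
            this.id = id;
            this.mimeType = mimeType;
        }
    }

    /**
     * Returns the candidates that should still be fed. Documents whose
     * mime-type support level is negative are dropped: neither their
     * meta-data nor their content is kept.
     */
    static List<Doc> dropIgnored(List<Doc> candidates,
                                 ToIntFunction<String> supportLevel) {
        List<Doc> kept = new ArrayList<>();
        for (Doc doc : candidates) {
            if (supportLevel.applyAsInt(doc.mimeType) < 0) {
                continue; // ignored mime-type: skip the whole document
            }
            kept.add(doc);
        }
        return kept;
    }
}
```

This corresponds to the "do not add to the DocumentList under construction" approach; skipping inside nextDocument() would apply the same test one document at a time.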

At this time, support for this feature still rests with the
connector implementation.  John and I were discussing the
best way to implement skipping documents that fall into the
'ignored' set.  The connector could skip them internally,
either silently or with a logged message.  This would be easy for
the Livelink connector, but difficult for the Documentum
connector.

Months ago, Jeff and Marty mentioned that customers wanted
to know which documents were skipped and for what reason.
From that perspective, it makes more sense for the Connector
to throw a RepositoryDocumentException, which will skip the
document and log the action.  From the implementation point
of view, it is already supported by the CM, and could be
easily implemented in the Livelink and Documentum Connectors
with 2-4 lines of code.

One downside is that RepositoryDocumentExceptions are
logged at WARNING level.  This change creates a new subclass
of RepositoryDocumentException, SkippedDocumentException,
that gets caught separately and logged at FINER.  Connectors
could use it not just for 'ignored' mimetypes, but for any other
benign reason that a document gets skipped (perhaps some
other advanced config option).
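
The catch ordering this relies on can be sketched as follows. The exception classes here are local stand-ins with hypothetical names (prefixed Sketch), not the SPI classes, and the DocumentSource interface and levelFor method are invented for illustration. The subclass has to be caught first, otherwise the general handler would log it at WARNING.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Stand-in for the general per-document failure (hypothetical name). */
class SketchRepositoryDocumentException extends Exception {
    SketchRepositoryDocumentException(String message) {
        super(message);
    }
}

/** Stand-in for the benign "skip this document" subclass (hypothetical name). */
class SketchSkippedDocumentException extends SketchRepositoryDocumentException {
    SketchSkippedDocumentException(String message) {
        super(message);
    }
}

final class SkipLoggingSketch {
    private static final Logger LOGGER = Logger.getLogger("SkipLoggingSketch");

    /** Hypothetical source of documents that may fail per document. */
    interface DocumentSource {
        String next() throws SketchRepositoryDocumentException;
    }

    /**
     * Fetches one document and returns the level at which the outcome was
     * logged, or null if the document was fetched without an exception.
     */
    static Level levelFor(DocumentSource source) {
        try {
            source.next();
            return null;
        } catch (SketchSkippedDocumentException e) {
            // Deliberate skip: record it, but only at a quiet level.
            LOGGER.log(Level.FINER, "Skipping document: " + e.getMessage());
            return Level.FINER;
        } catch (SketchRepositoryDocumentException e) {
            // Any other per-document failure remains a warning.
            LOGGER.log(Level.WARNING, "Document failed: " + e.getMessage());
            return Level.WARNING;
        }
    }
}
```

A connector would then throw the skip exception wherever it detects an 'ignored' mimetype, and the caller decides how loudly to report it.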

SkippedDocumentException was added in Connector Manager revision r2325.

Original comment by Brett.Mi...@gmail.com on 4 Nov 2009 at 6:56