Closed GoogleCodeExporter closed 9 years ago
Needs research on how the GSA-side interactions should work.
Original comment by donald.z...@gmail.com
on 18 Apr 2008 at 10:31
The connector manager should expose the restrictions on:
doc size;
extension exclusion;
URL patterns.
The connectors should be restricted by these.
Original comment by jeffreyl...@gmail.com
on 24 May 2008 at 1:15
Suggested solution: Add configuration information to the
applicationContext.properties for: file size, list of Mime types.
Original comment by jeffreyl...@gmail.com
on 18 Jul 2008 at 8:35
Original comment by Brett.Mi...@gmail.com
on 25 Aug 2008 at 4:18
Description:
This set of changes addresses Connector Manager Issue 62 - Implement
TraversalContext
The constraints outlined in the Traversal Context do not match up well
to those configured by the GSA. For instance, the TraversalContext
specifies a max document size, whereas the GSA's maximum document size
depends upon the document type. The TraversalContext allows a weighted
map of mime types to support (or not support), whereas the GSA uses a
list of filename patterns (mostly just extensions). This usually is
not too much of a problem, but there are a couple of annoying gotchas.
For example, the GSA does not support any compressed file types, except
compressed PostScript files (.ps.Z or .ps.gz).
I enhanced the TraversalContext SPI slightly, offering a couple of
convenience methods that allow the connector to:
- determine the most preferred mime type from a supplied set.
(The Livelink connector could use this to select from various
Renditions.)
- determine the most preferred mime type to supply, given
a filename extension. The filename extension can be simple,
like .doc, or compound, like .ps.Z (see annoying gotcha above).
Both of these return null, if the mimetypes all have non-positive
support levels (they are unsupported).
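As a rough illustration of the first convenience method as described here, the sketch below picks the most preferred type from a supplied set and returns null when nothing has a positive support level. The class name PreferredTypeSketch, the setLevel() helper, and the numeric levels are assumptions made for illustration; this is not the actual MimeTypeMap code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: a stand-in, not the Connector Manager's MimeTypeMap. */
final class PreferredTypeSketch {
  private final Map<String, Integer> levels = new HashMap<>();
  private final int unknownLevel;

  PreferredTypeSketch(int unknownLevel) {
    this.unknownLevel = unknownLevel;
  }

  /** Hypothetical helper: records a support level for one mime type. */
  void setLevel(String mimeType, int level) {
    levels.put(mimeType, level);
  }

  /** Returns the recorded level, or the unknown level if none is recorded. */
  int supportLevel(String mimeType) {
    return levels.getOrDefault(mimeType, unknownLevel);
  }

  /** Returns the highest-ranked type, or null if none has a positive level. */
  String preferredMimeType(Set<String> mimeTypes) {
    String best = null;
    int bestLevel = 0;
    // Iterate in sorted order so that ties resolve deterministically.
    for (String type : new TreeSet<>(mimeTypes)) {
      int level = supportLevel(type);
      if (level > bestLevel) {
        best = type;
        bestLevel = level;
      }
    }
    return best;
  }
}
```

A connector with several renditions of a document could pass the set of available mime types and fetch only the rendition whose type is returned.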
I added a table of filename extensions to commonly associated
mime types. This table is a superset of information I gathered
from various sources and is provided as a plain text Resource
that could be edited by the customer, if needed.
License Note: Two of the sources I used were explicitly Apache 2.0
license projects (Apache2, and MimeUtils). The other two were
web sites (one at Duke University). I sent email asking if it was
OK to use information taken off the sites, but have not heard back.
UPDATE: I heard back from Duke University and got the OK. I was
unable to contact the other web site, so I removed its set of mappings
and replaced it with those from the Apache Tomcat web.xml file.
The TraversalContext is constructed via Spring using a bean definition
in applicationContext.xml. The bean definition also allows the user
to modify the TraversalContext constraints to suit their needs. The
maximum document size, unknown mimetype support level, and the sets
of mimetypes to prefer, support, or not support can all be configured
here.
The mimetype support level code understands the concept of content
type classes - for instance, all audio, image, and video types
can be easily excluded. This is done by specifying a content type,
sans subtype, in one of the lists. These content type classes are not
exclusive - if a contenttype/subtype entry appears elsewhere, that
more explicit entry takes precedence. For instance, you could specify:
... supported mime types ...
image/x-asciiart
...
... unsupported mime types ...
image
which means that ascii art images are supported, while all other
image formats are not supported.
TODO: The interaction between these content type classes and the
unknownMimeTypeSupport level needs to be closely examined.
TODO: The location of the content type classes (especially 'application')
in the supported vs unsupported sets needs to be examined.
Change Log:
----------
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/ProductionTraversalContext.java
- Implement preferredMimeType(Set mimeTypes) as a call to
MimeTypeMap.preferredMimeType(Set mimeTypes).
- Implement preferredMimeTypeForExtension(String extension) as a call to
MimeTypeMap.preferredMimeTypeForExtension(String extension).
- Initialize FileSizeLimitInfo and MimeTypeMap to defaults to
avoid null-pointer exception if they are not initialized by
applicationContext.xml.
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/QueryTraverser.java
- Set the TraversalContext on the TraversalManager (if it is
TraversalContextAware).
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/FileSizeLimitInfo.java
- Initialize the default maxDocumentSize to 30MB.
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
- Added setPreferredMimeTypes(Set mimeTypes). This complements
setSupportedMimeTypes() and setUnsupportedMimeTypes(). Preferred
mimetypes are typically plain text or html that require little or
no format conversion when indexing, whereas supported mimetypes may
require significant format conversion and text extraction when indexing.
- Added preferredMimeType(Set mimeTypes) to the interface. This returns
the most preferred mimetype from the supplied set.
- Added preferredMimeTypeForExtension(String extension) to the interface.
This returns the most preferred mimetype from the set of mime types
associated with the supplied filename extension.
- Added mime type preferences even within the same gross support level.
For instance, IANA-registered '/vnd.*' subtypes are preferred
over experimental '/x-*' subtypes.
M projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalContext.java
- Added preferredMimeType(Set mimeTypes) to the interface.
- Added preferredMimeTypeForExtension(String extension) to the interface.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/MimeTypeMapTest.java
- Fix tests.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/SpringBasedProductionTraversalContextTest.java
- Fix tests.
M projects/connector-manager/etc/applicationContext.xml
- Added bean instantiation for the ProductionTraversalContext, including
child beans for MimeTypeMap, and FileSizeLimitInfo.
A projects/connector-manager/etc/ext2mimetype.txt
- A table of filename extensions to mime types. It is not a unique map -
an extension may have several mime types associated with it.
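To show how the QueryTraverser change above might look from a connector's side, here is a minimal sketch. The interfaces are local stand-ins declared only so the example compiles on its own. The method names maxDocumentSize(), mimeTypeSupportLevel(), and setTraversalContext() are assumptions based on names used in this thread, and SketchTraversalManager and shouldSendContent() are hypothetical.

```java
/** Local stand-in for the SPI interface; method names are assumptions. */
interface SketchTraversalContext {
  long maxDocumentSize();
  int mimeTypeSupportLevel(String mimeType);
}

/** Local stand-in for the opt-in interface named in the change log. */
interface SketchTraversalContextAware {
  void setTraversalContext(SketchTraversalContext context);
}

/** Hypothetical connector-side manager that honors the constraints. */
final class SketchTraversalManager implements SketchTraversalContextAware {
  private SketchTraversalContext context;

  @Override
  public void setTraversalContext(SketchTraversalContext context) {
    this.context = context;
  }

  /** Returns true if content of this size and type is worth sending. */
  boolean shouldSendContent(long sizeInBytes, String mimeType) {
    if (context == null) {
      return true; // No constraints were supplied.
    }
    return sizeInBytes <= context.maxDocumentSize()
        && context.mimeTypeSupportLevel(mimeType) > 0;
  }
}

final class SketchTraverser {
  /** Hands the context to the manager only if the manager opted in. */
  static void configure(Object manager, SketchTraversalContext context) {
    if (manager instanceof SketchTraversalContextAware) {
      ((SketchTraversalContextAware) manager).setTraversalContext(context);
    }
  }
}
```

Managers that do not implement the opt-in interface are left untouched, so existing connectors would keep working without changes.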
Original comment by Brett.Mi...@gmail.com
on 12 Sep 2008 at 9:15
Perhaps I need to clarify my original concerns.
One of my goals in this task was to map the
default supported and unsupported file types
from the GSA Crawl and Index page (specified
as a list of [commented-out or not] filename
patterns/extensions) to a list of supported and
unsupported mimetypes in the connector manager.
For instance, the GSA Crawl and Index page says:
# The following are popular filetype extensions -
# uncomment the lines to disable crawling them
# Microsoft Word
#.doc$
# Microsoft Excel
#.xls$
#.xlw$
# Microsoft Powerpoint
#.ppt$
# Microsoft Access
.mdb$
...
Note that Word, Excel, and Powerpoint filename
extensions are commented out, meaning they will
be crawled; but Access filename extension is not
commented out, meaning that Access databases will
not be crawled. This corresponds to the following
entries in the Connector Manager's MimeTypeMap:
<property name="supportedMimeTypes">
<set>
<value>application/msword</value>
<value>application/excel</value>
<value>application/powerpoint</value>
...
</set>
</property>
...
<property name="unSupportedMimeTypes">
<set>
<value>application/x-msaccess</value>
...
</set>
</property>
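For anyone doing this mapping by hand, the reading of the Crawl and Index patterns described above (uncommented lines are excluded, commented-out lines are crawled) can be sketched as a small parser. The class and method names are hypothetical, and the sketch assumes the pattern text uses '#' only as a line-leading comment marker, as in the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the active exclusion patterns in a pattern block. */
final class CrawlPatternSketch {
  /** Returns the uncommented patterns, i.e. the file types that are NOT crawled. */
  static List<String> excludedPatterns(String patternText) {
    List<String> excluded = new ArrayList<>();
    for (String raw : patternText.split("\\r?\\n")) {
      String line = raw.trim();
      // Blank lines, descriptive comments, and commented-out patterns are skipped.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      excluded.add(line);
    }
    return excluded;
  }
}
```

Each returned pattern would then need to be mapped by hand to one or more mime types in the unsupported set.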
I ask you to look at the proposed full MimeTypeMap
near the end of the CM applicationContext.xml here:
http://tinyurl.com/5jfgcu
Mohit is correct. An entry of "application" is a
catch-all for "application/*" content subtypes not
explicitly specified elsewhere. For instance,
"application/pdf" is explicitly listed in the
supportedMimeTypes, so it is supported. But
"application/x-foobar" is not explicitly mentioned
in either the supported or unsupported lists, so
it would be matched by the "application" content
type class catch-all.
The use of content type catch-alls for "image",
"audio", "video", etc is an obvious advantage.
Since we don't support these media types, we
don't need to detail every single existing
(and future) "image/subtype" content type in
the unsupported list.
My original questions relate to how these catch-all
content types relate to the existing "unknown
mimetype support level" mechanism. The MimeTypeMap
in the above link contains catch-all content entries
for all of the standard content type classes, so it is
very unlikely that anything would end up as "unknown".
Additionally, should the "application" catch-all be
in the unsupported list, or the supported list,
or not in any list? Leaving off an "application"
catch-all would cause application subtypes not
explicitly mentioned to fall through to the "unknown
mimetype" support level. I am starting to lean toward
that configuration. [The same question could be made
of the "text" content type class catch-all.]
The Connector Manager MimeTypeMap does not reflect
end-user edits to the GSA Crawl and Index page,
so if the user wishes to explicitly include or
exclude certain file types, edits will have to
be made in both places. The ext2mimetype.txt
table can help them map filename extensions used
in the GSA Crawl and Index page to the mime types
specified in applicationContext.xml.
Original comment by Brett.Mi...@gmail.com
on 12 Sep 2008 at 9:18
Let me summarize to make sure we're all clear, because Brett's recent comment:
Mohit is correct. An entry of "application" is a
catch-all for "application/*" content subtypes not
explicitly specified elsewhere.
was not entirely true in the code I originally reviewed (r937). In the recent
snapshot (r941) the MimeTypeMap.mimeTypeSupportLevel() method has been altered
so that such an entry now behaves entirely as the catch-all described above.
This summary is based on r941 of the change branch which is the most recent.
1. There are 3 sections of 'declared' mime types (read in from the
applicationContext.xml file) in the MimeTypeMap: 1) Preferred Mime Types,
2) Supported Mime Types, and 3) Unsupported Mime Types.
2. In addition, if a given mime type is not found among the 'declared' types,
the MimeTypeMap also supports a 4th level, which is the Unknown Mime Type
Support Level. The actual value for this level can be set.
3. When searching for support level: preferred types > supported types >
unknown types > unsupported types (which are less than 0).
4. When searching for support level given a mime type, there is no precedence
involved in checking the sets of 'declared' types - they are all contained in
one big hash. There is, however, a precedence involved in checking for
precision - exact matches first and then, if there is no match and a subtype
was given, trying to match the root media-type without the given subtype.
[Again for clarity, a Mime Type declaration as I have defined it is
<media-type>/<subtype>, so in application/msword, 'application' is the
media-type and 'msword' is the subtype.]
5. The appearance of just the media-type without a subtype in the set of
'declared' mime types will act as a catch-all for all subtypes of that
media-type that do not appear in any of the other declarations. So
'application' as a declared mime type will match 'application/*' iff there
wasn't an exact match for 'application/<subtype>' in any of the other
declarations.
The latest version of the code has a unit test for this but for example:
Declare Supported = ["foo/baz", "bar/baz"]
Declare Unsupported = ["foo", "bar/cat"]
SupportLevel("foo/baz") -> Supported
SupportLevel("foo/rat") -> Unsupported (catch all)
SupportLevel("bar/baz") -> Supported
SupportLevel("bar/zoo") -> Unknown
SupportLevel("bar/cat") -> Unsupported
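The lookup order in points 4 and 5 can be sketched in Java roughly as follows. This is not the r941 code: the class name, the declare() helper, and the numeric level constants are assumptions chosen only to reproduce the example above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: exact match first, then the media-type catch-all. */
final class CatchAllSketch {
  // Assumed values; only their ordering matters for this sketch.
  static final int SUPPORTED = 4;
  static final int UNKNOWN = 1;
  static final int UNSUPPORTED = -1;

  // All declared types share one map, so no set takes precedence over another.
  private final Map<String, Integer> declared = new HashMap<>();

  /** Hypothetical helper: declares each given type at the given level. */
  void declare(int level, String... mimeTypes) {
    for (String type : mimeTypes) {
      declared.put(type, level);
    }
  }

  int supportLevel(String mimeType) {
    // 1. An exact media-type/subtype match wins.
    Integer level = declared.get(mimeType);
    if (level == null) {
      // 2. Otherwise fall back to the bare media type, if it was declared.
      int slash = mimeType.indexOf('/');
      if (slash > 0) {
        level = declared.get(mimeType.substring(0, slash));
      }
    }
    // 3. Anything still unmatched gets the unknown level.
    return level != null ? level : UNKNOWN;
  }
}
```

Declaring "foo/baz" and "bar/baz" as supported and "foo" and "bar/cat" as unsupported gives the same five results as the example: only "bar/zoo" falls through to the unknown level, because no bare "bar" entry was declared.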
Now for the resolution - it seems Mohit and I both agree that as currently
written (r941) we're good with leaving the catch-all declarations in the
'Unsupported' section.
Unless anyone has an objection, let's remove that <??? > comment from the
applicationContext.xml file.
Marty
Original comment by Brett.Mi...@gmail.com
on 12 Sep 2008 at 9:19
The information in the ext2mimetype.txt file was collected from various
sources. All but one were explicitly Apache 2 licensed (Apache 2.0,
Tomcat 5.5, MimeUtils). One source of information was a Duke University
website. I wrote asking permission to use the information and was
granted it. (See attached)
Original comment by Brett.Mi...@gmail.com
on 12 Sep 2008 at 9:31
Attachments:
Fixed in revision r944
Additional Documentation is forthcoming.
Original comment by Brett.Mi...@gmail.com
on 12 Sep 2008 at 9:45
r962 | Brett.Michael.Johnson | 2008-10-01 15:42:49 -0700 (Wed, 01 Oct 2008) | 38 lines
This is a slight reworking of the changes for Issue 62.
When implementing doc example code, I ran into a minor flaw.
However, when fixing the flaw, I had extensive discussion
with John L (the flawed code was at his request), and
decided that the filename extension to mime type code
should not be in here at this time. The task is more
suitably done by third party tools such as
Mime-Util http://sourceforge.net/projects/mime-util
or
MagicMimeTypeIdentifier
http://aperture.sourceforge.net/doc/javadoc/org/semanticdesktop/aperture/mime/identifier/magic/MagicMimeTypeIdentifier.html
See http://fredeaker.blogspot.com/2006/12/file-type-mime-detection.html
Change Log:
----------
M projects/connector-manager/etc/applicationContext.xml
- Improved comments in MimeTypeMap configuration.
- Dropped setting ext2mimetype.txt property.
M projects/connector-manager/source/java/com/google/enterprise/connector/spi/TraversalContext.java
- Improved comments.
- Dropped preferredMimeTypeForExtension() method.
- preferredMimeType() no longer returns null for unsupported types.
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/ProductionTraversalContext.java
- Dropped preferredMimeTypeForExtension() method.
- preferredMimeType() no longer returns null for unsupported types.
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
- preferredMimeType() no longer returns null for unsupported types.
- Dropped preferredMimeTypeForExtension() method.
- Dropped configuring filename extension to mime type table.
D projects/connector-manager/etc/ext2mimetype.txt
- Removed.
Original comment by Brett.Mi...@gmail.com
on 7 Nov 2008 at 12:41
r1070 | Brett.Michael.Johnson | 2008-11-03 17:36:00 -0800 (Mon, 03 Nov 2008) | 32 lines
This is a minor follow-on modification to the changes for
Connector Manager Issue 62 - TraversalContext
After discussing the documentation implications with George,
I decided to make a couple of minor adjustments to the
mime type code:
- Restore the default value of unknownMimeTypeSupportLevel to 1.
I had changed it from 1 to 2 early in development for a reason
that eventually didn't pan out. Currently there is no behavioral
difference between the two values. In reality, only values of
0 or 1 make sense for unknownMimeTypeSupportLevel.
- Rank content types sans subtypes below content types with subtypes.
For instance, supportLevel("text") < supportLevel("text/x-foo").
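The ordering within a single support level can be pictured with a small tie-break function. This is only a sketch: the class name and the rank values are assumptions. The thread states that bare content types rank below types with subtypes and that '/vnd.*' subtypes are preferred over '/x-*' subtypes, but it does not say where other subtypes fall, so the sketch simply groups them with '/vnd.*'.

```java
import java.util.Comparator;

/** Illustrative only: orders mime types that share one gross support level. */
final class TieBreakSketch {
  /** Rank within one support level; a higher value is more preferred. */
  static int rank(String mimeType) {
    int slash = mimeType.indexOf('/');
    if (slash < 0) {
      return 0; // Bare content type such as "text".
    }
    String subtype = mimeType.substring(slash + 1);
    if (subtype.startsWith("x-")) {
      return 1; // Experimental subtype.
    }
    return 2; // Everything else, including vnd.* (grouping is an assumption).
  }

  /** Orders types from least to most preferred. */
  static final Comparator<String> BY_PREFERENCE =
      Comparator.comparingInt(TieBreakSketch::rank);
}
```

With this ordering, rank("text") is less than rank("text/x-foo"), which matches the example above.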
Change Log:
M projects/connector-manager/etc/applicationContext.xml
- Restore default unknownMimeTypeSupportLevel value to 1
(This doesn't really change any behaviour.)
M projects/connector-manager/source/java/com/google/enterprise/connector/traversal/MimeTypeMap.java
- Restore default unknownMimeTypeSupportLevel value to 1
- Rank content types sans subtypes below any content types
with subtypes for a given support level.
M projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/MimeTypeMapTest.java
M projects/connector-manager/source/javatests/com/google/enterprise/connector/traversal/SpringBasedProductionTraversalContextTest.java
- Fix up expected unknownMimeTypeSupportLevel in tests.
Original comment by Brett.Mi...@gmail.com
on 7 Nov 2008 at 12:45
Original comment by jl1615@gmail.com
on 12 Jan 2009 at 3:14
Original issue reported on code.google.com by
jl1615@gmail.com
on 5 Nov 2007 at 10:47