Norconex / collector-filesystem

Norconex Filesystem Collector is a flexible crawler for collecting, parsing, and manipulating data ranging from local hard drives to network locations into various data repositories such as search engines.
http://www.norconex.com/collectors/collector-filesystem/

Where is documentation on saving results to a database ? #30

Closed pkasson closed 6 years ago

essiembre commented 6 years ago

For a database, you likely want to use the SQL Committer.

I recommend you use the 2.0.0-SNAPSHOT version since version 1.0.0 may be too limited for some databases.

pkasson commented 6 years ago

Great - thanks! I currently don't have a <committer> section as stated here:

you can use the following XML to configure a SQL database in the <committer> section of your Norconex Collector configuration.

Can this be configured in the code, or what would you suggest ?

pkasson commented 6 years ago

I added an XML config for the committer and now suddenly I am getting this ... related ? (I am running this all in Eclipse for now) ... didn't have this earlier:

INFO  - JobSuite - JEF work directory is: ./progress
INFO  - JobSuite - JEF log manager is : FileLogManager
INFO  - JobSuite - JEF job status store is : FileJobStatusStore
INFO  - AbstractCollector - Suite of 0 crawler jobs created.
INFO  - JobSuite - Initialization...
INFO  - JobSuite - No previous execution detected.
FATAL - JobSuite - Job suite execution failed: null
java.lang.IllegalArgumentException: Root job cannot be null.
    at com.norconex.jef4.status.JobSuiteStatusSnapshot.create(JobSuiteStatusSnapshot.java:130)

essiembre commented 6 years ago

You can add it via code for sure, using FilesystemCrawlerConfig#setCommitter(...).

The error you are getting normally means you have not given an id to your collector and/or crawler configuration.

Both FilesystemCollectorConfig#setId() and FilesystemCrawlerConfig#setId(...) must be set.
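
In the XML configuration, those two ids correspond to the id attributes on the collector and crawler elements, e.g. (a minimal skeleton):

    <fscollector id="My Collector">
        <crawlers>
            <crawler id="My Crawler">
                <!-- startPaths, importer, committer, etc. -->
            </crawler>
        </crawlers>
    </fscollector>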

pkasson commented 6 years ago

Great, got that fixed. Now I have added a committer to my configuration, but I don't see good examples on getting this right, so for now the crawler is running but not saving to the table. Here is my configuration:

<fscollector id="Text Files">
    <logsDir>${workdir}/logs</logsDir>
    <progressDir>${workdir}/progress</progressDir>
    <crawlers>
        <crawler id="Sample Crawler">
            <workDir>${workdir}</workDir>
            <startPaths>
                <path>${path}</path>
            </startPaths>
            <numThreads>2</numThreads>
            <keepDownloads>false</keepDownloads>
            <importer>
                <postParseHandlers>
                    <tagger class="${tagger}.ReplaceTagger">
                        <replace fromField="samplefield" regex="true">
                            <fromValue>ping</fromValue>
                            <toValue>pong</toValue>
                        </replace>
                        <replace fromField="Subject" regex="true">
                            <fromValue>Sample to crawl</fromValue>
                            <toValue>Sample crawled</toValue>
                        </replace>
                    </tagger>
                </postParseHandlers>
            </importer>
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>${workdir}/crawledFiles</directory>
            </committer>
        </crawler>
    </crawlers>
    <committer class="com.norconex.committer.sql.SQLCommitter">
        <driverClass>com.mysql.jdbc.Driver</driverClass>
        <connectionUrl>jdbc:mysql://server:3306/database</connectionUrl>
        <tableName>crawler</tableName>
        <createMissing>true</createMissing>
        <properties>
        </properties>
        <fixFieldNames>true</fixFieldNames>
        <fixFieldValues>true</fixFieldValues>
        <username>username</username>
        <password>password</password>

    </committer>
</fscollector>
essiembre commented 6 years ago

Your committer configuration must be inside your crawler section, and you can have only one (for multiple, consider using the MultiCommitter). The XML structure documented here must be respected: http://www.norconex.com/collectors/collector-filesystem/configuration
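
If you want to keep both the filesystem and SQL committers, a rough sketch of that nesting (not tested; verify MultiCommitter's exact XML against its documentation) would be, inside your crawler section:

    <committer class="com.norconex.committer.core.impl.MultiCommitter">
        <!-- Assumed nesting: each wrapped committer keeps its own configuration. -->
        <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
            <directory>${workdir}/crawledFiles</directory>
        </committer>
        <committer class="com.norconex.committer.sql.SQLCommitter">
            <!-- driverClass, connectionUrl, username, password, tableName, etc. -->
        </committer>
    </committer>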

If you are running from the command line, you can add the -k flag to prevent execution if there are XML configuration errors.

pkasson commented 6 years ago

Ok, updated configuration and no errors when run ... but also no data in table. Clearly it is rejecting everything, maybe I am not getting the collector logic.

All entries look like this:

INFO - DOCUMENT_METADATA_FETCHED - DOCUMENT_METADATA_FETCHED: file:///Users/pkasson/temp/quickrad/target/classes/martindale/files/martindale_78211.txt
INFO - REJECTED_UNMODIFIED - REJECTED_UNMODIFIED: file:///Users/pkasson/temp/quickrad/target/classes/martindale/files/martindale_78211.txt
INFO - DOCUMENT_METADATA_FETCHED - DOCUMENT_METADATA_FETCHED: file:///Users/pkasson/temp/quickrad/target/classes/martindale/files/martindale_78212.txt
INFO - REJECTED_UNMODIFIED - REJECTED_UNMODIFIED: file:///Users/pkasson/temp/quickrad/target/classes/martindale/files/martindale_78212.txt
INFO - DOCUMENT_METADATA_FETCHED - DOCUMENT_METADATA_FETCHED: file:///Users/pkasson/temp/quickrad/target/classes/martindale/files/martindale_78213.txt

I believe I found some posts that added this, even though it was stated that the default is to include everything:

    <documentFilters>
        <filter
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" caseSensitive="false">.*</filter>
    </documentFilters>
essiembre commented 6 years ago

It tells you why they were rejected: REJECTED_UNMODIFIED. That means the documents have not changed since they were last crawled so they will not be sent to your committer again. You can delete your workdir (or "crawlstore") directory before crawling to have it start fresh. You can also disable creating checksums altogether, so everything will always appear as if it was new/modified:

 <metadataChecksummer disabled="true" />
 <documentChecksummer disabled="true" />
pkasson commented 6 years ago

Ok, makes sense. Ran again and it was trying to add data to the database. So that's the next issue - I see no SQL for how to create the crawler table, and it does not appear to auto-create the table:

com.norconex.committer.core.CommitterException: Could not commit batch to database.

INSERT INTO crawler(collector_content_type,date,document_contentFamily,X_Parsed_By,collector_lastmodified,collector_content_encoding,collector_filesize,dcterms_modified,Last_Modified,title,targetReferenceField,Last_Save_Date,document_reference,embeddedRelationshipId,meta_save_date,dc_title,collector_is_crawl_new,document_contentType,Content_Encoding,modified,Content_Length,targetContentField,Content_Type) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) Parameters: [, 2007-11-14T07:16:06Z|2007-11-14T07:16:08Z, archive,

pkasson commented 6 years ago

I found an article that showed this, but it does not create a table:

 <tableName>test_table</tableName>
  <createMissing>true</createMissing>
essiembre commented 6 years ago

Try with version 2.0.0-SNAPSHOT. You can find all available options here: https://www.norconex.com/collectors/committer-sql/configuration

You will find on that page SQL examples to create the table and new fields if they do not already exist. You can create them upfront yourself if you want specific data types.
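
For example, a rough sketch of the table/field creation options (not tested; verify the exact option names on that page - the ${fieldName} placeholder in particular is an assumption):

    <committer class="com.norconex.committer.sql.SQLCommitter">
        ...
        <!-- SQL used when the target table does not exist yet. -->
        <createTableSQL>
            CREATE TABLE ${tableName} (
                ${targetReferenceField} VARCHAR(1024) NOT NULL,
                ${targetContentField} TEXT,
                PRIMARY KEY (${targetReferenceField})
            )
        </createTableSQL>
        <!-- SQL used to add a newly discovered field as a column. -->
        <createFieldSQL>
            ALTER TABLE ${tableName} ADD ${fieldName} TEXT
        </createFieldSQL>
    </committer>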

pkasson commented 6 years ago

I must be missing something - I have seen this SQL, but when I tried it as it is shown, the SQL committer does not work:

CREATE TABLE ${tableName} (
    ${targetReferenceField} VARCHAR(32672) NOT NULL,
    ${targetContentField} CLOB,
    PRIMARY KEY ( ${targetReferenceField} ),
    title VARCHAR(256)
)

Clearly, if one were to create their own table structure, the system needs to know how to map the fields. Using that SQL above, when it runs, it fails with:

Caused by: java.sql.SQLException: Unknown column 'id' in 'where clause' Query: SELECT 1 FROM crawler WHERE id = ? Parameters: [file:///Users/pkasson/temp/quickrad/lib/jdom-1.1.jar]

Where can I find the SQL to fully build the minimal structure and then how does that get mapped so the app knows (or are there default fields, like the 'id' above) how to store data ?

Thank you.

pkasson commented 6 years ago

Tried to build the SQL based on the error received, no joy:

Incorrect datetime value: '1497860754000' for column 'collector_lastmodified'. I changed the column type to datetime and it still failed.

Still don't see an example of a table create with all fields :(

Clearly, the app expects a table with following fields ... but not able to get this to make the app work (put this together based on the failed insert statement):

CREATE TABLE crawler (
    id int(11) DEFAULT NULL,
    content blob,
    collector_content_type varchar(100) COLLATE utf8_bin DEFAULT NULL,
    date varchar(100) COLLATE utf8_bin DEFAULT NULL,
    document_contentFamily varchar(100) COLLATE utf8_bin DEFAULT NULL,
    X_Parsed_By varchar(500) COLLATE utf8_bin DEFAULT NULL,
    collector_lastmodified datetime DEFAULT NULL,
    collector_content_encoding varchar(100) COLLATE utf8_bin DEFAULT NULL,
    collector_filesize int(8) DEFAULT NULL,
    dcterms_modified varchar(100) COLLATE utf8_bin DEFAULT NULL,
    Last_Modified varchar(100) COLLATE utf8_bin DEFAULT NULL,
    title varchar(100) COLLATE utf8_bin DEFAULT NULL,
    Last_Save_Date varchar(100) COLLATE utf8_bin DEFAULT NULL,
    document_reference varchar(100) COLLATE utf8_bin DEFAULT NULL,
    embeddedRelationshipId varchar(100) COLLATE utf8_bin DEFAULT NULL,
    meta_save_date varchar(100) COLLATE utf8_bin DEFAULT NULL,
    dc_title varchar(100) COLLATE utf8_bin DEFAULT NULL,
    collector_is_crawl_new varchar(100) COLLATE utf8_bin DEFAULT NULL,
    document_contentType varchar(100) COLLATE utf8_bin DEFAULT NULL,
    Content_Encoding varchar(100) COLLATE utf8_bin DEFAULT NULL,
    modified varchar(100) COLLATE utf8_bin DEFAULT NULL,
    Content_Length varchar(100) COLLATE utf8_bin DEFAULT NULL,
    Content_Type varchar(100) COLLATE utf8_bin DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

essiembre commented 6 years ago

The Collector does not "expect" any fields to exist in your database table. It does not know upfront the fields it will discover. You control what you want to keep and you should store only what you need. If you want to store arbitrary fields, maybe a relational database is not your best choice?

Normally, you would define upfront what you want to capture, and then use features of the Importer module such as the RenameTagger or KeepOnlyTagger to do the "mapping" you are describing and send only what you need to your database.

Still, if you really want to store everything the crawler discovers (which can be a lot and may vary for each document), you can provide the SQL for the createFieldSQL configuration option and new fields will be created dynamically as needed. Some of the discovered field names may not be compliant with your database field name syntax. If so, an option is to set both fixFieldNames and fixFieldValues to true.

pkasson commented 6 years ago

Ok, but then why was it building a SQL statement on its own ... clearly it had a mapping of some input content metadata as it was crafting the following SQL:

Query: INSERT INTO crawler(collector_content_type,date,document_contentFamily,X_Parsed_By,collector_lastmodified,collector_content_encoding,collector_filesize,dcterms_modified,Last_Modified,title,Last_Save_Date,content,document_reference,embeddedRelationshipId,meta_save_date,dc_title,collector_is_crawl_new,document_contentType,Content_Encoding,modified,id,Content_Length,Content_Type) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) Parameters: [, 2007-11-14T07:16:06Z|2007-11-14T07:16:08Z, archive ...

Clearly it formed that from something inside the code as it did not determine any of those field names from the source.

But can you provide a sample SQL committer configuration that does what you describe? I agree that I don't want to store the content, just metadata and the like, so that I could select ONLY what we want to store.

Thanks !

essiembre commented 6 years ago

By default, the collector will send to your committer all metadata fields it finds, plus some it comes up with itself. Typically, those starting with collector* come from the Collector and those starting with document* come from the Importer module. The rest should be extracted from the file itself or its properties. The SQL gets built dynamically with all fields found (so there is no mapping per se).

So within your configuration, you can do some mapping/filtering as needed as illustrated below. Let's assume you only want to keep the document dc.title and the collector.filesize but you want to rename them as title and size respectively:

<importer>
    ... 
    <postParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>dc.title,collector.filesize</fields>
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="dc.title" toField="title" overwrite="true" />
            <rename fromField="collector.filesize" toField="size" overwrite="true" />
        </tagger>
    </postParseHandlers>
    ...
</importer>

With the above, the SQL that will be constructed should look like this:

INSERT INTO crawler(title,size) VALUES (?, ?) 

Any clearer?

pkasson commented 6 years ago

Thanks - that does make sense now. So if the crawler runs against different directories, it could potentially create new columns dynamically depending on what metadata it finds, if it is not mapped manually, right? Separately, I am now getting this error:

com.norconex.committer.core.CommitterException: Could not commit batch to database.

due to ...

Caused by: java.sql.SQLException: Data source is closed
    at org.apache.commons.dbcp2.BasicDataSource.createDataSource(BasicDataSource.java:2016)
    at org.apache.commons.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:1533)
    at org.apache.commons.dbutils.AbstractQueryRunner.prepareConnection(AbstractQueryRunner.java:204)
    at org.apache.commons.dbutils.QueryRunner.query(QueryRunner.java:287)
    at com.norconex.committer.sql.SQLCommitter.runExists(SQLCommitter.java:584)
    at com.norconex.committer.sql.SQLCommitter.recordExists(SQLCommitter.java:575)
    at com.norconex.committer.sql.SQLCommitter.sqlInsertDoc(SQLCommitter.java:527)
    at com.norconex.committer.sql.SQLCommitter.addOperation(SQLCommitter.java:508)
    at com.norconex.committer.sql.SQLCommitter.commitBatch(SQLCommitter.java:465)

Verified the connection is valid.

pkasson commented 6 years ago

Just ran a simple query using the Apache Commons DbUtils library, where the crash occurred:

// Standalone test of the same Commons DbUtils calls the SQL Committer uses.
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.Map;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.handlers.MapListHandler;

Class.forName("com.mysql.jdbc.Driver");
String db = "jdbc:mysql://localhost/kasstek?user=root&password=rodnoc";
Connection connection = DriverManager.getConnection(db);
MapListHandler beanListHandler = new MapListHandler();
QueryRunner runner = new QueryRunner();
List<Map<String, Object>> list = runner.query(connection, "SELECT * FROM crawler", beanListHandler);
System.out.println(list.toString());

and, after adding a single row with a single column, here are results:

[{id=1}]

So ... the issue is not the connection or the table, but something else ?

pkasson commented 6 years ago

After running again, on a similar path of files, I get this now:

Caused by: java.sql.SQLException: Unknown column 'collector_content_type' in 'field list' Query: INSERT INTO crawler(collector_content_type,date,document_contentFamily,X_Parsed_By,collector_lastmodified,collector_content_encoding,collector_filesize,dcterms_modified,Last_Modified,title,Last_Save_Date,content,document_reference,embeddedRelationshipId,meta_save_date,dc_title,collector_is_crawl_new,document_contentType,Content_Encoding,modified,id,Content_Length,Content_Type) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) Parameters: [ ... ]

Your last post indicated only the fields included in the tagger would be created; here is that configuration:

            <postParseHandlers>
                <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                    <fields>dc.title,collector.filesize</fields>
                </tagger>
                <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                    <rename fromField="dc.title"           toField="title" overwrite="true" />
                    <rename fromField="collector.filesize" toField="size" overwrite="true" />
                </tagger>
            </postParseHandlers>
essiembre commented 6 years ago

Are the tags in the right place? If so, your committer queue likely has a bunch of older documents waiting to be sent that still have all these fields. Clear your committer queue and try again.

About your other question, you are right that new fields may be added as discovered if you do not filter them.

pkasson commented 6 years ago

Success!!

Well ... I didn't see that committer-queue folder and that was probably the issue all along - THANK YOU for pointing that out. I removed all data in that folder and it committed to the table !

  1. Now ... I thought I would only have the file name (dc.title) and the file size, but it is also storing the content, which I don't want and didn't have mapped ... assuming there is a way to exclude this, even though it was not mapped ?

  2. You mentioned that the system can add new columns based on what it finds in other documents, how does it classify new content and then assign column names to that other content ?

  3. Also, I scanned another directory that had XML in it, some of which caused parsing errors. I don't want the content parsed, just cataloged for now - how do I skip parsing of content so the document at least gets committed (documents with XML parsing errors don't get committed, or cause other issues), since I don't want to store the content anyway ?

essiembre commented 6 years ago
  1. The Collector treats documents as "content" + "fields" (separate things). If you want to eliminate the content, one way to do so is to use the following Importer transformer as a post-parse handler:

    <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer"
          begin="0" end="0"/>

    You'll have to test whether it still creates the content field or not, but it should be empty if so.

  2. If you filter the fields to only what you want to keep (e.g. with the KeepOnlyTagger), it will not create new fields beyond those you defined. Which fields get discovered depends on the internal parser used for each mime-type. You may want to have a look at Apache Tika for more details (the third-party library used to parse most files).

  3. The importer module allows you to modify the parser used with its configuration tag <documentParserFactory>. You can tell the default implementation to ignore some content types. See the <ignoredContentTypes> configuration, described in GenericDocumentParserFactory.

pkasson commented 6 years ago
  1. Put the transformer into postParseHandlers, but it did not work; I still have content in the table:
        <postParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" begin="0" end="0" />
essiembre commented 6 years ago

Can you attach your latest config with these changes in?

pkasson commented 6 years ago
<!-- Copyright 2010-2017 Norconex Inc. Licensed under the Apache License, 
    Version 2.0 (the "License"); you may not use this file except in compliance 
    with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 
    Unless required by applicable law or agreed to in writing, software distributed 
    under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES 
    OR CONDITIONS OF ANY KIND, either express or implied. See the License for 
    the specific language governing permissions and limitations under the License. -->
<fscollector id="Text Files">
    <logsDir>${workdir}/logs</logsDir>
    <progressDir>${workdir}/progress</progressDir>
    <crawlers>
        <crawler id="Sample Crawler">
            <workDir>${workdir}</workDir>
            <startPaths>
                <path>${path}</path>
            </startPaths>
            <numThreads>20</numThreads>
            <keepDownloads>false</keepDownloads>
            <documentFilters>
                <filter
                    class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                    onMatch="include" caseSensitive="false">.*</filter>
            </documentFilters>
            <importer>
                <postParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.SubstringTransformer" begin="0" end="0" />
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>dc.title,collector.filesize,collector.lastmodified</fields>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                        <rename fromField="dc.title"              toField="title" overwrite="true" />
                        <rename fromField="collector.filesize"     toField="size" overwrite="true" />
                        <rename fromField="collector.lastmodified" toField="lastmodified" overwrite="true" />
                    </tagger>
                </postParseHandlers>

            </importer>
            <committer class="com.norconex.committer.sql.SQLCommitter">
                <driverClass>com.mysql.jdbc.Driver</driverClass>
                <connectionUrl>jdbc:mysql://localhost:3306/servername</connectionUrl>
                <username>username</username>
                <password>password</password>
                <tableName>crawler</tableName>

                <fixFieldNames>true</fixFieldNames>
                <fixFieldValues>true</fixFieldValues>

                <queueSize>100</queueSize>
                <commitBatchSize>10</commitBatchSize>
            </committer>
        </crawler>
    </crawlers>
</fscollector>
essiembre commented 6 years ago

It turns out the Importer interprets transforming content into "nothing" to be the same as no transformation being done, so the original content remains unchanged. I will mark it as a feature request to be able to easily remove the content from a document.

In the meantime, you can use the ReplaceTransformer instead to replace the content with a space or any value you want (as opposed to an empty string). See if that makes a difference:

  <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
          <fromValue>.*</fromValue>
          <toValue xml:space="preserve"> </toValue>
      </replace>
  </transformer>
pkasson commented 6 years ago

Woo hoo. That worked. I have not yet told it to skip XML, but from the point you made in item 3 above, I can't find a good example of how to do this:

The importer module allows you to modify the parser used with its configuration tag <documentParserFactory>. You can tell the default implementation to ignore some content types. See the <ignoredContentTypes> configuration, described in GenericDocumentParserFactory.

I don't see where to get the tags from Tika, for things like file creation date, modified, etc. It looks like Tika would have different metadata for different file types, that is normal, but for each type, where is the list of tags to put into the tagger section. Things like dc.title and collector.filesize - where is that master list ?

essiembre commented 6 years ago

You can find a list of common fields in these locations:

There is no master list. There are common or frequent ones, but document authors can add extra properties to a file. PDF, Word, etc, can all have custom properties.

One thing you can do other than committing all fields is using the DebugTagger to log all fields accumulated so far.
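
For example, as a post-parse handler (a sketch; check the DebugTagger documentation for its exact attributes):

    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO" />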

pkasson commented 6 years ago

Thanks for that info - it did help by showing content from a filesystem crawl. However, even with Tika, in a simple test I ran it does NOT show the file created date or modified date, yet your collector has the last modified date ... so I am confused how the metadata lookup is working. I am looking to get the create date for a file, and here is what my debug tagger showed:

collector.content-type=
document.contentFamily=other
X-Parsed-By=org.apache.tika.parser.EmptyParser
collector.lastmodified=1524222148000
collector.content-encoding=
document.contentType=application/octet-stream
collector.filesize=6148
document.reference=file:///Users/someone/temp/.DS_Store

So, I would expect the collector to have a created or date value just like lastmodified. I tried date and created, etc, but nothing was found.

essiembre commented 6 years ago

The date created is not provided by the Collector itself. So if it is not provided by Tika, you will not have it. If you need it, please create a new ticket about it and we'll make it a new feature request.

For new questions (not related to your original one), please create new issues.

pkasson commented 6 years ago

Just submitted the issue. Thanks.

pkasson commented 6 years ago

I think I got it from here ... just two last questions and I will close this issue:

  1. I am capturing the file details first, and later will want to do actual parsing of content based on a selection of the crawled fields. Is it possible to flag certain files / mime types in the database (maybe through a custom crawling process) to selectively obtain content after the initial pass that catalogs all files ?
  2. Are there callback hooks in the application to notify the invoking app (what I write) that a file was updated, removed, or added, etc., during a session, for a given starting folder location ?
essiembre commented 6 years ago
  1. You can add a constant value that will be sent to your Committer with the other extracted fields (see the sketch after this list). If you want to crawl from a dynamic list coming from your database, here are two options:

    • Write an external script that will generate a pathsFile holding all your paths.
    • Write a custom IStartPathsProvider that pulls what you need from your database.
  2. The committer is notified of all updates and deletions. You can create a custom committer wrapping another one. You can also create your own ICrawlerEventListener implementation to be notified of many different types of crawling events.
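
Rough sketches of the constant field and the paths file (not tested; verify the ConstantTagger options and the pathsFile element against the Importer and Filesystem Collector documentation - the field name and value below are made up):

    <!-- In the <importer> handlers: add a constant field committed with every document. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
        <constant name="crawlPass">catalog-only</constant>
    </tagger>

    <!-- In the crawler section: feed start paths from a file generated from your database. -->
    <startPaths>
        <pathsFile>${workdir}/paths-to-crawl.txt</pathsFile>
    </startPaths>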

pkasson commented 6 years ago

Not sure yet how to pass other fields to the committer, but it is simple enough to put some default fields in the table, which would accomplish the same thing. So this custom process would select rows with paths to crawl, and then any content normally parsed would be re-parsed ? I thought if the crawler found the same data it would skip it.

For #2, I am referring to handling errors during the crawling process, not the committer process. Problems I see are things like:

com.norconex.importer.parser.DocumentParserException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@2f80a608 at com.norconex.importer.parser.impl.AbstractTikaParser.parseDocument(AbstractTikaParser.java:154)

2018-04-30 05:28:34,367 INFO [pool-1-thread-1]: DOCUMENT_COMMITTED_ADD - DOCUMENT_COMMITTED_ADD: file:///Volumes/Data/Work/Documents/Database2.accdb
Caught and handled this exception :
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at com.github.jaiimageio.impl.common.ImageUtil.processOnRegistration(ImageUtil.java:1401)
    at com.github.jaiimageio.impl.plugins.wbmp.WBMPImageWriterSpi.onRegistration(WBMPImageWriterSpi.java:103)
    at java.desktop/javax.imageio.spi.SubRegistry.registerServiceProvider(ServiceRegistry.java:788)
    at java.desktop/javax.imageio.spi.ServiceRegistry.registerServiceProvider(ServiceRegistry.java:330)
    at java.desktop/javax.imageio.spi.IIORegistry.registerApplicationClasspathSpis(IIORegistry.java:212)
    at java.desktop/javax.imageio.spi.IIORegistry.<clinit>(IIORegistry.java:136)
    at java.desktop/javax.imageio.spi.IIORegistry.getDefaultInstance(IIORegistry.java:157)
    at java.desktop/javax.imageio.ImageIO.<clinit>(ImageIO.java:66)
    at org.apache.tika.parser.image.ImageParser.parse(ImageParser.java:179)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

essiembre commented 6 years ago

It looks like I may not understand fully what you are asking. You are right, the Collector will not reprocess unmodified files. It stores a checksum to figure that out. You can influence how that checksum gets created. If you want to store only specific fields, have a look at the Importer module. It has a few options to rename, copy, delete, keep, etc. fields.

For capturing errors or other crawler events, you can implement an ICrawlerEventListener and register it in your crawler configuration.
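
A sketch of how that registration typically looks in the crawler section (the <crawlerListeners> tag name should be verified against the crawler configuration reference; the listener class below is hypothetical):

    <crawlerListeners>
        <listener class="com.example.MyCrawlerEventListener" />
    </crawlerListeners>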

pkasson commented 6 years ago

Sorry if I was confusing. I would like to accomplish this:

  1. First pass, collect files, but don't parse any content
  2. Depending on some criteria, parse content of files

This way, the whole list of files can be catalogued, but not parsed, only files of a certain type, at a certain time could be parsed. I suppose the particular data in the database could be deleted and the parser (from step #2 above) would see it as a new file, or use a second committer table with a separate configuration.

pkasson commented 6 years ago

What is the trick for ignoring files ... your comment earlier on ignoredContentTypes - I only found a few references on how to configure it, and assumed it takes a MIME type or just a plain file suffix; I tried this but it is still parsing XML:

                <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
                   <ignoredContentTypes>application/xml,*/xml,*/pdf,application/pdf</ignoredContentTypes>
                </documentParserFactory>
essiembre commented 6 years ago

I understand better now.

Any reason to do it in two passes like that? Can you filter what you want on your first crawl and simply recrawl as needed? Otherwise, you will end up with a copy of every file, only to achieve the same result in the end. Is it because your files are on a network with limited access, or something like that?

If you truly need to copy all files, you can have a look at the <keepDownloads>true</keepDownloads> flag and it will store in a download folder all files it crawled. I suppose you can recrawl from there afterward.

For ignoring some content types, did you put that in the <importer> section? Make sure you have no configuration validation errors at the beginning of your log.

Also, have you tried keeping the content type (or print using DebugTagger) to confirm they match the list you configured?

pkasson commented 6 years ago

Yes, there are a few reasons for two passes. One is the remote location of the files and the sheer size of the data - we do not want to keepDownloads. The second is to get an initial catalog of all data and then determine what to parse. This parse list would potentially change with what is being targeted.

I have the debug tagger enabled and verified the content type: document.contentType=application/xml

It is defined in the importer section:


   <importer>
        <parseErrorsSaveDir>${parseErrorsSaveDir}</parseErrorsSaveDir>
        <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
            <ignoredContentTypes>application/xml,application/pdf</ignoredContentTypes>
        </documentParserFactory>
essiembre commented 6 years ago

I am afraid you will have to put in place something custom (like a custom Committer) for the double pass.

If your objective is to get "raw" files in your first pass, maybe a simple remote copy would work? Then you run the Collector on those copied files?

pkasson commented 6 years ago

Understood. I am still not successfully avoiding the parsing of types such as XML and PDF - see my post above - what is wrong with my documentParserFactory configuration ?

essiembre commented 6 years ago

Your <ignoredContentTypes> does not contain a valid regular expression. It should be more like this (not tested):

<ignoredContentTypes>.*/xml|.*/pdf</ignoredContentTypes>
pkasson commented 6 years ago

Yes to your earlier comment - a custom double pass. I will try the regex ... just ran it and no parsing issues - those two mime types were ignored - thanks !

pkasson commented 6 years ago

Ok - for our 2nd custom run on certain folders, we will perhaps use a 2nd database, since the first run already stored the files and they would otherwise be skipped.

Now that this is working, is there a wiki / doc to explain how to do the following and how to store:

  • Extract text out of many file formats (HTML, PDF, Word, etc.)
  • Extract metadata associated with documents.
  • OCR support on images and PDFs.

essiembre commented 6 years ago

Extracting content + metadata from many formats is automatically done for you (transparent). You use a Committer to store it in a repository of your choice.

For OCR, have a look at GenericDocumentParserFactory.
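
For example, a rough sketch of enabling OCR (not tested; verify the <ocr> options against the GenericDocumentParserFactory documentation - it also requires Tesseract to be installed, and the path below is just an example):

    <documentParserFactory class="com.norconex.importer.parser.GenericDocumentParserFactory">
        <ocr path="/usr/local/bin/">
            <languages>eng</languages>
            <contentTypes>image/.*,application/pdf</contentTypes>
        </ocr>
    </documentParserFactory>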

pkasson commented 6 years ago

So the committed content can be stored in a separate repository, and not the MySQL database ?

essiembre commented 6 years ago

It can be stored wherever you want. That is the job of the Committer to do so. You can use one of the existing committers (like I think you are doing with the SQL Committer) or create your own.

pkasson commented 6 years ago

Great, thanks for all your help in getting us started ... I will close this issue now.

hardreddata commented 4 years ago

Very useful discussion. Thanks.