Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Mongo exception when crawling files of contenttype application/rss+xml #242

Closed OkkeKlein closed 8 years ago

OkkeKlein commented 8 years ago

My Crawler Name: 2016-04-20 14:41:50 ERROR - My Crawler Name: Could not mark reference as processed: URL (can't serialize class com.norconex.commons.lang.file.ContentType)
java.lang.IllegalArgumentException: can't serialize class com.norconex.commons.lang.file.ContentType
    at org.bson.BasicBSONEncoder._putObjectField(BasicBSONEncoder.java:299)
    at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:194)
    at org.bson.BasicBSONEncoder._putObjectField(BasicBSONEncoder.java:255)
    at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:194)
    at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:136)
    at com.mongodb.DefaultDBEncoder.writeObject(DefaultDBEncoder.java:36)
    at com.mongodb.BSONBinaryWriter.encodeDocument(BSONBinaryWriter.java:339)
    at com.mongodb.UpdateCommandMessage.writeTheWrites(UpdateCommandMessage.java:48)
    at com.mongodb.UpdateCommandMessage.writeTheWrites(UpdateCommandMessage.java:23)
    at com.mongodb.BaseWriteCommandMessage.encodeMessageBody(BaseWriteCommandMessage.java:69)
    at com.mongodb.BaseWriteCommandMessage.encodeMessageBody(BaseWriteCommandMessage.java:23)
    at com.mongodb.RequestMessage.encode(RequestMessage.java:66)
    at com.mongodb.BaseWriteCommandMessage.encode(BaseWriteCommandMessage.java:53)
    at com.mongodb.DBCollectionImpl.sendWriteCommandMessage(DBCollectionImpl.java:520)
    at com.mongodb.DBCollectionImpl.access$200(DBCollectionImpl.java:48)
    at com.mongodb.DBCollectionImpl$2.execute(DBCollectionImpl.java:470)
    at com.mongodb.DBCollectionImpl$2.execute(DBCollectionImpl.java:461)
    at com.mongodb.DBPort.doOperation(DBPort.java:187)
    at com.mongodb.DBTCPConnector.doOperation(DBTCPConnector.java:208)
    at com.mongodb.DBCollectionImpl.writeWithCommandProtocol(DBCollectionImpl.java:461)
    at com.mongodb.DBCollectionImpl.updateWithCommandProtocol(DBCollectionImpl.java:456)
    at com.mongodb.DBCollectionImpl.update(DBCollectionImpl.java:270)
    at com.mongodb.DBCollection.update(DBCollection.java:214)
    at com.mongodb.DBCollection.update(DBCollection.java:247)
    at com.norconex.collector.core.data.store.impl.mongo.MongoCrawlDataStore.processed(MongoCrawlDataStore.java:203)
    at com.norconex.collector.core.crawler.AbstractCrawler.finalizeDocumentProcessing(AbstractCrawler.java:636)
    at com.norconex.collector.core.crawler.AbstractCrawler.processImportResponse(AbstractCrawler.java:544)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:491)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:377)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:735)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

essiembre commented 8 years ago

The unit tests for the project with Mongo are successful, so I wonder if you have a version mismatch.

Which versions of the Collector HTTP and Collector Core are you using? They should be 2.5.0 and 1.5.0 respectively (both snapshots).

To find out, you can check the jar file names in your lib folder, or copy the first few lines of your log, which print the versions (if set).
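
For anyone checking versions from the jar file names, here is a minimal Java sketch. It assumes the jars follow the usual Maven `artifactId-version.jar` naming, with either a `-SNAPSHOT` suffix or a timestamped snapshot suffix. The class name `JarVersionSketch`, the regular expression, and the default `lib` folder argument are illustrative assumptions, not part of any Norconex tooling.

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JarVersionSketch {

    // Assumed naming convention: <artifactId>-<version>.jar, where the version
    // is dotted digits optionally followed by -SNAPSHOT or a timestamped
    // snapshot suffix such as -20160420.203027-2.
    private static final Pattern JAR_NAME = Pattern.compile(
            "^(.+?)-(\\d+(?:\\.\\d+)*(?:-SNAPSHOT|-\\d{8}\\.\\d{6}-\\d+)?)\\.jar$");

    /** Returns the version part of a jar file name, if the name matches. */
    public static Optional<String> versionOf(String jarFileName) {
        Matcher m = JAR_NAME.matcher(jarFileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Folder to scan; defaults to "lib" relative to the working directory.
        File dir = new File(args.length > 0 ? args[0] : "lib");
        String[] names = dir.list((d, n) -> n.endsWith(".jar"));
        if (names == null) {
            System.err.println("Not a readable folder: " + dir);
            return;
        }
        for (String name : names) {
            System.out.println(name + " -> " + versionOf(name).orElse("unknown"));
        }
    }
}
```

Running it against the collector's lib folder prints one line per jar, which makes a stale collector-core jar easy to spot.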

OkkeKlein commented 8 years ago

[non-job]: 2016-04-20 14:50:19 INFO - Version: Norconex HTTP Collector 2.5.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2016-04-20 14:50:19 INFO - Version: Norconex Collector Core 1.5.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2016-04-20 14:50:19 INFO - Version: Norconex Importer 2.5.2-SNAPSHOT (Norconex Inc.)
[non-job]: 2016-04-20 14:50:19 INFO - Version: Norconex JEF 4.0.7 (Norconex Inc.)
[non-job]: 2016-04-20 14:50:19 INFO - Version: Norconex Committer Core 2.0.3 (Norconex Inc.)

essiembre commented 8 years ago

Strange. Is it just that content type? I will look at serializing that object a different way and will update you.

essiembre commented 8 years ago

Can you try replacing the collector-core jar with the latest snapshot? You can download just the jar here.

NOTE: you will need to perform a clean crawl (delete your crawl store) because this fix changes how the ContentType gets serialized for Mongo and will not be compatible.
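
To illustrate the general idea behind this kind of fix, here is a minimal sketch. It is not the actual collector-core change. It assumes the custom object is stored as its plain string form, which a BSON encoder can handle, and is rebuilt on read. `MediaType` is a hypothetical stand-in for `ContentType`, and a plain `Map` stands in for the Mongo document. The field names are assumptions as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ContentTypeStorageSketch {

    /** Hypothetical stand-in for a custom value class such as ContentType. */
    public static final class MediaType {
        private final String value;

        public MediaType(String value) {
            this.value = value;
        }

        @Override
        public String toString() {
            return value;
        }
    }

    /** Builds a document-like map that holds only simple, encodable types. */
    public static Map<String, Object> toDocument(String reference, MediaType type) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("reference", reference);
        // Store the string form instead of the object itself, since the
        // encoder in the stack trace rejects classes it does not know.
        doc.put("contentType", type == null ? null : type.toString());
        return doc;
    }

    /** Rebuilds the value object from its stored string form. */
    public static MediaType readContentType(Map<String, Object> doc) {
        Object raw = doc.get("contentType");
        return raw == null ? null : new MediaType((String) raw);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = toDocument(
                "http://example.com/feed", new MediaType("application/rss+xml"));
        System.out.println(doc);
        System.out.println(readContentType(doc));
    }
}
```

A change in stored form like this also shows why a clean crawl is needed: records written in the old form would not read back correctly under the new one.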

OkkeKlein commented 8 years ago

I just tested on HTML content. Same problem.

BTW, the download link to the collector-core snapshot returns a 404.

essiembre commented 8 years ago

I fixed the link. Here it is again: https://oss.sonatype.org/content/repositories/snapshots/com/norconex/collectors/norconex-collector-core/1.5.0-SNAPSHOT/norconex-collector-core-1.5.0-20160420.203027-2.jar

OkkeKlein commented 8 years ago

That one works.

essiembre commented 8 years ago

You mean the link or the fix? :-) Can I close?

OkkeKlein commented 8 years ago

Yes. You can close.

essiembre commented 8 years ago

Thanks for confirming. FYI, I just made a new snapshot of the HTTP Collector that contains the Mongo-fixed collector-core.