Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

mongodb connection issue Collector V3 #782

Closed angelo337 closed 2 years ago

angelo337 commented 2 years ago

hi there

I'm trying to change the default dataStoreEngine to use MongoDB. With the default data store my configuration file works just fine, but when I change the data store to Mongo I get an error and no crawling is possible at all. I also changed the JAR to a more recent version, but the same error remains. Here is the error output:

18:15:58.196 [Norconex Minimum Test Page#5] ERROR Crawler - An error occured that could compromise the stability of the crawler. Stopping excution to avoid further issues...
java.lang.RuntimeException: Failed to invoke java.time.ZoneId() with no args
    at com.google.gson.internal.ConstructorConstructor$3.construct(ConstructorConstructor.java:113) ~[gson-2.8.7.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:212) ~[gson-2.8.7.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[gson-2.8.7.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[gson-2.8.7.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$1.read(ReflectiveTypeAdapterFactory.java:131) ~[gson-2.8.7.jar:?]
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:222) ~[gson-2.8.7.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:932) ~[gson-2.8.7.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:897) ~[gson-2.8.7.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:846) ~[gson-2.8.7.jar:?]
    at com.google.gson.Gson.fromJson(Gson.java:817) ~[gson-2.8.7.jar:?]
    at com.norconex.collector.core.store.impl.mongodb.MongoDataStore.fromDocument(MongoDataStore.java:167) ~[norconex-collector-core-2.0.0.jar:2.0.0]
    at com.norconex.collector.core.store.impl.mongodb.MongoDataStore.unwrap(MongoDataStore.java:157) ~[norconex-collector-core-2.0.0.jar:2.0.0]
    at com.norconex.collector.core.store.impl.mongodb.MongoDataStore.deleteFirst(MongoDataStore.java:106) ~[norconex-collector-core-2.0.0.jar:2.0.0]
    at com.norconex.collector.core.doc.CrawlDocInfoService.pollQueue(CrawlDocInfoService.java:232) ~[norconex-collector-core-2.0.0.jar:2.0.0]
    at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:546) ~[norconex-collector-core-2.0.0.jar:2.0.0]
    at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:923) [norconex-collector-core-2.0.0.jar:2.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: java.lang.InstantiationException
    at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48) ~[?:1.8.0_312]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_312]
    at com.google.gson.internal.ConstructorConstructor$3.construct(ConstructorConstructor.java:110) ~[gson-2.8.7.jar:?]
    ... 18 more

also here is my configuration at this moment:

Collector and main components:

Collector: Norconex HTTP Collector 3.0.0 (Norconex Inc.)
Collector Core: Norconex Collector Core 2.0.0 (Norconex Inc.)
Importer: Norconex Importer 3.0.0 (Norconex Inc.)
Lang: Norconex Commons Lang 2.0.0 (Norconex Inc.)
Committer(s):
  Core: Norconex Committer Core 3.0.0 (Norconex Inc.)
  Solr: Norconex Committer Solr 3.0.0 (Norconex Inc.)
Runtime:
  Name: OpenJDK Runtime Environment
  Version: 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
  Vendor: Private Build

Also here is a snippet of my mongo db config:

 <dataStoreEngine class="MongoDataStoreEngine">
         <connectionString>mongodb://127.0.0.1:27017/crawler_salud</connectionString>
 </dataStoreEngine>

Could you please give me some help, or point me to a resource to fix this issue?

thanks for your time and effort. best regards angelo

UtsavVanodiya7 commented 2 years ago

Hello there,

I tried v3.0.0 on my Windows system. I used the minimum config and only changed the dataStoreEngine part for MongoDB. I did not add any JAR file, and it works fine for me. So we will need more details: which version of MongoDB do you use? Can you send the full config file, the name and version of the JAR you updated, and your operating system?

dutsuwak commented 2 years ago

I am getting the same error using MongoDB v4.4.12. Did you manage to resolve it @angelo337 ?

angelo337 commented 2 years ago

hi there, after several tests on Linux and a single test on Windows, this happened only on Linux. I am going to try a change in the Linux config, as Linux is a better fit for me. I will let you know. angelo

aparamythis commented 2 years ago

Hey everyone,

I don't think this is related to Windows vs Linux, but rather to the fact that "vanilla" Gson instances don't know how to serialize / deserialize ZonedDateTime properly. The fix requires that the creation of the Gson instance in MongoDataStore is changed as follows:

private static final Gson GSON = new GsonBuilder().registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeConverter()).create();

A minimal implementation of the adapter class:

import java.lang.reflect.Type;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonParseException;
import com.google.gson.JsonPrimitive;
import com.google.gson.JsonSerializationContext;
import com.google.gson.JsonSerializer;

public class ZonedDateTimeConverter implements JsonSerializer<ZonedDateTime>, JsonDeserializer<ZonedDateTime> {
    private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ISO_DATE_TIME;

    @Override
    public ZonedDateTime deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
            throws JsonParseException {
        return FORMATTER.parse(json.getAsString(), ZonedDateTime::from);
    }

    @Override
    public JsonElement serialize(ZonedDateTime src, Type typeOfSrc, JsonSerializationContext context) {
        return new JsonPrimitive(FORMATTER.format(src));
    }
}
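
To try the converter's string format without Gson or MongoDB on the classpath, here is a minimal JDK-only sketch of the same round trip. It reuses DateTimeFormatter.ISO_DATE_TIME and ZonedDateTime::from from the adapter above; the class name IsoZonedDateTimeCodec and its method names are hypothetical and exist only for this illustration.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not part of Norconex or Gson): exercises the same
// formatter calls as ZonedDateTimeConverter, using only the JDK.
final class IsoZonedDateTimeCodec {
    private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ISO_DATE_TIME;

    // Same formatting call as the adapter's serialize() method.
    static String toIso(ZonedDateTime value) {
        return FORMATTER.format(value);
    }

    // Same parsing call as the adapter's deserialize() method.
    static ZonedDateTime fromIso(String text) {
        return FORMATTER.parse(text, ZonedDateTime::from);
    }

    public static void main(String[] args) {
        ZonedDateTime original =
                ZonedDateTime.of(2022, 3, 1, 18, 15, 58, 0, ZoneId.of("Europe/Paris"));
        String text = toIso(original);
        ZonedDateTime restored = fromIso(text);
        System.out.println(text);
        // Date-time, offset, and zone should all survive the round trip.
        System.out.println(restored.equals(original));
    }
}
```

Because the value is written as a single string, Gson never has to reflect into the fields of ZonedDateTime when the adapter is registered.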

Hope this helps,

Alex

angelo337 commented 2 years ago

hi there, I don't really get it. Is it an issue with the data format? If so, why does it work on Windows and not on Linux? thanks

aparamythis commented 2 years ago

Hi there! In our case the problem was independent of whether we ran the system on Windows or on Linux; we always had the issue. Maybe the installations in your case are the same, but the data in Mongo is not? The issue usually manifests itself when deserialization is required.
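
One possible reading of the stack trace, offered as a sketch and not a confirmed diagnosis: the message "Failed to invoke java.time.ZoneId() with no args" means Gson tried to create a ZoneId through its no-argument constructor while reading a stored document. java.time.ZoneId is an abstract class, so that reflective call cannot succeed, which would match the InstantiationException in the trace. The JDK-only check below shows the abstract modifier; the class name ZoneIdInstantiationCheck is hypothetical.

```java
import java.lang.reflect.Modifier;
import java.time.ZoneId;

// Hypothetical diagnostic class: shows that ZoneId cannot be instantiated
// directly, which is what a reflective no-arg construction would require.
final class ZoneIdInstantiationCheck {

    // True when the given class is declared abstract.
    static boolean isAbstract(Class<?> type) {
        return Modifier.isAbstract(type.getModifiers());
    }

    public static void main(String[] args) {
        // ZoneId itself is abstract.
        System.out.println(isAbstract(ZoneId.class));
        // Instances come from factory methods and are subclasses of ZoneId.
        ZoneId zone = ZoneId.of("UTC");
        System.out.println(zone.getClass() != ZoneId.class);
    }
}
```

This would also explain why writing documents can appear to work while reading them back fails: only the read path needs to construct the object.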

angelo337 commented 2 years ago

hi Alexandros: I tested out of the box. I just installed Mongo, ran Norconex V3, and configured the data store to be Mongo, with no changes to any other parameters, on both Windows and Linux. best regards

aparamythis commented 2 years ago

hi again! ok, understood, maybe this is a different issue to the one we were having then. or, maybe it's related to jdk distributions / versions? can't say with any certainty, and I'm afraid we can't do more testing at the moment. nevertheless, the fix I mentioned above does work for serialization / deserialization of ZonedDateTime instances to mongo, and is the same for the jdbc store, so maybe this helps others that encounter similar problems :-)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.