dkm05midhra / crawler4j

Automatically exported from code.google.com/p/crawler4j

Quartz scheduler + crawler4J http connection error #254

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a job with Quartz
2. Insert the crawler4j controller in the Quartz execute() method
3. Execute the job (a minimal scheduling sketch follows this list)
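
For reference, a minimal sketch of how such a job might be created and fired, assuming Quartz 2.x; CrawlJob and the identity names are hypothetical stand-ins for the job class whose execute() method appears below:

    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.Scheduler;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    public class CrawlJobRunner {
        public static void main(String[] args) throws Exception {
            // Build the job and pass the target sites through the job data map
            JobDetail job = JobBuilder.newJob(CrawlJob.class)
                    .withIdentity("crawlJob", "crawlGroup")
                    .usingJobData("sites", "http://example.fr;http://example2.fr")
                    .build();

            // Fire the job immediately
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("crawlTrigger", "crawlGroup")
                    .startNow()
                    .build();

            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            scheduler.start();
            scheduler.scheduleJob(job, trigger);
        }
    }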

Hi,

I'm trying to combine the Quartz scheduler with crawler4j.

The problem is that when I execute the crawler4j code from a plain main method it works fine, but when it runs inside the Quartz Job's execute() method, there is an HTTP connection error.

We are working behind a proxy, but it is already configured within crawler4j, and we even tried configuring it in Quartz as well.

Do you know if Quartz can block the HTTP connection?

Error stack trace:

    Exception in thread "Crawler 1" java.lang.NoSuchFieldError: DEF_PROTOCOL_CHARSET
    at org.apache.http.auth.params.AuthParams.getCredentialCharset(AuthParams.java:64)
    at org.apache.http.impl.auth.BasicScheme.authenticate(BasicScheme.java:157)
    at org.apache.http.client.protocol.RequestAuthenticationBase.authenticate(RequestAuthenticationBase.java:125)
    at org.apache.http.client.protocol.RequestAuthenticationBase.process(RequestAuthenticationBase.java:83)
    at org.apache.http.client.protocol.RequestProxyAuthentication.process(RequestProxyAuthentication.java:89)
    at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:108)
    at org.apache.http.protocol.HttpRequestExecutor.preProcess(HttpRequestExecutor.java:174)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:515)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchHeader(PageFetcher.java:156)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:232)
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:189)
    at java.lang.Thread.run(Thread.java:662)

The execute() method:

    // Imports used by this snippet (taken from a class implementing org.quartz.Job)
    import java.util.regex.Pattern;

    import org.quartz.JobDataMap;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;
    import org.quartz.JobKey;

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    @Override
    public void execute(JobExecutionContext context)
            throws JobExecutionException {

        JobKey key = context.getJobDetail().getKey();
        JobDataMap dataMap = context.getJobDetail().getJobDataMap();

        // Target sites are passed in through the job's data map
        String[] sitesTab = dataMap.getString("sites").split(";");

        int numberOfCrawlers = 2;
        String storageFolder = "C:\\...";

        // Crawl configuration, including the proxy settings mentioned above
        CrawlConfig config = new CrawlConfig();
        config.setProxyHost("...");
        config.setProxyPort(3128);
        config.setProxyUsername("...");
        config.setProxyPassword("...");
        config.setMaxDepthOfCrawling(2);
        config.setCrawlStorageFolder(storageFolder);
        config.setIncludeBinaryContentInCrawling(true);

        String[] crawlDomains = new String[] { "http://www.....fr/" };

        // Robots.txt handling is disabled for this crawl
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig,
                pageFetcher);

        try {
            CrawlController controller = new CrawlController(config, pageFetcher,
                    robotstxtServer);
            for (String domain : crawlDomains) {
                controller.addSeed(domain);
            }

            // Only images of at least 150x150 pixels are of interest
            int minWidth = 150;
            int minHeight = 150;
            Pattern p = Pattern.compile(".*(\\.(bmp|gif|jpe?g|png))$");
            SportifsWebCrawler.configure(crawlDomains, storageFolder, p,
                    minWidth, minHeight);

            // start() blocks until the crawl finishes
            controller.start(SportifsWebCrawler.class, numberOfCrawlers);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
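
For context, SportifsWebCrawler itself is not shown. Below is a minimal sketch of what such a WebCrawler subclass might look like against the crawler4j 3.x API; the configure() helper and the static fields are hypothetical, and only shouldVisit() and visit() are the real crawler4j overrides:

    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class SportifsWebCrawler extends WebCrawler {

        // Hypothetical static state filled in by the configure() call above
        private static String[] domains;
        private static Pattern imagePattern;

        public static void configure(String[] crawlDomains, String storageFolder,
                Pattern pattern, int minWidth, int minHeight) {
            domains = crawlDomains;
            imagePattern = pattern;
            // storageFolder, minWidth and minHeight would be kept the same way
        }

        @Override
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            // Follow pages on the seed domains, plus any URL matching the image pattern
            for (String domain : domains) {
                if (href.startsWith(domain)) {
                    return true;
                }
            }
            return imagePattern.matcher(href).matches();
        }

        @Override
        public void visit(Page page) {
            // Image filtering and saving logic would go here
        }
    }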

Thanks for helping :)

Original issue reported on code.google.com by stratege...@gmail.com on 5 Feb 2014 at 1:30