bejean / crawl-anywhere

Crawl-Anywhere - Web Crawler and document processing pipeline with Solr integration.
www.crawl-anywhere.com
Apache License 2.0

first crawl date is null #39

Closed torhar closed 10 years ago

torhar commented 11 years ago

The first crawl date of an item is null when I try to rescan a source. This leads to a NumberFormatException in WebConnector.java:

    String firstCrawlDate = StringUtils.trimToEmpty(queue.getCreated(itemData));
    Date d = null;
    if ("".equals(firstCrawlDate)) {
        d = new Date();
    } else {
        d = new Date(Long.parseLong(firstCrawlDate));
    }
    params.put("firstCrawlDate", dateFormat.format(d.getTime()));

A quick solution would be to check for

    if (firstCrawlDate == null || "".equals(firstCrawlDate)) {

but I don't know whether it is correct in this situation for firstCrawlDate to be null.
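For reference, a minimal sketch of what the patched block might look like. Note that StringUtils.trimToEmpty already maps null to "", so the remaining failure mode is presumably a non-numeric string (such as the literal "null"); the try/catch fallback below is an assumption, not the project's actual fix, and mirrors the existing empty-string branch:

    String firstCrawlDate = StringUtils.trimToEmpty(queue.getCreated(itemData));
    Date d;
    if (firstCrawlDate == null || "".equals(firstCrawlDate)) {
        d = new Date();
    } else {
        try {
            d = new Date(Long.parseLong(firstCrawlDate));
        } catch (NumberFormatException e) {
            // Defensive fallback: treat an unparsable date as "first crawl now".
            d = new Date();
        }
    }
    params.put("firstCrawlDate", dateFormat.format(d.getTime()));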

torhar commented 11 years ago

And in Source.java, srcData.get(name) seems to be a Long, so

    protected int getSrcDataInt(String name) {
        if (srcData.containsKey(name))
            return ((Integer) srcData.get(name)).intValue();
        String value = getSrcDataString(name); //.replace(".0", "");
        return Integer.parseInt(value);
    }

must be

    protected int getSrcDataInt(String name) {
        if (srcData.containsKey(name))
            return ((Long) srcData.get(name)).intValue();
        String value = getSrcDataString(name); //.replace(".0", "");
        return Integer.parseInt(value);
    }
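A more defensive variant would avoid hard-coding either wrapper type. This is a sketch, not the project's actual fix, assuming srcData values can arrive as any java.lang.Number (Integer from BSON int32 fields, Long from int64 fields):

    protected int getSrcDataInt(String name) {
        if (srcData.containsKey(name)) {
            Object raw = srcData.get(name);
            // Number covers both Integer (BSON int32) and Long (BSON int64),
            // so the cast no longer depends on how the value was stored.
            if (raw instanceof Number)
                return ((Number) raw).intValue();
        }
        String value = getSrcDataString(name); //.replace(".0", "");
        return Integer.parseInt(value);
    }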

bejean commented 11 years ago

What is the issue with the Long to Integer cast? Is there an issue with one of the source setting parameters? I don't think so!

torhar commented 11 years ago

At least one source parameter is a Long, so the cast to Integer fails; maybe this happens only in our environment.
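This matches how the MongoDB Java driver maps BSON types: int32 fields come back as java.lang.Integer and int64 fields as java.lang.Long. A minimal, self-contained demonstration; the in-memory BasicDBObject stands in for a real query result, and the field name id_target is taken from the log posted later in this thread:

    import com.mongodb.BasicDBObject;

    public class CastDemo {
        public static void main(String[] args) {
            // A field written as a 64-bit integer (NumberLong) deserializes as Long.
            BasicDBObject srcData = new BasicDBObject();
            srcData.put("id_target", 4L);

            Object value = srcData.get("id_target");
            System.out.println(value.getClass()); // class java.lang.Long

            // This is the failing pattern from Source.getSrcDataInt:
            int i = ((Integer) value).intValue(); // throws ClassCastException
        }
    }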

bejean commented 11 years ago

Can you provide the XML export of your source settings (export function)?

torhar commented 10 years ago

This issue got the label "bug"; do you still need the XML export of the source to investigate the Long/Integer cast in Source.java?

bejean commented 10 years ago

Yes please.

torhar commented 10 years ago

(See attached file: 525288c536c04.xml)

bejean commented 10 years ago

Please send the file to contact@crawl-anywhere.com

bejean commented 10 years ago

Hi,

I can't reproduce these issues, even with your export. Can you provide the exact scenario for each of these two issues? Which source parameter is a Long? Which version of MongoDB are you using? Is it a 64-bit version?

Regards.

torhar commented 10 years ago

    $ mongod --version
    db version v2.2.3, pdfile version 4.5
    Fri Oct 18 11:38:20 git version: nogitversion

64 bit

torhar commented 10 years ago

    Fri Oct 18 13:29:11 CEST 2013 - =================================
    Fri Oct 18 13:29:11 CEST 2013 - Crawler starting (version: 4.0.0)
    Fri Oct 18 13:29:11 CEST 2013 - Simultaneous sources crawled : 3
    Fri Oct 18 13:29:11 CEST 2013 - account : 1
    Fri Oct 18 13:29:11 CEST 2013 -
    Fri Oct 18 13:29:11 CEST 2013 - =================================
    Fri Oct 18 13:29:11 CEST 2013 -
    Fri Oct 18 13:29:11 CEST 2013 - Sources to be crawled : 1
    Fri Oct 18 13:29:11 CEST 2013 - Pushing source : 4
    Fri Oct 18 13:29:11 CEST 2013 - Source data key-name: id_target
    Fri Oct 18 13:29:11 CEST 2013 - Source data key-class: class java.lang.Long
    Fri Oct 18 13:29:11 CEST 2013 - java.lang.Long cannot be cast to java.lang.Integer
    Fri Oct 18 13:29:11 CEST 2013 - >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Fri Oct 18 13:29:11 CEST 2013 - >>>> Error = java.lang.Long cannot be cast to java.lang.String
    Fri Oct 18 13:29:11 CEST 2013 - = java.lang.Thread.run(Thread.java:662)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.connectors.Source.getSrcDataString(Source.java:142)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.connectors.Source.getSrcDataInt(Source.java:124)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.connectors.Source.getTargetId(Source.java:205)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.connectors.Connector.initializeInternal(Connector.java:50)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.connectors.web.WebConnector.initialize(WebConnector.java:79)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.ProcessorSource.call(ProcessorSource.java:55)
    Fri Oct 18 13:29:11 CEST 2013 - fr.eolya.crawler.ProcessorSource.call(ProcessorSource.java:20)
    Fri Oct 18 13:29:11 CEST 2013 - java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    Fri Oct 18 13:29:11 CEST 2013 - java.util.concurrent.FutureTask.run(FutureTask.java:138)
    Fri Oct 18 13:29:11 CEST 2013 - java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    Fri Oct 18 13:29:11 CEST 2013 - java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    Fri Oct 18 13:29:11 CEST 2013 - java.lang.Thread.run(Thread.java:662)

The log was produced with the following code snippet:

    try {
        if (srcData.containsKey(name))
            return ((Integer) srcData.get(name)).intValue();
    } catch (Exception e) {
        logger.log("Source data key-name: " + name);
        logger.log("Source data key-class: " + srcData.get(name).getClass());
        logger.log(e.getMessage());
    }
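Note that the trace also shows getSrcDataString failing on the same value (java.lang.Long cannot be cast to java.lang.String at Source.java:142). A sketch of a type-tolerant variant, under the assumption that the method simply reads from the same srcData map (its real body is not shown in this thread):

    protected String getSrcDataString(String name) {
        Object value = srcData.get(name);
        if (value == null) return "";
        // String.valueOf stringifies Integer, Long, and String alike,
        // avoiding the ClassCastException seen in the trace.
        return String.valueOf(value);
    }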

bejean commented 10 years ago

Hi,

Thank you for this trace. Did you set up something specific about the target? Did you create a target? Did you change the target for your source?

Dominique

bejean commented 10 years ago

Please send me your file Source.java by email.

bejean commented 10 years ago

I tried various things, but it is still impossible to reproduce. Can you provide an export of your MongoDB database (without the pages* collections)?
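One way to produce such an export while skipping the crawl-page collections; the database name crawler is a placeholder, and the --excludeCollectionsWithPrefix flag assumes a mongodump release newer than the 2.2.x reported above:

    mongodump --db crawler --excludeCollectionsWithPrefix pages --out ./dump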

erik2e commented 10 years ago

Hello,

We have the same problem. The installation seems to be fine, and we entered our sources (~140), but crawling never starts, with the cast exception mentioned in this issue.

Did you find any solution/workaround?

Thanks

bejean commented 10 years ago

Can you provide me a MongoDB export?

erik2e commented 10 years ago

Thanks for your quick answer. I just sent the export to contact at crawl-anywhere.com.

bejean commented 10 years ago

Fixed