Closed aleha84 closed 7 years ago
This feature is now present in the latest snapshot release. Please confirm.
Keep in mind the GenericURLNormalizer lets you use regular expressions as well to modify URLs as you see fit.
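For illustration, here is a minimal sketch of the kind of regular expression that would strip a trailing hash. It uses only `java.util.regex` rather than the GenericURLNormalizer configuration itself, and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class RegexUrlCleanup {
    // Matches a single "#" sitting at the very end of the URL.
    private static final Pattern TRAILING_HASH = Pattern.compile("#$");

    // Hypothetical helper: returns the URL without a trailing "#".
    public static String stripTrailingHash(String url) {
        return TRAILING_HASH.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        // A bare trailing hash is removed.
        System.out.println(stripTrailingHash("http://www.example.com/display#"));
        // A real fragment is left alone because "#" is not the last character.
        System.out.println(stripTrailingHash("http://www.example.com/display#top"));
    }
}
```

The same pattern and an empty replacement could presumably be supplied as a custom replacement in the normalizer configuration, but check the GenericURLNormalizer documentation for the exact syntax.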
ERROR - Could not apply normalization "removeTrailingHash".
java.lang.NoSuchMethodException: No such accessible method: removeTrailingHash() on object: com.norconex.commons.lang.url.URLNormalizer
at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:239)
at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:208)
at com.norconex.collector.http.url.impl.GenericURLNormalizer.normalizeURL(GenericURLNormalizer.java:213)
at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline$URLNormalizerStage.executeStage(HttpQueuePipeline.java:125)
at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:250)
at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:151)
at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:137)
at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:128)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Are you using the new snapshot entirely? Some dependencies were updated that contribute to this fix, so you can't update just the collector jar.
I have downloaded the HTTP Collector snapshot zip. Which dependencies are also needed?
What is this one? norconex-collector-http-2.7.0-20170328.185911-15.zip
Did you overwrite a previous installation or start fresh? Maybe you have duplicate jars (same name, but different version)? The zip has an updated version of this file: norconex-commons-lang-1.13.0-SNAPSHOT.jar
which you need (your error suggests you do not have the latest).
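A quick way to check for this kind of conflict is to group the jar file names by artifact and flag any artifact that appears more than once. The sketch below is only an illustration: the class name is made up, it assumes jars follow an `<artifact>-<version>.jar` naming convention where the version starts with a digit, and the `commons-lang3-3.5.jar` entry is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateJarCheck {
    // Assumed convention: <artifact>-<version>.jar, version starting with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("^(.+?)-(\\d.*)\\.jar$");

    // Returns only the artifacts that are present in more than one version.
    public static Map<String, List<String>> findDuplicates(List<String> fileNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(name);
            }
        }
        // Drop artifacts that have a single jar; what remains is a conflict.
        byArtifact.values().removeIf(jars -> jars.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) {
        List<String> lib = Arrays.asList(
                "norconex-commons-lang-1.13.0-20170328.184247-17.jar",
                "norconex-commons-lang-1.13.0-SNAPSHOT.jar",
                "commons-lang3-3.5.jar"); // placeholder entry
        System.out.println(findDuplicates(lib));
    }
}
```

To run it against a real installation, pass the file names of the folder that holds the jars, for example the result of `new java.io.File("lib").list()` if your jars live in a `lib` directory.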
I see the problem. After unzipping the collector, I also installed the latest snapshot of the Elasticsearch committer, and it replaced the norconex-commons-lang-1.13.0-20170328.184247-17.jar file with the norconex-commons-lang-1.13.0-SNAPSHOT.jar version dated 02.03.2017 23:48.
If I choose the manual file override mode while installing the committer and check every lib, then it is OK, and there are no exceptions in the log.
Did you use the committer install script? If so, I will try to fix it to better handle these types of conflicts. I am otherwise closing since the original request has been implemented. Thanks for confirming.
Yes, the committer install script.
FYI, the collector and committer snapshot versions were updated so their versioning is no longer timestamped, which should eliminate this issue in the future.
Need an additional feature named removeTrailingHash, similar to removeTrailingQuestionMark and removeTrailingSlash.
Removes trailing hash sign ("#"). http://www.example.com/display# → http://www.example.com/display
Without it, the crawler currently creates a separate item for each such page, producing duplicates.
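To make the requested behavior concrete, here is a rough sketch of how such a normalization collapses the duplicates. It is a stand-in written with plain string operations, not the library's implementation; the class and the `uniqueUrls` helper are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TrailingHashDedup {
    // Stand-in for the requested normalization: drop one trailing "#".
    public static String removeTrailingHash(String url) {
        return url.endsWith("#") ? url.substring(0, url.length() - 1) : url;
    }

    // Hypothetical helper: normalizes each URL and keeps the first occurrence only.
    public static Set<String> uniqueUrls(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : urls) {
            unique.add(removeTrailingHash(url));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> discovered = Arrays.asList(
                "http://www.example.com/display",
                "http://www.example.com/display#");
        // Both links normalize to the same URL, so only one entry remains.
        System.out.println(uniqueUrls(discovered));
    }
}
```

Without the normalization step, the two links above would be treated as distinct URLs and each would produce its own crawled item.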