Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

urlNormalizer normalizations add removeTrailingHash #331

Closed aleha84 closed 7 years ago

aleha84 commented 7 years ago

Please add a normalization named removeTrailingHash, analogous to the existing removeTrailingQuestionMark and removeTrailingSlash.

Removes trailing hash sign ("#"). http://www.example.com/display# → http://www.example.com/display

Without it, the crawler currently creates a separate item for each such page, producing duplicates.
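To illustrate the duplication described above, here is a minimal sketch in plain Java. It does not use the Norconex API; the class and method names (`TrailingHashDemo`, `removeTrailingHash`, `distinctAfterNormalization`) are hypothetical and only mirror the requested behavior.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone demo, not part of Norconex.
final class TrailingHashDemo {

    /** Drops a single trailing "#" (an empty fragment); other URLs are returned unchanged. */
    static String removeTrailingHash(String url) {
        return url.endsWith("#") ? url.substring(0, url.length() - 1) : url;
    }

    /** Normalizes each URL and keeps only the distinct results, in encounter order. */
    static Set<String> distinctAfterNormalization(List<String> urls) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String url : urls) {
            distinct.add(removeTrailingHash(url));
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> discovered = Arrays.asList(
                "http://www.example.com/display",
                "http://www.example.com/display#");
        // Both entries collapse to the same normalized URL.
        System.out.println(distinctAfterNormalization(discovered));
    }
}
```

Without the normalization step, the two URLs would be treated as different pages even though they point at the same document.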

essiembre commented 7 years ago

This feature is now present in the latest snapshot release. Please confirm.

Keep in mind the GenericURLNormalizer lets you use regular expressions as well to modify URLs as you see fit.
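As a rough idea of what such a regular expression could look like, the sketch below applies a pattern with `java.util.regex` directly. It only shows the pattern itself; how the pattern is supplied to GenericURLNormalizer is not shown here and should be checked against its documentation. The class name `RegexUrlRewrite` and the pattern `#$` are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone demo, not part of Norconex.
final class RegexUrlRewrite {

    // Assumed pattern: a "#" that is the very last character of the URL.
    private static final Pattern TRAILING_HASH = Pattern.compile("#$");

    /** Replaces a trailing "#" with nothing; URLs without one are returned unchanged. */
    static String stripTrailingHash(String url) {
        return TRAILING_HASH.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingHash("http://www.example.com/display#"));
        // A fragment with content is left alone because "#" is not the last character.
        System.out.println(stripTrailingHash("http://www.example.com/display#section"));
    }
}
```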

aleha84 commented 7 years ago

ERROR - Could not apply normalization "removeTrailingHash".
java.lang.NoSuchMethodException: No such accessible method: removeTrailingHash() on object: com.norconex.commons.lang.url.URLNormalizer
    at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:239)
    at org.apache.commons.lang3.reflect.MethodUtils.invokeExactMethod(MethodUtils.java:208)
    at com.norconex.collector.http.url.impl.GenericURLNormalizer.normalizeURL(GenericURLNormalizer.java:213)
    at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline$URLNormalizerStage.executeStage(HttpQueuePipeline.java:125)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeQueuePipeline(HttpCrawler.java:250)
    at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLsRegular(HttpCrawler.java:151)
    at com.norconex.collector.http.crawler.HttpCrawler.queueStartURLs(HttpCrawler.java:137)
    at com.norconex.collector.http.crawler.HttpCrawler.prepareExecution(HttpCrawler.java:128)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:216)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

essiembre commented 7 years ago

Are you using the new snapshot entirely? Some dependencies were updated that contribute to this fix, so you can't update just the collector jar.

aleha84 commented 7 years ago

I have downloaded the HTTP Collector snapshot zip. Which other dependencies are needed?

essiembre commented 7 years ago

Was it this one? norconex-collector-http-2.7.0-20170328.185911-15.zip

Did you overwrite a previous installation or start fresh? Maybe you have duplicate jars (same name, but different versions)? The zip has an updated version of this file: norconex-commons-lang-1.13.0-SNAPSHOT.jar, which you need (your error suggests you do not have the latest).

aleha84 commented 7 years ago

I see the problem. After unzipping the collector, I also installed the latest snapshot of the Elasticsearch committer, and it replaced the norconex-commons-lang-1.13.0-20170328.184247-17.jar file with an older norconex-commons-lang-1.13.0-SNAPSHOT.jar (dated 02.03.2017 23:48).

If I choose the manual file-override mode while installing the committer and check every lib, then it is OK, and there are no exceptions in the log.
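A conflict like this can be spotted by looking for jars in the lib folder that share an artifact name but differ in version. The sketch below is one possible check, not something shipped with the collector or committer. The class `DuplicateJarCheck`, the assumed `<artifact>-<version>.jar` naming convention, and the `commons-lang3-3.5.jar` sample entry are all hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex.
final class DuplicateJarCheck {

    // Assumed naming convention: <artifact>-<version>.jar, where the version starts with a digit.
    private static final Pattern JAR_NAME = Pattern.compile("(.+?)-(\\d[\\w.\\-]*)\\.jar");

    /** Groups jar file names by artifact and returns only artifacts present more than once. */
    static Map<String, List<String>> findConflicts(List<String> jarFileNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : jarFileNames) {
            Matcher m = JAR_NAME.matcher(name);
            if (m.matches()) {
                byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(name);
            }
        }
        // Keep only artifacts that appear under two or more file names.
        byArtifact.values().removeIf(files -> files.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) {
        List<String> names;
        if (args.length > 0 && new File(args[0]).list() != null) {
            // Use the directory given on the command line, e.g. the collector's lib folder.
            names = Arrays.asList(new File(args[0]).list());
        } else {
            // Sample data: the two file names from this thread plus one unrelated, made-up jar.
            names = Arrays.asList(
                    "norconex-commons-lang-1.13.0-20170328.184247-17.jar",
                    "norconex-commons-lang-1.13.0-SNAPSHOT.jar",
                    "commons-lang3-3.5.jar");
        }
        findConflicts(names).forEach((artifact, files) ->
                System.out.println(artifact + " has multiple versions: " + files));
    }
}
```

Running it against the lib folder after installing a committer would list any artifact that now exists in more than one version, so the older copy can be removed by hand.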

essiembre commented 7 years ago

Did you use the committer install script? If so, I will try to fix it to better handle these types of conflicts. I am otherwise closing since the original request has been implemented. Thanks for confirming.

aleha84 commented 7 years ago

Yes, the committer install script.

essiembre commented 7 years ago

FYI, the collectors and committer snapshot versions were updated so that their versioning is no longer timestamped, which should eliminate this issue in the future.