commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Add param to fast.urlfilter to filter based on length of the URL #27

Closed jnioche closed 10 months ago

jnioche commented 11 months ago

see https://github.com/commoncrawl/ia-web-commons/issues/32

We should already have the conf for it:

urlfilter.fast.url.path.max.length = 1024
urlfilter.fast.url.pathquery.max.length = 2048

This needs to be implemented upstream in Apache Nutch and then merged back into this fork.
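For illustration, the kind of check these two properties imply can be sketched as below. This is a minimal sketch and not the code from the Nutch PR: the class name `UrlLengthFilterSketch`, the method `accept`, the convention that a negative limit disables a check, and the choice to reject unparseable URLs are all assumptions made for the example. The limits 1024 and 2048 are the values quoted above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of a URL length filter; not the actual urlfilter-fast code.
public class UrlLengthFilterSketch {

    /**
     * Returns true if the URL passes both length checks.
     * Assumption: a negative limit disables the corresponding check.
     */
    public static boolean accept(String url, int maxPathLength, int maxPathQueryLength) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // assumption: unparseable URLs are rejected
        }
        // Opaque URIs (e.g. mailto:) have no path; treat it as empty.
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String query = uri.getRawQuery();

        int pathLength = path.length();
        // Path plus the '?' separator plus the query, when a query is present.
        int pathQueryLength = pathLength + (query == null ? 0 : 1 + query.length());

        if (maxPathLength >= 0 && pathLength > maxPathLength) {
            return false;
        }
        if (maxPathQueryLength >= 0 && pathQueryLength > maxPathQueryLength) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Build a URL whose path is longer than 1024 characters.
        String longPath = "http://example.com/" + new String(new char[1100]).replace('\0', 'a');
        System.out.println(accept("http://example.com/index.html?q=1", 1024, 2048));
        System.out.println(accept(longPath, 1024, 2048));
    }
}
```

Measuring the raw (still percent-encoded) path and query keeps the check tied to the URL as it would be fetched; whether the real filter measures raw or decoded lengths is not stated in this thread.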

jnioche commented 11 months ago

See https://www.baeldung.com/cs/max-url-length

jnioche commented 11 months ago

https://issues.apache.org/jira/browse/NUTCH-3025

jnioche commented 11 months ago

https://github.com/apache/nutch/pull/796

jnioche commented 11 months ago

The PR uses urlfilter.fast.url.query.max.length instead of urlfilter.fast.url.pathquery.max.length

jnioche commented 11 months ago

Assigned to @sebastian-nagel to review the PR in Nutch land, merge it if appropriate, and then merge the change back into our cc-nutch repo