Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

ExtensionReferenceFilter over-zealously detects "extensions" in the middle of a path #2

Closed: niels closed this issue 8 years ago

niels commented 8 years ago

The current implementation of acceptReference treats whatever follows the last dot anywhere in the path (or perhaps even in the full URL?) as the file extension.

E.g. given a URL such as https://herimedia.com/norconex-test/this.is.not.a.file/test, the file extension is detected to be file/test.

I believe the correct implementation would only try to find a file extension within the last path segment, e.g. only within test.
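To make the proposal concrete, here is a minimal sketch of looking for the extension only within the last path segment. It is not the library's code: the class name LastSegmentExtension and the method extensionOf are hypothetical, and it assumes the reference is a syntactically valid URL that java.net.URI can parse.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of collector-core.
public final class LastSegmentExtension {

    // Returns the extension found in the last path segment, or "" if there is none.
    static String extensionOf(String url) {
        // getPath() excludes scheme, host, query string and fragment.
        String path = URI.create(url).getPath();
        if (path == null) {
            return ""; // opaque URI such as mailto:, no path to inspect
        }
        // Keep only what follows the last slash, e.g. "test" or "report.pdf".
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        // No dot, or a trailing dot, means there is no extension.
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return "";
        }
        return lastSegment.substring(dot + 1);
    }

    public static void main(String[] args) {
        // Dots in earlier path segments are ignored: prints an empty line.
        System.out.println(extensionOf(
                "https://herimedia.com/norconex-test/this.is.not.a.file/test"));
        // Dots in the query string are ignored too: prints "pdf".
        System.out.println(extensionOf("https://herimedia.com/docs/report.pdf?v=1.2"));
    }
}
```

Because only the path is inspected, dots in the host name or the query string can no longer be mistaken for an extension.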

If no one else claims this ticket, I will try to submit a patch late next week. As such, this serves as a reminder to myself :)

niels commented 8 years ago

FYI, I just verified that the "extension" is indeed looked for in the entire URL, including the host.

For example, if one wanted to block .com files, one would actually often also block .com domains (unless the path following that domain included a dot).
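The behavior described above can be reproduced with a few lines. This is a hypothetical reconstruction of the symptom (take everything after the last dot in the whole reference), not the actual acceptReference code; the class and method names are made up for illustration.

```java
// Hypothetical reproduction of the reported behavior; not the library's code.
public final class NaiveExtensionDemo {

    // Takes everything after the last dot anywhere in the reference.
    static String naiveExtension(String reference) {
        int dot = reference.lastIndexOf('.');
        return dot < 0 ? "" : reference.substring(dot + 1);
    }

    public static void main(String[] args) {
        // The top-level domain is mistaken for an extension: prints "com".
        System.out.println(naiveExtension("https://herimedia.com"));
        // A dot in a middle path segment wins: prints "file/test".
        System.out.println(naiveExtension(
                "https://herimedia.com/norconex-test/this.is.not.a.file/test"));
        // Only when the last dot is in the last segment is the result right: prints "pdf".
        System.out.println(naiveExtension("https://herimedia.com/docs/report.pdf"));
    }
}
```

With this logic, a filter meant to block .com files would also match a bare .com domain, since the string after the last dot is "com" in both cases.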

niels commented 8 years ago

Fixed in #4.

essiembre commented 8 years ago

I modified it slightly to better support non-URL references, since the Collector Core library is generic (not specific to the web).

Thanks for your contribution!
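For readers wondering what supporting non-URL references might involve, here is one possible sketch. It does not show the change that was actually made; the class GenericReferenceExtension, its method, and the fallback strategy (use the URI path when the reference has a host, otherwise treat the raw string as a path with either slash style) are all assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; the real collector-core change may differ.
public final class GenericReferenceExtension {

    // Returns the extension of the last segment of a URL or a plain path, or "".
    static String extensionOf(String reference) {
        String path = reference;
        try {
            URI uri = new URI(reference);
            // Only use the parsed path for references with a host, such as http(s) URLs.
            if (uri.getHost() != null && uri.getPath() != null) {
                path = uri.getPath();
            }
        } catch (URISyntaxException e) {
            // Not a valid URI (e.g. a Windows path): keep the raw string.
        }
        // Accept both forward slashes and backslashes as segment separators.
        int slash = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String lastSegment = path.substring(slash + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return "";
        }
        return lastSegment.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://herimedia.com/a/b.html?v=1.5")); // html
        System.out.println(extensionOf("C:\\data\\v1.2\\report.pdf"));           // pdf
        System.out.println(extensionOf("/var/data/archive.d/readme"));           // empty line
    }
}
```

The point of the fallback is that a file-system path or any other identifier still gets its last segment examined, even though it cannot be parsed as a web URL.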