Closed safwank closed 12 months ago
Heya, if I understand your requirement correctly, you don't need to "track" the URLs. The filter is called on every URL. You pass in a module because the filter needs to conform to the spec/behaviour: https://github.com/fredwu/crawler/blob/f9426764c793a480b5f2cef45261d42ee85fc6a1/lib/crawler/fetcher/url_filter.ex
I think I haven't done a good job of explaining the situation.
What I'm trying to do is create a custom filter that limits the URLs to a specific domain, e.g. example.com. I know I can do the following:
    defmodule CustomFilter do
      @behaviour Crawler.Fetcher.UrlFilter.Spec

      def filter("http://example.com" <> _, _opts), do: {:ok, true}
      def filter(_, _), do: {:ok, false}
    end

    Crawler.crawl("http://example.com", url_filter: CustomFilter)
But what if I want to crawl a different domain (some_domain) without hardcoding it, e.g.
    defmodule CustomFilter do
      @behaviour Crawler.Fetcher.UrlFilter.Spec

      def filter(url, _opts) do
        if String.starts_with?(url, some_domain) do
          {:ok, true}
        else
          {:ok, false}
        end
      end
    end

    Crawler.crawl(some_domain, url_filter: CustomFilter)
Also, from what I've seen, the opts.url option that gets passed to filter/2 is the same as the first argument (url) passed to the same function. I was expecting the former to match the url passed to Crawler.crawl/1, but I was wrong.
I hope I'm making more sense now :).
@safwank I need something similar and found that this works well:
    def filter(url, opts) do
      res =
        cond do
          String.starts_with?(url, ".") -> true
          String.starts_with?(url, "/") -> true
          Map.get(opts, :referrer_url) === nil -> true
          true -> URI.parse(opts.referrer_url).host == URI.parse(url).host
        end

      {:ok, res}
    end
In my case, I only filter for urls that share the same domain as the original crawl(
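The host comparison in the snippet above can be pulled out into a small helper. The sketch below is only an illustration: `SameHost` and `same_host?/2` are hypothetical names, not part of Crawler, and the code relies solely on `URI.parse/1` from the standard library. It assumes that any URL without a host is a relative link and should pass, which is slightly broader than the `"."` and `"/"` prefix checks above.

```elixir
defmodule SameHost do
  # Hypothetical helper, not part of the Crawler library.
  # Returns true when `url` has no host (treated here as a relative link)
  # or when its host equals the host of `referrer_url`.
  def same_host?(url, referrer_url) do
    case URI.parse(url).host do
      nil -> true
      host -> host == URI.parse(referrer_url).host
    end
  end
end

# Relative link: no host, so it is let through.
SameHost.same_host?("/about", "http://example.com/")
# Same host as the referrer.
SameHost.same_host?("http://example.com/a", "http://example.com/b")
# Different host.
SameHost.same_host?("http://other.org/", "http://example.com/")
```

A filter/2 implementation could then return `{:ok, SameHost.same_host?(url, opts.referrer_url)}` once it has checked that `:referrer_url` is present in `opts`.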
I'm trying to create a custom UrlFilter that lets through all URLs that start with the base URL passed to Crawler.crawl/2. I know I can use Registry or Agent to track this value, but is there a better way? The url_filter option only accepts a module but not a fun.
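For reference, the Agent approach mentioned here might look roughly like the sketch below. `BaseUrlStore` and `BaseUrlFilter` are hypothetical names made up for this example, and the filter is called directly rather than through Crawler so that the snippet runs on its own. In a real project the filter module would also declare `@behaviour Crawler.Fetcher.UrlFilter.Spec`, as in the other snippets in this thread.

```elixir
defmodule BaseUrlStore do
  # Hypothetical helper: keeps the base URL of the current crawl in an Agent.
  use Agent

  def start_link(base_url) do
    Agent.start_link(fn -> base_url end, name: __MODULE__)
  end

  # Returns the stored base URL.
  def get, do: Agent.get(__MODULE__, & &1)
end

defmodule BaseUrlFilter do
  # Lets a URL through only if it starts with the stored base URL.
  def filter(url, _opts) do
    {:ok, String.starts_with?(url, BaseUrlStore.get())}
  end
end

# Store the base URL before starting the crawl.
{:ok, _pid} = BaseUrlStore.start_link("http://example.com")

# Direct call for illustration; with Crawler this would be something like
# Crawler.crawl("http://example.com", url_filter: BaseUrlFilter)
BaseUrlFilter.filter("http://example.com/page", %{})
```

Because the Agent is registered under a single name, this sketch only supports one base URL at a time; crawling several domains concurrently would need a keyed store such as Registry.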