fredwu / crawler

A high performance web crawler / scraper in Elixir.

customization strategy #2

Closed sensiblearts closed 7 years ago

sensiblearts commented 7 years ago

Hi Fred, could I get some advice? I looked at your code, and here's my understanding so far:

Every new url to be parsed gets a new child Worker spun up; you then cast an opts message to the worker, and the worker gets the url (and other options) from the opts.

The worker uses the url to |> fetch |> parse

In fetch, if the Policer validity test passes and the Recorder successfully stores the url of the page we're about to fetch, the Fetcher retrieves the page.

In parse, the parser finds all the links in the body; at each link-find event ("OnLink," so to speak) the link_handler function is invoked, and by default link_handler == Dispatcher.dispatch, which in turn calls Crawler.crawl (recursively).
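
In other words, the worker pipeline is roughly (my annotation, simplified from worker.ex):

 def handle_cast(_req, state) do
   state
   |> Fetcher.fetch          # Policer check, Recorder.store_page, then the HTTP fetch
   |> Parser.parse           # find the links; each one goes to link_handler,
                             # which defaults to Dispatcher.dispatch/2 -> Crawler.crawl
   |> Parser.mark_processed

   {:noreply, state}
 end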

Now, thinking about my use cases (more later), for a given root domain I will be going 4 layers deep, but I will need different logic at each layer; and, for different root domains the logic will be different. By different "logic," I mean (I think) different Parser and Fetcher logic (how to find the link(s) for the next level, what to find in the body and record/persist).

My first thought of how I might do this is by adding a field to the opts keys. I.e., when Crawler.crawl pipes to |> Options.assign_url(url), I could use pattern matching on the url in the assign_url function to add (depending on root domain and parse depth -- the current state) new opts fields -- :parser_strategy and :recorder_strategy.

These new opts fields could hold values that correspond to states of a Fetcher (or Recorder) state machine and a Parser state machine. I.e., in fetch_url_200, where Recorder.store_page is called, that call will invoke one of many (pattern matched) functions; so, a lot of my domain-specific, level-specific logic would be implemented in Recorder.store_page, based on opts[:recorder_strategy].

In summary, if I followed what I describe above, the places I see where I would be adding my customization would be in:

Options.assign_url (Use business logic to set the :recorder_strategy and :parser_strategy states based on the previous state and the url. Hence, the "OnLink" events are FSM transitions.)

Recorder.store_page (Call one of multiple versions of do_store_page(), depending on the :recorder_strategy state; sketched after this list)

Parser.parse_links (Call one of multiple versions of do_parse_links(), depending on :parser_strategy state)
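
To illustrate the Recorder piece, a hypothetical sketch (the strategy names and the extra opts argument are made up):

  def store_page(url, body, opts) do
    do_store_page(opts[:recorder_strategy], url, body)
  end

  # Domain- and level-specific persistence for one strategy state.
  defp do_store_page(:steam_app_page, url, body) do
    {:ok, {url, byte_size(body)}}
  end

  # Fallback: the existing default behaviour.
  defp do_store_page(_strategy, url, body) do
    {:ok, {url, body}}
  end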


Thoughts? Thanks, David


Also, FYI: to get it to build on Windows I had to bump :httpoison to "~> 0.13".
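
i.e., in mix.exs:

  defp deps do
    [
      {:httpoison, "~> 0.13"}
      # ...other deps unchanged
    ]
  end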

sensiblearts commented 7 years ago

I tried essentially what I described above, and it looks like it will work, but I'll keep thinking about the design patterns involved. (I'm new to functional programming.)

First, I added a new option:

  def assign_defaults(opts) do
    Keyword.merge([
      depth:      0,
      max_depths: max_depths(),
      timeout:    timeout(),
      save_to:    save_to(), 
      strategy_depth: 0       # added, D.A.
    ], opts)
  end

and I use this strategy_depth (and another opt added later) to determine the parsing and recording logic. I use strategy_depth rather than the existing depth because the "logical" depth might not increase at the same rate as the "physical" depth, so to speak. Here's the addition of the custom strategy:

# Parser.ex:

  def parse_links(body, opts, link_handler) do
    body
    |> Floki.find("a")
    |> find_links_per_strategy(opts)   # added
    |> Enum.map(&parse_link(&1, opts, link_handler))
  end

  def find_links_per_strategy(links, opts) do
    links
    |> do_find_links(%{domain_strategy: opts[:domain_strategy], strategy_depth: opts[:strategy_depth]})
  end

# Fetcher.ex:

  defp fetch_url_200(body, opts) do
    with {:ok, _} <- Recorder.store_page(opts[:url], body),
         {:ok, _} <- Recorder.save_pageinfo_per_strategy(opts, body),  # added
         {:ok, _} <- snap_page(body, opts)
    do
      return_page(body, opts)
    end
  end

# Recorder.ex:

  def save_pageinfo_per_strategy(opts, body) do
    do_save_pageinfo_per_strategy(
      %{domain_strategy: opts[:domain_strategy], strategy_depth: opts[:strategy_depth]}, body)
  end

Note that I have to convert the keyword opts list to a map so that the arguments pattern match correctly. (For passing user-defined customization state in the opts, would it be better to use a map rather than a keyword list for opts?)
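
(Concretely: a map pattern matches on any subset of keys, whereas a keyword list only matches as a plain list, where order and length matter:)

  iex> %{a: 1} = %{a: 1, b: 2}   # map pattern: a subset of keys is enough
  %{a: 1, b: 2}
  iex> [a: 1] = [a: 1, b: 2]     # keyword list: the whole list must match
  ** (MatchError) no match of right hand side value: [a: 1, b: 2]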

Example implementation functions:

  def do_save_pageinfo_per_strategy(%{domain_strategy: "steampowered.com", strategy_depth: 2}, body) do
    IO.puts "++++++++++++ DESCRIPTION ++++++++++++++++"
    result =
      body
      |> Floki.find("div.game_description_snippet")
      |> IO.inspect
    {:ok, result}
  end

and

  def do_find_links(links, %{domain_strategy: "steampowered.com", strategy_depth: 1}) do
    links
    |> Enum.filter(fn {"a", atts, _} ->
      {_, href} = List.keyfind(atts, "href", 0)
      Regex.match?(~r/^http:\/\/store.steampowered.com\/app\/\d*\/[^?].*/, href)
    end)
    |> Enum.map(fn {"a", atts, other} ->
      {_, href} = List.keyfind(atts, "href", 0)
      new_atts = List.keyreplace(atts, "href", 0, {"href", URI.parse(href) |> Map.put(:query, nil) |> URI.to_string})
      {"a", new_atts, other}
    end)
  end

I'll keep pushing in this direction. Also, it seems to me that all the "do_"-named implementation functions above could live in a single module per customization; there must be a functional design pattern for this.

I'm new to Elixir and functional programming, so this may be crappy code. But at least it should clarify my particular use case for you.

Cheers.

fredwu commented 7 years ago

Hi David,

Whilst this approach might work, given how early-stage the project is and how deeply these custom methods are embedded, it will become very difficult to keep your custom version up to date.

Instead of hooking up at multiple places, how about using a custom link handler? e.g.

In worker.ex:

 def handle_cast(_req, state) do
   state
   |> Fetcher.fetch
-  |> Parser.parse
+  |> Parser.parse(&YourCustomLinkHandler.handle_link/2)
   |> Parser.mark_processed

   {:noreply, state}
 end

Given this is much higher level, you can do a lot in your link handler, including your own logic as well as to delegate to the default Dispatcher.dispatch/2 for recursively fetching pages. This way your custom logic is fully encapsulated.

The link handler will accept two arguments: the url and the metadata opts. You can pattern match on the url, and you can get information such as the crawling depth and referrer url from opts.
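
For example (a sketch only; the exact opts key names may differ, so check the source):

 defmodule YourCustomLinkHandler do
   # Pattern match on the url prefix; :depth and :referrer_url are assumed key names.
   def handle_link("http://store.steampowered.com/app/" <> _ = url, opts) do
     IO.inspect({url, opts[:depth], opts[:referrer_url]})
   end

   # Everything else falls through to the default recursive crawl.
   def handle_link(url, opts) do
     Crawler.Dispatcher.dispatch(url, opts)
   end
 end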

I also recommend having your own Registry / database to record any metadata, so that they don't interfere with the main Crawler.Store.DB, or at least namespace your own metadata.
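
For example, even a small ETS table would do (a sketch; the table name is just an example):

 # A separate ETS table for your own metadata, leaving Crawler.Store.DB untouched.
 :ets.new(:my_crawler_meta, [:set, :public, :named_table])

 url = "http://store.steampowered.com/app/123/"
 :ets.insert(:my_crawler_meta, {url, %{depth: 2, referrer: "http://store.steampowered.com/"}})

 # Look the metadata up later by url.
 case :ets.lookup(:my_crawler_meta, url) do
   [{^url, meta}] -> meta
   [] -> nil
 end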

Hope that makes things a lot easier for you. :)

sensiblearts commented 7 years ago

UPDATE: Ignore this. See next comment.

Thanks. I'll try what you describe.

The reason I did it my way (above) was that I thought I would lose context and the ability to associate referrer-page content with linked-page content.

Now I realize that, in a custom link handler, I can just query the Store.DB with the referrer url to get the body that was the context of the current link being handled.
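
Something like this, I imagine (untested; I'm guessing at both the opts key name and the Store API's return shape):

  {:ok, referrer_page} = Crawler.Store.find(opts[:referrer_url])
  referrer_page.body   # the html that was the context of the current link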

However, if I do that, I'll be calling the link handler multiple times (e.g., if there are 50 embedded links), all with the same referrer url, so I'll have to keep track of whether that referrer body has already been processed (so I don't do the same thing 50 times). Might this be better?:

 def handle_cast(_req, state) do
   plugin_name = get_plugin_name_from_url(state[:url])
   plugin = String.to_existing_atom("Elixir.Crawler.Strategies.#{plugin_name}")
   # or you could just hard-code the plugin module name.

   state
   |> Fetcher.fetch
   |> plugin.handle_body
   |> Parser.parse(&plugin.handle_link/2)
   |> Parser.mark_processed

   {:noreply, state}
 end

or this:

 def handle_cast(_req, state) do
   plugin_name = get_plugin_name_from_url(state[:url])
   plugin = String.to_existing_atom("Elixir.Crawler.Strategies.#{plugin_name}")
   # or you could just hard-code the plugin module name.

   state
   |> Fetcher.fetch(&plugin.handle_body/2)
   |> Parser.parse(&plugin.handle_link/2)
   |> Parser.mark_processed

   {:noreply, state}
 end

where the handle_body function could be passed down to:

  defp fetch_url_200(body, opts, handle_body) do
    with {:ok, _} <- Recorder.store_page(opts[:url], body),
         {:ok, _} <- handle_body.(opts, body),
         {:ok, _} <- snap_page(body, opts)
    do
      return_page(body, opts)
    end
  end

(Aside: how did you generate the red/green diff highlighting above?)

sensiblearts commented 7 years ago

So I implemented the above like this:

  def handle_cast(_req, state) do
    plugin = Crawler.Plugin.get_plugin_for_url(state[:url])

    state
    |> Fetcher.fetch
    |> plugin.handle_body
    |> Parser.parse(&plugin.handle_link/2)
    |> Parser.mark_processed

    {:noreply, state}
  end

where the plugin behavior is defined:

defmodule Crawler.Plugin do
  @callback handle_link(link :: tuple, opts :: keyword) :: any

  @callback handle_body(state :: map) :: any

  def get_plugin_for_url(url) do
    plugin =
      cond do
        # Regex.match?(~r/^http:\/\/mycustom.domain.com/, url) -> "MyPlugin"
        true -> "Default"
      end

    String.to_existing_atom("Elixir.Crawler.Plugin.#{plugin}")
  end
end

where, if you haven't defined a plugin for a url, a default (that does nothing) is used:

defmodule Crawler.Plugin.Default do
  @behaviour Crawler.Plugin

  def handle_body(state) do
    do_handle_body(state, state.opts[:depth])
  end

  def handle_link(nil, _opts) do
    # TODO: fixme
    IO.puts "nil link"
  end

  def handle_link(request, opts) do
    # The request arrives in one of two tuple shapes; extract the url either way.
    url =
      case request do
        {_, _link, _, url} -> url
        {_, url} -> url
      end

    if a_url = url_to_handle(url, opts[:depth]) do
      Crawler.crawl(a_url, opts)
    end
  end

  defp do_handle_body(state, _depth) do
    IO.inspect state.page.body
    state
  end

  defp url_to_handle(url, _depth) do
    IO.inspect url
    url
  end
end

sensiblearts commented 7 years ago

It occurs to me that "handle_body" is really just (non-crawl-navigation) parsing, so perhaps a better interface is one of the following. (I'm learning Elixir, so I'm having fun experimenting.)

Any of these are working in my code:

    |> Fetcher.fetch
    |> plugin.handle_body
    |> Parser.parse(&plugin.handle_link/2)
    |> Parser.mark_processed

or

    |> Fetcher.fetch
    |> Parser.parse(&plugin.handle_link/2, &plugin.handle_body/1)
    |> Parser.mark_processed

or

    |> Fetcher.fetch
    |> Parser.parse(plugin)
    |> Parser.mark_processed

Personally, the last one seems cleanest.

The pattern matching to allow any of the above to do the same thing:

  def parse(page, link_handler \\ &Dispatcher.dispatch(&1, &2))

  def parse(%{page: page, opts: opts}, link_handler) when is_function(link_handler) do
    parse_links(page.body, opts, link_handler)
    page
  end

  def parse(%{page: page, opts: opts}, plugin) do
    plugin.handle_body(%{page: page, opts: opts})
    parse_links(page.body, opts, &plugin.handle_link/2)
    page
  end

  def parse(%{page: page, opts: opts}, link_handler, body_handler) do
    body_handler.(%{page: page, opts: opts})
    parse_links(page.body, opts, link_handler)
    page
  end

Thoughts?

fredwu commented 7 years ago

I still think the cleanest approach is to simply pass in a custom link handler, and nothing else. Given that your requirement is to handle links as well as run your own parsing logic, you can implement your own parser and have it called by your link handler. This way the only hook into Crawler's flow is your link handler, which makes updates easier.

So basically, as suggested before, you can hook up your own link handler in worker.ex:

 def handle_cast(_req, state) do
   state
   |> Fetcher.fetch
-  |> Parser.parse
+  |> Parser.parse(&YourCustomLinkHandler.handle_link/2)
   |> Parser.mark_processed

   {:noreply, state}
 end

Then your custom link handler and parser could be something like this:

defmodule YourCustomLinkHandler do
  def handle_link(url, opts) do
    with {:ok, opts} <- link_needs_custom_parsing(url, opts),
         {:ok, page} <- Crawler.Store.find(url)
    do
      YourCustomParser.parse(page, opts)
    else
      # if a link doesn't need custom parsing, then move on
      _ -> Crawler.Dispatcher.dispatch(url, opts)
    end
  end

  defp link_needs_custom_parsing(url, opts) do
    # determine whether the link needs custom parsing
  end
end

defmodule YourCustomParser do
  def parse(page, opts) do
    # your own parsing logic
    # 
    # when you're done parsing, and require crawling further links,
    # call `Crawler.Dispatcher.dispatch/2`
    # 
    # just make sure you return `page` at the end so
    # `Crawler.Parser.mark_processed/1` can mark it
  end
end
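
And link_needs_custom_parsing/2 could be as simple as (a made-up rule, reusing your steampowered.com case):

  defp link_needs_custom_parsing(url, opts) do
    # Made-up rule: only Steam app pages get the custom parser.
    if String.contains?(url, "store.steampowered.com/app/") do
      {:ok, opts}
    else
      {:error, :no_custom_parsing}
    end
  end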

Does that make sense? :)


P.S. When you use ``` to quote blocks of code, you can use ```diff or ```elixir, etc. for syntax highlighting.

sensiblearts commented 7 years ago

That makes sense, given one assumption: that the very first page fetched does not need custom parsing. In my case, it does.

But in reading your reply, I realized an obvious solution using just the link handler.

Instead of starting my client with an initial call to Crawler.crawl..., I could start it off with an initial fetch in a new Crawler.first_fetch():

defmodule Crawler do
  # ...
  def first_fetch(url) do
    opts = [] |> Options.assign_defaults |> Options.assign_url(url)
    page = Fetcher.fetch(opts)
    YourCustomParser.parse(page, opts)
  end
end

The link_handler would take over from there, and call crawl all the way down.

Make sense?

Or, as I'm typing, I realize another, simpler solution: create a dummy html text with my root url in it, and crawl the dummy.
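
i.e., something like this (a rough sketch; MyPlugin stands in for my real plugin, and I'm reusing the parse_links signature from earlier in this thread):

  # A dummy body whose only link is the real root url.
  dummy_body = ~s(<html><body><a href="http://store.steampowered.com/">root</a></body></html>)
  opts = [] |> Options.assign_defaults |> Options.assign_url("http://example.com/dummy")
  Parser.parse_links(dummy_body, opts, &MyPlugin.handle_link/2)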

I guess either of the above will work. Which would you suggest?

Either way, I'll use just the link_handler as you originally suggested. I guess you can close this.

Thank you for your patience! Having a lot of fun learning Elixir.

fredwu commented 7 years ago

I see, yeah in that case I'd probably prefer the second approach so you have less code to write. :P

Glad to help, please feel free to re-open this issue if you've got more questions.