elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

General purpose links extractors #135

Closed oltarasenko closed 5 months ago

oltarasenko commented 3 years ago

One of the problems I am constantly seeing is the need to extract new URLs from crawled pages. I am looking for a way to simplify this for myself and for other people as well.

I am thinking of writing code which will:

  1. take a page body,
  2. extract all links from it
  3. filter these extracted links by a list of patterns provided by us.

For the end-user it would mean: "I want my crawler to follow everything which contains "/blog" or "/product" on a given website", so you don't have to write request extractors by hand (which is time-consuming).

Of course, I understand that extracting all links from a page and filtering them is not ideal from a performance point of view. However, I would still want to have a helper like this.
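A minimal sketch of what such a helper could look like, assuming Floki is used for the HTML parsing (the module and function names here are illustrative, not an existing Crawly API):

    defmodule Crawly.LinkExtractor do
      @doc """
      Takes a page body, extracts all links from it and keeps only the ones
      matching one of the given patterns, e.g. ["/blog", "/product"].
      """
      def extract_links(body, patterns) do
        {:ok, document} = Floki.parse_document(body)

        document
        |> Floki.attribute("a", "href")
        |> Enum.filter(fn href -> String.contains?(href, patterns) end)
        |> Enum.uniq()
      end
    end

A spider could then turn the filtered URLs into requests, e.g. with Crawly.Utils.build_absolute_urls/2 followed by Crawly.Utils.requests_from_urls/1.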

Problems:

Any advice?

oltarasenko commented 3 years ago

@Ziinc in general I am quite close to the idea of bringing Floki back here... It can simplify these concerns quite a bit. On the one hand, I want to be independent; on the other hand, we can write quite a few pre-defined things:

  1. Automatic login form handling
  2. Automatic new link extraction
  3. Maybe automatic item extraction, etc.

Ziinc commented 3 years ago

We can use the dependency injection pattern to avoid adding a specific html parser as a dep.

On the dev side, we set Floki/Meeseeks as a dev dependency, and on the user side, the user has to define a module containing the callbacks that are required.

For example, if the user wants automatic link extraction using glob patterns, we can construct an xpath from the given glob pattern on the Crawly side and pass the final xpath to the ParserInterface.list_xpath/1 callback, where the user must set the reference to ParserInterface in the config.
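A rough sketch of the glob-to-xpath translation described above; the helper name and the exact xpath shape are assumptions, not a settled API:

    # Hypothetical helper on the Crawly side: turn a glob pattern such as
    # "/products/*" into an xpath matching links whose href contains the
    # fixed prefix of the pattern.
    defmodule Crawly.GlobToXpath do
      def convert(glob) do
        prefix = String.trim_trailing(glob, "*")
        ~s{//a[contains(@href, "#{prefix}")]}
      end
    end

Crawly would then only pass the resulting xpath string around; evaluating it against the page stays in the user-defined parser module.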

If much work is going to be done on these magic features, I think defining a protocol like Plug.Conn does would give tremendous benefits.

oltarasenko commented 3 years ago

Yes, I was thinking about it. It looks like it requires quite a lot of work to have adapters for the two parsers we have now (as their APIs are different, e.g. function names, XPath support, etc.). It sounds like a fair bit of work, and we would still need to add one of the backends.

I can play with something like Code.loaded?(Floki) to either allow using a parser or to raise an exception. However, I don't see the benefits compared to just including Floki in the list of deps.

Ziinc commented 3 years ago

The onus for managing the html parsing dep should be on the end user, as managing adapters for both libraries would be too much work on our side and too restrictive on the user side.

If we go with user-defined adapters, we won't have to manage conditional dep compilation, which seemed quite tricky and troublesome when I did a forum search. It also makes these features opt-in, and many people might not even use them.

oltarasenko commented 3 years ago

Sorry, I did not quite understand you.

Ziinc commented 3 years ago

I see three possible ways to implement such helpers:

1. Through a user-defined parsing interface that implements required parsing callbacks

    # User's config
    config :crawly,
      parser: MyHtmlParser
      # ...

    # Crawly source code
    defmodule Crawly.Extractors do
      def extract_urls(body, glob) do
        # Crawly has no html parser dependency; obtain the user-defined
        # parser module from config (parser: MyHtmlParser) and try to
        # extract urls with it
        parser = Crawly.Utils.get_settings(:parser)
        xpath = parse_glob_pattern(glob)

        # call the callback via MFA at runtime, return requests
        apply(parser, :list_xpath, [xpath])
        |> build_requests()
      end
    end

    # User's source code
    # User's app depends on Floki/Meeseeks or Jason/Poison
    defmodule MyHtmlParser do
      @behaviour Crawly.Parser

      @impl true
      def list_xpath(bla), do: Floki.parse(bla)

      @impl true
      def find_json(bla), do: Jason.parse(bla)
    end

    # User's spider
    defmodule MySpider do
      def parse_item(response) do
        requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
        [requests: requests]
      end
    end
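For completeness, a minimal sketch of what the Crawly.Parser behaviour referenced above might look like under option 1 (the callback names mirror the example; nothing like this exists in Crawly today):

    # Hypothetical behaviour implemented by the user-defined parser module.
    defmodule Crawly.Parser do
      @doc "Returns the values matching the given xpath on the current page."
      @callback list_xpath(String.t()) :: [term()]

      @doc "Decodes JSON found in a page body."
      @callback find_json(binary()) :: {:ok, term()} | {:error, term()}
    end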

Pros:

Cons:

2. Through Crawly-defined parsing interface that uses a Crawly-decided html parser

    # Crawly source code
    defmodule Crawly.Extractors do
      def extract_urls(body, glob) do
        # Crawly has a dependency on Floki, so use it directly.
        # Simplifying with example functions.
        glob
        |> parse_glob_pattern()
        |> Floki.find_urls_from_given_pattern()
        |> build_requests()
      end
    end

    # User's spider
    defmodule MySpider do
      def parse_item(response) do
        requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
        [requests: requests]
      end
    end

Cons:

3. Through Crawly-defined parsing interface that uses a user-decided html parser

    # User's config
    config :crawly,
      html_parser: Meeseeks,
      json_parser: Jason
      # ...

    # Crawly source code
    defmodule Crawly.Extractors do
      def extract_urls(body, glob) do
        # Crawly has no Meeseeks/Floki dependency, so we need to check which
        # html parser is available
        xpath = parse_glob_pattern(glob)

        # this may not actually compile on the user side, since they might not
        # have Floki/Meeseeks; we would need some conditional compilation magic
        # to ensure the code can compile
        cond do
          Code.loaded?(Floki) ->
            # floki specific code
            Floki.find_urls_from_given_pattern()

          Code.loaded?(Meeseeks) ->
            # meeseeks specific code
            Meeseeks.find_urls_from_given_pattern()

          true ->
            Logger.error("No supported html parser is provided and compiled.")
            []
        end
        |> build_requests()
      end
    end

    # User's spider
    defmodule MySpider do
      def parse_item(response) do
        requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
        [requests: requests]
      end
    end

Cons:

Ziinc commented 3 years ago

In my replies, I was talking about why option 1 is preferable compared to 2 and 3.

oltarasenko commented 3 years ago

Heh :(.

Actually, I don't want to force people to write any extra code, e.g. adapters or anything like that. In any case, the conversation was quite useful, as I think I will follow the hybrid idea.

So I see it done like this:

    if Code.loaded?(Floki) do
      do_extract_urls(page)
    else
      Logger.error("The general purpose extractor relies on Floki")
    end

I think it will be quite simple to start with. Then we can play a bit more with the idea of having built-in parsers.
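A minimal sketch of how that hybrid could be packaged, assuming Floki is declared as an optional dependency of Crawly and reusing the hypothetical Crawly.LinkExtractor sketch from the first comment (none of these names are the eventual Crawly implementation):

    defmodule Crawly.HybridExtractor do
      require Logger

      def extract_urls(page, patterns) do
        # Code.ensure_loaded?/1 also loads the module, so the check works even
        # before Floki has been touched anywhere at runtime.
        if Code.ensure_loaded?(Floki) do
          Crawly.LinkExtractor.extract_links(page, patterns)
        else
          Logger.error("The general purpose extractor relies on Floki; add it to your deps")
          []
        end
      end
    end

With Floki marked as optional: true in Crawly's mix.exs, projects that never call the helper would not pull Floki in at all.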

Ziinc commented 3 years ago

No issues with the hybrid approach; it is what quite a few frameworks use for handling JSON parsing (Phoenix, for example, off the top of my head).

I only worry about maintenance, like the what-if scenario where there are breaking API changes in a library between versions. Then we'd have to maintain two different pieces of code for one library, plus check the API version to know which piece of code to use.

oltarasenko commented 5 months ago

I will close this one, as it's been open for years, and no one has had time or need to lead the work.