@Ziinc in general I am quite close to the idea of bringing Floki back here... It can simplify these concerns quite a bit. On the one hand, I want to be independent; on the other hand, we can write quite a few pre-defined things:
We can use the dependency injection pattern to avoid adding a specific html parser as a dep.
On the dev side, we set Floki/Meeseeks as a dev dependency, and on the user side, the user has to define a module containing the required callbacks.
For example, if the user wants automatic link extraction using glob patterns, we can construct an xpath based on the given glob pattern on the Crawly side and pass the final xpath to the ParserInterface.list_xpath callback, where the user must set the reference to ParserInterface in the config.
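For illustration, a minimal glob-to-xpath conversion could look like this (just a sketch; parse_glob_pattern/1 is a hypothetical name, and only a trailing * is handled):

# turn "/products/*" into an xpath matching links under that prefix,
# e.g. //a[starts-with(@href, "/products/")]
def parse_glob_pattern(glob) do
  prefix = String.trim_trailing(glob, "*")
  ~s{//a[starts-with(@href, "#{prefix}")]}
end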
If much work is going to be done on these magic features, I think defining a protocol, like Plug.Conn does, would give tremendous benefits.
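As a sketch of that idea: a protocol would dispatch on the parsed document's type, so each parser library gets its own implementation. All names below are illustrative, not an existing Crawly API:

defprotocol Crawly.Document do
  @doc "Run a selector query against an already-parsed document."
  def query(document, selector)
end

# example implementation dispatching on %Meeseeks.Document{} structs
defimpl Crawly.Document, for: Meeseeks.Document do
  def query(document, selector), do: Meeseeks.all(document, selector)
end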
Yes, I was thinking about it. It looks like it requires quite a lot of work to have adapters for the two parsers we have now (as their APIs are different, e.g. function names, XPath support, etc.). And we would still need to add one of the backends.
I can play with something like Code.ensure_loaded?(Floki) to either allow using a parser or to raise an exception. However, I don't see the benefit compared to just including Floki in the list of deps.
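For reference, the API differences mentioned above look roughly like this (assuming each library is installed; Floki only supports CSS selectors, while Meeseeks also supports XPath):

# Floki: CSS selectors only
{:ok, doc} = Floki.parse_document(html)
Floki.find(doc, "a.product")

# Meeseeks: CSS and XPath selectors
import Meeseeks.XPath
Meeseeks.all(html, xpath("//a[@class='product']"))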
The onus of managing the html parsing dep should be on the end user, as managing adapters for both libraries would be too much work on our side and too restrictive on the user side.
If we go with user-defined adapters, we won't have to manage conditional dep compilation, which seems quite tricky and troublesome judging by a forum search. It also makes these helpers an opt-in feature, which many people might not even use.
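A minimal sketch of the behaviour such a user-defined adapter would implement (callback names match the option 1 example below and are illustrative only):

defmodule Crawly.Parser do
  @callback list_xpath(body :: binary(), xpath :: binary()) :: [binary()]
  @callback find_json(body :: binary()) :: {:ok, term()} | {:error, term()}
end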
Sorry, I didn't quite understand you.
I see three possible ways to implement such helpers:
# User's config
config :crawly,
  parser: MyHtmlParser

# ...

# Crawly source code
defmodule Crawly.Extractors do
  def extract_urls(body, glob) do
    # Crawly has no html parser dependency: obtain the user-defined module
    # from the config (parser: MyHtmlParser) and try to extract urls
    parser = Crawly.Utils.get_settings(:parser)
    xpath = parse_glob_pattern(glob)

    # call the callback via apply/3 at runtime, then turn the result into requests
    apply(parser, :list_xpath, [body, xpath])
    |> build_requests()
  end
end
# User's source code
# User's app depends on Floki/Meeseeks or Jason/Poison
defmodule MyHtmlParser do
  @behaviour Crawly.Parser

  @impl true
  # placeholder implementation; note that Floki itself has no xpath support
  def list_xpath(body, _xpath), do: Floki.parse_document!(body)

  @impl true
  def find_json(body), do: Jason.decode(body)
end
# User's spider
defmodule MySpider do
  def parse_item(response) do
    requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
    [requests: requests]
  end
end
Pros:
Cons:
# Crawly source code
defmodule Crawly.Extractors do
  def extract_urls(body, glob) do
    # Crawly depends on Floki directly, so the Floki-related work happens
    # here (simplified with example function names)
    xpath = parse_glob_pattern(glob)

    body
    |> Floki.find_urls_from_given_pattern(xpath)
    |> build_requests()
  end
end
# User's spider
defmodule MySpider do
  def parse_item(response) do
    requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
    [requests: requests]
  end
end
Cons:
config :crawly,
  html_parser: Meeseeks,
  json_parser: Jason

# ...

# Crawly source code
defmodule Crawly.Extractors do
  require Logger

  def extract_urls(body, glob) do
    # Crawly depends on neither Meeseeks nor Floki, so we need to check
    # which html parser is available
    xpath = parse_glob_pattern(glob)

    # this may not actually compile cleanly on the user side, since they
    # might not have Floki/Meeseeks; some conditional compilation magic
    # would be needed to ensure the code compiles
    cond do
      Code.ensure_loaded?(Floki) ->
        # floki-specific code
        Floki.find_urls_from_given_pattern(body, xpath)

      Code.ensure_loaded?(Meeseeks) ->
        # meeseeks-specific code
        Meeseeks.find_urls_from_given_pattern(body, xpath)

      true ->
        Logger.error("No supported html parser is provided and compiled.")
        []
    end
    |> build_requests()
  end
end

# User's spider
defmodule MySpider do
  def parse_item(response) do
    requests = Crawly.Extractors.extract_urls(response.body, "/products/*")
    [requests: requests]
  end
end
Cons:
In my replies, I was talking about why option 1 is preferable compared to options 2 and 3.
Heh :(.
Actually I don't want to force people to write any extra code, e.g. adapters or anything like that. In any case, the conversation was quite useful, and I think I will follow the hybrid idea.
So I see it done like this:
if Code.ensure_loaded?(Floki) do
  do_extract_urls(page)
else
  Logger.error("The general-purpose extractor relies on Floki")
end
I think it will be quite simple to start with. Then we can play a bit more with the idea of having built-in parsers.
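A slightly fuller sketch of that hybrid idea, assuming Floki is declared as an optional dependency on the Crawly side (do_extract_urls/1 and the version constraint are illustrative, not existing Crawly code):

# mix.exs (Crawly side)
defp deps do
  [
    {:floki, "~> 0.33", optional: true}
  ]
end

# Calls to a missing module compile with a warning in Elixir and only fail
# at runtime, so the ensure_loaded? guard is enough to degrade gracefully.
defmodule Crawly.Utils.Extractor do
  require Logger

  def extract_urls(page) do
    if Code.ensure_loaded?(Floki) do
      do_extract_urls(page)
    else
      Logger.error("The general-purpose extractor relies on Floki; add it to your deps")
      []
    end
  end

  defp do_extract_urls(page) do
    {:ok, document} = Floki.parse_document(page)
    Floki.attribute(document, "a", "href")
  end
end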
No issues with the hybrid approach; it is what quite a few frameworks use for handling JSON parsing (Phoenix, for example, off the top of my head).
I only worry about maintenance, like the what-if scenario where there are breaking API changes in a library between versions. Then we'd have to maintain two different pieces of code for one library, plus check the API version to know which piece of code to use.
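To make that concrete: Floki itself deprecated parse/1 in favour of parse_document/1, so hybrid code that wants to support both versions ends up probing what is exported, roughly like this:

# pick the parsing call based on which function the installed Floki exports
# (assumes the module is already loaded, e.g. via Code.ensure_loaded?/1)
if function_exported?(Floki, :parse_document, 1) do
  Floki.parse_document(body)   # newer Floki returns {:ok, tree}
else
  {:ok, Floki.parse(body)}     # older Floki returned the tree directly
end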
I will close this one, as it's been open for years, and no one has had time or need to lead the work.
One of the problems I constantly see is the need to extract new URLs, and I am looking for a way to simplify it for myself and for other people as well.
I am thinking of writing code which will:
For the end-user it would mean: "I want my crawler to follow everything which has /blog or /product on a given website." So you don't have to write request extractors (which is time-consuming).
Of course, I understand that extracting all links from a page and filtering them is not ideal from a performance point of view. However, I would still want to have a helper like this.
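A sketch of what such a helper could look like (all names hypothetical): extract every link on the page, then keep only those under the requested path prefixes:

defmodule UrlHelper do
  # extract all hrefs, then keep only those matching the given prefixes
  def extract_matching_urls(body, prefixes) do
    {:ok, document} = Floki.parse_document(body)

    document
    |> Floki.attribute("a", "href")
    |> Enum.filter(fn href ->
      Enum.any?(prefixes, &String.starts_with?(href, &1))
    end)
  end
end

# usage: UrlHelper.extract_matching_urls(response.body, ["/blog", "/product"])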
Problems:
Any advice?