elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Advice wanted, usage with oban #179

Closed: bapti closed this 3 years ago

bapti commented 3 years ago

Hi, I'm using Crawly and I want to trigger crawls from Oban jobs. I'm thinking of wrapping the call that starts a spider in a Task; is this the right approach? It's mainly so I could do something like awaiting the spider within the scheduled job. I've read through the docs as best I can, but I'm a bit of a novice when it comes to these parts of Elixir, so I thought I'd ask.

Thanks for any help!

bapti commented 3 years ago

Hi, I tried the following to see if it would work, but it doesn't seem to. For now I'm just triggering the crawls periodically and not utilising any of Oban's job logic, purely using it as a cron mechanism for starting the crawls.

def perform(%Oban.Job{args: _args}) do
  pid = self()

  # Ask Crawly to notify this process when the spider stops.
  Crawly.Engine.start_spider(Spiders.MySpider,
    on_spider_closed_callback: fn _spider_name, _crawl_id, reason ->
      IO.inspect("Added crawly callback executed")
      send(pid, {:crawly_finished, reason})
      :ok
    end
  )

  # Block the Oban job until the callback reports that the crawl is done.
  receive do
    {:crawly_finished, reason} ->
      IO.inspect("Crawl finished #{reason}")
      reason
  end

  :ok
end
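
For completeness, a variant of the same idea that bounds the wait with a receive/after clause, so the job returns an error (and can be retried by Oban) instead of blocking forever if the callback never fires. This is only a sketch: it assumes the on_spider_closed_callback option behaves as intended, and the 30-minute limit is an arbitrary placeholder.

def perform(%Oban.Job{args: _args}) do
  pid = self()

  Crawly.Engine.start_spider(Spiders.MySpider,
    on_spider_closed_callback: fn _spider_name, _crawl_id, reason ->
      send(pid, {:crawly_finished, reason})
      :ok
    end
  )

  receive do
    {:crawly_finished, _reason} -> :ok
  after
    # Hypothetical upper bound on how long a crawl may take.
    :timer.minutes(30) -> {:error, :crawl_timeout}
  end
end
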
Ziinc commented 3 years ago

This method would mean that your Oban job waits until the crawl finishes, which might take quite a while. It isn't the best way, but it's simple enough. I'm not sure whether there's a timeout for Oban jobs, but this solution looks fine from my point of view.
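
On the timeout question: Oban workers can define the optional timeout/1 callback (the default is :infinity), so a long-running perform/1 like the one above can be capped explicitly. A minimal sketch; the worker name, queue name, and 30-minute figure are placeholders, not anything from this thread.

defmodule MyApp.CrawlWorker do
  use Oban.Worker, queue: :crawls

  # Optional Oban.Worker callback: fail the job if it runs longer than this.
  @impl Oban.Worker
  def timeout(_job), do: :timer.minutes(30)

  @impl Oban.Worker
  def perform(%Oban.Job{} = _job) do
    # ... start the spider and wait for the callback, as in the snippet above ...
    :ok
  end
end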

Another possibility is to update the job status after the crawl using the callback: complete the Oban job right away and mark the crawl as incomplete, then update it to complete once the callback fires. That means storing the status of the crawl in your database, or something along those lines.
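
A minimal sketch of that approach, assuming an Ecto-backed app. MyApp.Repo, a hypothetical MyApp.Crawl schema with a :status field, and the "crawl_id" job argument are all placeholders: the Oban job marks the crawl as in progress, kicks off the spider, and returns immediately; the callback flips the row to completed later.

def perform(%Oban.Job{args: %{"crawl_id" => crawl_id}}) do
  # Hypothetical schema: MyApp.Crawl with a :status field.
  crawl = MyApp.Repo.get!(MyApp.Crawl, crawl_id)

  crawl =
    crawl
    |> Ecto.Changeset.change(status: "crawling")
    |> MyApp.Repo.update!()

  Crawly.Engine.start_spider(Spiders.MySpider,
    on_spider_closed_callback: fn _spider_name, _run_id, _reason ->
      # Runs later, from Crawly's side, once the spider stops.
      crawl
      |> Ecto.Changeset.change(status: "completed")
      |> MyApp.Repo.update!()

      :ok
    end
  )

  # The Oban job itself finishes immediately; the crawl's progress
  # is tracked in the database instead.
  :ok
end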

Ziinc commented 3 years ago

Refer to the updated comment above.

kasvith commented 7 months ago

Hi @Ziinc

I have a similar use case:

  1. We have a list of URLs to fetch stored in PG.
  2. When a spider crawls a 1st-page URL, it may discover multiple links that it needs to crawl from that 1st page.
  3. After the crawlers finish crawling all the URLs (the 1st page and the rest), we need to mark the job as done.
  4. So I need to pass the job id (or something similar) that can be shared between spiders.

Do you know how we can pass metadata?

Ziinc commented 7 months ago

Work to add this has stalled. If you really need it, you can contact me privately and I'll see what I can do.

kasvith commented 7 months ago

@Ziinc sent you an email to ty@tzeyiing.com