elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Dumping data to DB instead of a file #16

Closed. krystofbe closed this issue 4 years ago

krystofbe commented 5 years ago

Thanks for this awesome library <3

Is there any possibility to write the scraped items to a database with ecto instead of writing them to file?

oltarasenko commented 5 years ago

Hi @krystofbe. Thanks for the feedback. I will check. Will get back to you soon

krystofbe commented 5 years ago

Hi @oltarasenko! What I did was write a custom Crawly.Pipeline module that saves each entry with Repo.insert. But the items still get written to disk as well.

oltarasenko commented 5 years ago

Hey @krystofbe, yes, that's the way to go I think. (Another option would be to import the CSV file directly into the database.)

> What I did was write a custom Crawly.Pipeline module that saves each entry with Repo.insert. But the items still get written to disk as well.

Regarding the items pipeline: is it the case that the pipeline you created does not do anything?

krystofbe commented 5 years ago

Right now it's a modified JSON pipeline:

  # Assumes the enclosing module declares `@behaviour Crawly.Pipeline`
  # and `require Logger`.
  @impl Crawly.Pipeline
  def run(item, state) do
    # Encode the item to JSON, then decode it back into a plain map
    case Poison.encode(item) do
      {:ok, new_item} ->
        project = Jason.decode!(new_item)

        # Persist the item via Ecto (Results.create_job/1 wraps Repo.insert)
        Results.create_job(%{
          "url" => project["url"],
          "source" => project["source"],
          "posted_at" => project["posted_at"],
          "data" => project
        })

        # Pass the encoded item on to the next pipeline in the list
        {new_item, state}

      {:error, reason} ->
        Logger.info(
          "Could not encode the following item: #{inspect(item)} into json, " <>
            "reason: #{inspect(reason)}"
        )

        # Returning false drops the item from further processing
        {false, state}
    end
  end

Results.create_job/1 inserts the item into the DB.
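
For readers landing here, a minimal sketch of what such a context function could look like, assuming a hypothetical MyApp.Results context and MyApp.Results.Job Ecto schema (neither is part of Crawly or of the snippet above):

  defmodule MyApp.Results do
    alias MyApp.Repo
    alias MyApp.Results.Job

    # Insert one scraped item as a job row
    def create_job(attrs) do
      %Job{}
      |> Job.changeset(attrs)
      |> Repo.insert()
    end
  end

  defmodule MyApp.Results.Job do
    use Ecto.Schema
    import Ecto.Changeset

    schema "jobs" do
      field :url, :string
      field :source, :string
      field :posted_at, :utc_datetime
      field :data, :map
      timestamps()
    end

    def changeset(job, attrs) do
      job
      |> cast(attrs, [:url, :source, :posted_at, :data])
      |> validate_required([:url])
    end
  end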

oltarasenko commented 5 years ago

OK, did you add it to the pipelines in the Crawly settings, as suggested here? https://oltarasenko.github.io/crawly/#/?id=pipelines-module
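
For reference, registering a custom pipeline is done in the application config; a sketch, with MyApp.EctoPipeline standing in as a hypothetical name for the module above:

  # config/config.exs
  config :crawly,
    pipelines: [
      MyApp.EctoPipeline
    ]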

krystofbe commented 5 years ago

Yep, this works very nicely this way. The only downside of this method is that the CSV file is still created, even though I don't need it. And I wouldn't want this file, because I suppose it will get very large and I would need to delete it regularly.

oltarasenko commented 5 years ago

Ok, it's my bad. I need to think about how to improve it https://github.com/oltarasenko/crawly/blob/master/lib/crawly/data_storage/data_storage_worker.ex#L36-L47

The problem is that writing to a file requires me to open/close the file descriptor. For now, please store files in the /tmp folder, and I will add a fix for this case soon!
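
A sketch of what that workaround could look like in config, assuming the Crawly version in use exposes a :base_store_path setting for the data storage worker (verify against your version's docs):

  # config/config.exs - :base_store_path is an assumption; check it exists
  # in the Crawly version you are running
  config :crawly,
    base_store_path: "/tmp"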

krystofbe commented 5 years ago

Thanks for your effort ❤️

oltarasenko commented 5 years ago

@krystofbe I have an idea of how to improve the case. I will be releasing the next version of Crawly closer to the end of the week. Btw, what do you think about adding an Ecto pipeline to Crawly? It would probably need to be generalized a bit, but I think it might be useful.
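
Purely as a sketch of what a generalized, built-in Ecto pipeline might look like (nothing below exists in Crawly; the :ecto_repo and :ecto_schema settings are made up for illustration):

  defmodule Crawly.Pipelines.Ecto do
    @behaviour Crawly.Pipeline

    @impl Crawly.Pipeline
    def run(item, state) do
      # Hypothetical settings so any project can plug in its own repo and schema
      repo = Application.fetch_env!(:crawly, :ecto_repo)
      schema = Application.fetch_env!(:crawly, :ecto_schema)

      # Assumes scraped items are maps with atom keys matching the schema fields
      case repo.insert(struct(schema, item)) do
        {:ok, _record} -> {item, state}
        {:error, _changeset} -> {false, state}
      end
    end
  end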

krystofbe commented 5 years ago

Hi @oltarasenko: I suggest that I'll look at your changes at the end of the week, then update and modify the pipeline.

Ziinc commented 5 years ago

> Btw, what do you think about adding an Ecto pipeline to Crawly? It would probably need to be generalized a bit, but I think it might be useful.

@oltarasenko the sheer number of scraped results might overwhelm the DB without a job queue, which @krystofbe is probably already using, as inferred from Results.create_job/1.

krystofbe commented 5 years ago

Absolutely, for larger demands I'd probably use GenStage with producers and consumers.
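
A rough sketch of that shape, with a producer that pipelines push items into and a consumer that inserts whatever batch GenStage delivers; the module names and the MyApp.Repo / "jobs" table are hypothetical:

  defmodule MyApp.ItemProducer do
    use GenStage

    def start_link(_), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

    # A pipeline would call this to hand an item over for async persistence
    def enqueue(item), do: GenStage.cast(__MODULE__, {:item, item})

    @impl true
    def init(:ok), do: {:producer, :ok}

    @impl true
    def handle_cast({:item, item}, state) do
      # Emit the item as an event; GenStage buffers it until a consumer asks for it
      {:noreply, [item], state}
    end

    @impl true
    def handle_demand(_demand, state), do: {:noreply, [], state}
  end

  defmodule MyApp.ItemConsumer do
    use GenStage

    def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

    @impl true
    def init(:ok) do
      # max_demand caps how many items arrive per handle_events/3 call
      {:consumer, :ok, subscribe_to: [{MyApp.ItemProducer, max_demand: 50}]}
    end

    @impl true
    def handle_events(items, _from, state) do
      # Insert the delivered batch (1..max_demand items); assumes each item is a
      # map whose keys match the "jobs" table columns
      MyApp.Repo.insert_all("jobs", items)
      {:noreply, [], state}
    end
  end

Both processes would be started under the application's supervision tree, and the custom pipeline would then call MyApp.ItemProducer.enqueue/1 instead of hitting the Repo directly.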

oltarasenko commented 4 years ago

@krystofbe We have reworked the idea of DataStorage. Now you can simply remove the WriteToFile pipeline from the list of pipelines, and nothing will be stored on the filesystem.
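
In config terms (extending the earlier sketch), that means a pipelines list without Crawly.Pipelines.WriteToFile, e.g.:

  # config/config.exs - no Crawly.Pipelines.WriteToFile entry, so nothing is
  # written to the filesystem; MyApp.EctoPipeline is the hypothetical custom
  # pipeline from earlier in the thread
  config :crawly,
    pipelines: [
      Crawly.Pipelines.JSONEncoder,
      MyApp.EctoPipeline
    ]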

Also, I would love to get a contribution for the Ecto pipeline, if possible.

Ziinc commented 4 years ago

Maybe it's better to generate the code required to write the Ecto pipeline instead of having it out of the box, since it will require the dev to wire it up anyway.

We could do an advanced GenStage pipeline template, which would be generalised for other use cases.

Ziinc commented 4 years ago

> We could do an advanced GenStage pipeline template, which would be generalised for other use cases.

Just thinking about this more: instead of generating a template, maybe a GenStagePipeline protocol would be good enough?

oltarasenko commented 4 years ago

I am not sure about GenStage. In general I would prefer not to use it.

  1. DataStorageWorker is one process, so I would not expect one crawl to kill the database.
  2. If multiple spiders are running, then yes, it might cause issues for the database (however, it might cause issues for the filesystem as well, if we're writing a lot to the HDD).

So for now I would stick with just a pure pipeline, without GenStage...

Ziinc commented 4 years ago

Will include a guide on how to write an Ecto pipeline in the dev docs in PR #31.

Ziinc commented 4 years ago

Docs on creating custom pipelines have been added; they will be available in the next release.