Closed: krystofbe closed this issue 4 years ago.

Thanks for this awesome library <3
Is there any possibility to write the scraped items to a database with Ecto instead of writing them to a file?
Hi @krystofbe. Thanks for the feedback. I will check and get back to you soon.
Hi @oltarasenko! What I did was write a custom Crawly.Pipeline module that saves each entry with Repo.insert. But the items still get written to disk as well.
Hey @krystofbe, yes, that's the way to go, I think. (Another option would be to import the CSV file directly into the database.)
Regarding the items pipeline: is it the case that the pipeline you have created does not do anything?
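As an aside on the CSV-import option mentioned above: a minimal one-off import sketch, assuming a hypothetical MyApp.Repo on the Postgres adapter and a jobs table whose columns match the CSV header.

# Bulk-load a previously written CSV export into Postgres via COPY.
# Assumes MyApp.Repo (hypothetical) uses the Postgres adapter and that the
# file is readable by the database server; otherwise use psql's \copy
# from the client side.
Ecto.Adapters.SQL.query!(
  MyApp.Repo,
  "COPY jobs (url, source, posted_at) FROM '/tmp/items.csv' WITH (FORMAT csv, HEADER true)"
)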
Right now it's a modified JSON pipeline:
@impl Crawly.Pipeline
def run(item, state) do
  case Poison.encode(item) do
    {:ok, new_item} ->
      # Decode back to a map and store it via the Results context.
      project = Jason.decode!(new_item)

      Results.create_job(%{
        "url" => project["url"],
        "source" => project["source"],
        "posted_at" => project["posted_at"],
        "data" => project
      })

      # Pass the JSON-encoded item on to the next pipeline.
      {new_item, state}

    {:error, reason} ->
      Logger.info(
        "Could not encode the following item: #{inspect(item)} into json, reason: #{inspect(reason)}"
      )

      {false, state}
  end
end
Results.create_job/1 inserts the item into the database.
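For context, this is roughly what such a context function could look like; the Job schema, its fields, and MyApp.Repo are assumptions, not the poster's actual code.

defmodule MyApp.Results.Job do
  use Ecto.Schema
  import Ecto.Changeset

  schema "jobs" do
    field :url, :string
    field :source, :string
    # Kept as a string here; a :utc_datetime field would need parsing first.
    field :posted_at, :string
    field :data, :map

    timestamps()
  end

  def changeset(job, attrs) do
    job
    |> cast(attrs, [:url, :source, :posted_at, :data])
    |> validate_required([:url])
  end
end

defmodule MyApp.Results do
  alias MyApp.Repo
  alias MyApp.Results.Job

  # Inserts one scraped item as a row in the "jobs" table.
  def create_job(attrs) do
    %Job{}
    |> Job.changeset(attrs)
    |> Repo.insert()
  end
end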
OK, did you add it to the pipelines in the Crawly settings, as suggested here? https://oltarasenko.github.io/crawly/#/?id=pipelines-module
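For reference, wiring a custom pipeline into the Crawly settings looks roughly like this; the custom module name is a placeholder, and the exact built-in pipeline list depends on your Crawly version.

# config/config.exs
import Config

config :crawly,
  pipelines: [
    # ...any built-in Crawly pipelines you use go here as well...
    # then the custom module from the snippet above (placeholder name):
    MyApp.Pipelines.SaveToDb
  ]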
Yep, this works very nicely this way. The only downside of this method is that the CSV file is still created even though I don't need it. I wouldn't want this file, because I suppose it will get very large and I would need to delete it regularly.
OK, it's my bad. I need to think about how to improve this: https://github.com/oltarasenko/crawly/blob/master/lib/crawly/data_storage/data_storage_worker.ex#L36-L47
The problem is that writing to a file requires me to open/close the file descriptor. For now, please store the files in the /tmp folder, and I will add a fix for this case soon!
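As a sketch of that workaround, and assuming the Crawly version in use exposes a base_store_path setting (verify against your version's docs, this is an assumption):

# config/config.exs — assumption: your Crawly version supports base_store_path.
import Config

config :crawly,
  base_store_path: "/tmp/"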
Thanks for your effort ❤️
@krystofbe I have an idea of how to improve this. I will be releasing the next version of Crawly closer to the end of the week. Btw, what do you think about adding an Ecto pipeline to Crawly? It would probably need to be generalized a bit, but I think it might be useful.
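To illustrate the idea, a sketch of what a generalized Ecto pipeline could look like; this is not Crawly's API, and the application-config wiring, module name, and config keys are assumptions:

defmodule MyApp.Pipelines.EctoWriter do
  # Sketch of a reusable "Ecto pipeline": the repo and schema come from
  # application config so the module itself stays generic.
  @behaviour Crawly.Pipeline

  require Logger

  @impl Crawly.Pipeline
  def run(item, state) do
    repo = Application.fetch_env!(:my_app, :crawly_repo)
    schema = Application.fetch_env!(:my_app, :crawly_schema)

    # Assumes the configured schema module defines changeset/2.
    changeset = schema.changeset(struct(schema), item)

    case repo.insert(changeset) do
      {:ok, _record} ->
        {item, state}

      {:error, changeset} ->
        Logger.info("Could not store item, errors: #{inspect(changeset.errors)}")
        {false, state}
    end
  end
end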
Hi @oltarasenko: I suggest that I look at your changes at the end of the week and then update and modify the pipeline.
@oltarasenko the sheer number of scraped results might overwhelm the DB without a job queue, which @krystofbe is probably already using, as inferred from Results.create_job/1.
Absolutely; for larger workloads I'd probably use GenStage with producers and consumers.
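A rough, purely illustrative sketch of that idea, where a producer buffers scraped items and a consumer batch-inserts them so the database sees a few large INSERTs instead of one per item; the module names, batch size, and jobs table are assumptions.

defmodule MyApp.ItemProducer do
  # Requires the :gen_stage dependency ({:gen_stage, "~> 1.0"}).
  use GenStage

  def start_link(_opts),
    do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  # Call this from the Crawly pipeline instead of inserting directly.
  def enqueue(item), do: GenStage.cast(__MODULE__, {:enqueue, item})

  @impl true
  def init(:ok), do: {:producer, {:queue.new(), 0}}

  @impl true
  def handle_cast({:enqueue, item}, {queue, demand}),
    do: dispatch(:queue.in(item, queue), demand)

  @impl true
  def handle_demand(incoming, {queue, demand}),
    do: dispatch(queue, demand + incoming)

  # Emit as many buffered items as consumers have asked for.
  defp dispatch(queue, demand) do
    {events, queue, demand} = take(queue, demand, [])
    {:noreply, events, {queue, demand}}
  end

  defp take(queue, 0, acc), do: {Enum.reverse(acc), queue, 0}

  defp take(queue, demand, acc) do
    case :queue.out(queue) do
      {{:value, item}, rest} -> take(rest, demand - 1, [item | acc])
      {:empty, rest} -> {Enum.reverse(acc), rest, demand}
    end
  end
end

defmodule MyApp.ItemConsumer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok),
    do: {:consumer, :ok, subscribe_to: [{MyApp.ItemProducer, max_demand: 500}]}

  @impl true
  def handle_events(items, _from, state) do
    # One batched INSERT per chunk instead of one INSERT per item.
    # Assumes items are maps with atom keys matching the "jobs" columns.
    MyApp.Repo.insert_all("jobs", items)
    {:noreply, [], state}
  end
end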
@krystofbe We have changed the DataStorage design. Now you can simply remove the WriteToFile pipeline from the list of pipelines, and nothing will be stored on the file system.
Also, I would love to get a contribution for the Ecto pipeline, if anyone is up for it.
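In config terms, the new behaviour amounts to something like this (the Ecto pipeline module is the hypothetical one sketched earlier):

# config/config.exs
import Config

config :crawly,
  pipelines: [
    # Crawly.Pipelines.WriteToFile is simply left out, so nothing is written
    # to the file system; persistence happens in the Ecto pipeline instead.
    MyApp.Pipelines.EctoWriter
  ]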
Maybe it's better to generate the code required to write the Ecto pipeline instead of having it out of the box, since it will require the dev to wire it up anyway.
We could do an advanced GenStage pipeline template, which could be generalised for other use cases.
Just thinking about this more: instead of generating a template, maybe a GenStagePipeline protocol would be good enough?
I am not sure about GenStage; in general, I would rather not use it. So for now I would stick with just a pure pipeline without GenStage...
Will include a guide on how to write an Ecto pipeline in the dev docs in PR #31.
Docs on creating custom pipelines have been added; they will be available in the next release.