elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Split logs #155

Closed: oltarasenko closed this issue 3 years ago

Ziinc commented 3 years ago

I think that passing around the spider name AND the crawl_id seems unnecessary. Passing around the crawl_id and using a lookup function to check the spiders stored in the engine would be a better approach. This way, the crawl_id functions as a primary key.

I think obtaining the spider module as needed from the engine (whether as an ets table or from genserver state) in the engine would be much easier, instead of passing the module name around. It would also make the runtime spiders PR easier to implement, by decoupling the running of spiders from the module.
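The lookup approach above could be sketched as a small registry keyed by crawl_id. This is only a hypothetical illustration (the module name `Crawly.Engine.Lookup` and its functions are invented here, not part of Crawly's actual API):

```elixir
# Hypothetical sketch: resolve the spider module from engine state by crawl_id,
# instead of threading the module name through every process.
defmodule Crawly.Engine.Lookup do
  use Agent

  def start_link(_opts) do
    # State is a plain map: crawl_id => spider module
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Register a running spider under its crawl_id (the "primary key").
  def register(crawl_id, spider_module) do
    Agent.update(__MODULE__, &Map.put(&1, crawl_id, spider_module))
  end

  # Look up the spider module for a given crawl_id.
  # Returns {:ok, module} or :error if no such crawl is running.
  def get_spider(crawl_id) do
    Agent.get(__MODULE__, &Map.fetch(&1, crawl_id))
  end
end
```

With this in place, workers would only need to carry the crawl_id and could resolve the module on demand, e.g. `{:ok, spider} = Crawly.Engine.Lookup.get_spider(crawl_id)`. In practice an ETS table or the engine GenServer's existing state (as suggested above) would serve the same purpose.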

Ziinc commented 3 years ago

It might also be beneficial to wrap Logger with a utility function, so that we can standardize the way the spider metadata is logged
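Such a wrapper could look roughly like this. A minimal sketch, assuming a hypothetical `Crawly.Utils.Log` module (not an existing Crawly module) that forces every call site to supply the same spider metadata:

```elixir
# Hypothetical Logger wrapper that standardizes spider metadata on every call.
defmodule Crawly.Utils.Log do
  require Logger

  # Log at :info level, always attaching spider_name and crawl_id metadata.
  def info(message, spider_name, crawl_id) do
    Logger.info(message, spider_name: spider_name, crawl_id: crawl_id)
  end

  # Same convention for errors.
  def error(message, spider_name, crawl_id) do
    Logger.error(message, spider_name: spider_name, crawl_id: crawl_id)
  end
end
```

For the metadata to appear in the output, the console backend would also need `metadata: [:spider_name, :crawl_id]` in its config.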

oltarasenko commented 3 years ago

> I think that passing around the spider name AND the crawl_id seems unnecessary. I think passing around the crawl_id and using a lookup function to check the spiders stored in the engine would be a better approach. This way, the crawl_id functions as a primary key.

I have the same feelings. At first, I did not want to do this full switch from spider_name to crawl_id, but now as soon as I have a bit of time I will do this exercise.

> I think obtaining the spider module as needed from the engine (whether as an ets table or from genserver state) in the engine would be much easier, instead of passing the module name around. It would also make the runtime spiders PR easier to implement, by decoupling the running of spiders from the module.

Yes, but I will start with something simple at first. Let me see.

> It might also be beneficial to wrap Logger with a utility function, so that we can standardize the way the spider metadata is logged

Looks like a better way to do it is to call `Logger.metadata(crawl_id: 1)` in the related processes.
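The `Logger.metadata/1` approach sets process-local metadata once, so every subsequent log call in that process carries the crawl_id without any wrapper. A minimal sketch (the worker module and the example crawl_id value are hypothetical):

```elixir
require Logger

# Attach the crawl_id to this process's logger metadata once.
# All Logger calls made by this process from here on include it.
Logger.metadata(crawl_id: "example-crawl-id")

# No extra arguments needed; crawl_id rides along as metadata.
Logger.info("Spider started")
```

Note that metadata is per-process, so each spider worker (manager, request storage, data storage, etc.) would need to call `Logger.metadata/1` in its own `init/1`, and the console backend must list `:crawl_id` in its `metadata:` config for it to show up in the output.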

oltarasenko commented 3 years ago

@Ziinc I have made another review of the code. It looks like the switch to crawl_id will require quite a bit of work, since it implies changing how we pick spiders and store things.

I'm also not quite sure what to do with the Requests/DataStorage workers, which use a spider_name to avoid duplicating these processes.

I would rather defer this question to a separate unrelated discussion.

Ziinc commented 3 years ago

I was reviewing it midway when you merged it, haha.

Regarding passing the crawl_id around, I'll revisit it when I work on the runtime spiders. Right now it feels like there's a lot of room for error.

oltarasenko commented 3 years ago

Sorry for that :(. I will address the comments above on master :(.

Yeah, unfortunately the switch to crawl_id looks a bit complex, but I think we should do it.