apify / actor-templates

This project is the :house: home of Apify actor template projects to help users quickly get started.
https://apify.com/

Scrapy template: Explore the possibility of running multiple Spiders per Actor #202

Closed vdusek closed 8 months ago

vdusek commented 1 year ago

https://github.com/apify/actor-templates/blob/087b2dc4315e029e38b6282f7d312fc80c0c4e0d/templates/python-scrapy/src/main.py#L42:L45

honzajavorek commented 9 months ago

UPDATE: I abandoned trying to get multiple spiders working; instead, I'm investing my time in implementing the monorepo approach, registering each spider as an individual actor:

```python
import importlib
import os

from apify import Actor
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    actor_path = os.environ['ACTOR_PATH_IN_DOCKER_CONTEXT']  # e.g. juniorguru_plucker/jobs_startupjobs
    spider_module_name = f"{actor_path.replace('/', '.')}.spider"

    async with Actor:
        Actor.log.info(f'Actor {actor_path} is being executed…')
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        Actor.log.info(f"Actor's spider: {spider_module_name}")
        crawler.crawl(importlib.import_module(spider_module_name).Spider)
        crawler.start()
```
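For context, the loader above assumes each actor directory ships a `spider` module exposing a class named `Spider`, so that `importlib.import_module(spider_module_name).Spider` resolves. A minimal sketch of such a module; the file path follows the `juniorguru_plucker/jobs_startupjobs` example above, while the spider name, URL, and selectors are hypothetical:

```python
# Hypothetical module: juniorguru_plucker/jobs_startupjobs/spider.py
# The class must be named `Spider` so the importlib lookup in main() finds it.
import scrapy


class Spider(scrapy.Spider):
    name = 'jobs-startupjobs'  # hypothetical spider name
    start_urls = ['https://example.com/jobs']  # placeholder URL

    def parse(self, response):
        # Illustrative selector, not taken from the actual project
        for title in response.css('h2::text').getall():
            yield {'title': title}
```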
### Original Post

I do something like this:

```python
import importlib
from pathlib import Path

from apify import Actor
from apify.scrapy.utils import apply_apify_settings
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor is being executed...')
        actor_input = await Actor.get_input() or {}
        spider_names = set(source for source in actor_input.get('sources', ['all']))
        if 'all' in spider_names:
            for path in Path(__file__).parent.glob('spiders/*.py'):
                if path.stem != '__init__':
                    spider_names.add(path.stem)
            spider_names.remove('all')
        Actor.log.info(f"Executing spiders: {', '.join(spider_names)}")
        settings = apply_apify_settings(get_project_settings())
        crawler = CrawlerProcess(settings, install_root_handler=False)
        for spider_name in spider_names:
            spider_module_name = f"{settings['NEWSPIDER_MODULE']}.{spider_name}"
            spider = importlib.import_module(spider_module_name)
            crawler.crawl(spider.Spider)
        crawler.start()
```

But for mysterious reasons, it doesn't work correctly. I'm getting this exception:

```
ValueError: Method 'parse_job' not found in:
```

Although my `startupjobs` spider has no `parse_job` method at all! That's a method of the other spider. I suspect either the asyncio or the Apify sorcery causes the code of the two spiders to somehow mingle 🤯

My full proof of concept is here: https://github.com/juniorguru/plucker/blob/404f677f4748dfae5389072fc01b7d736abbc62f/juniorguru_plucker/main.py#L40

Any ideas on what could have gone wrong?
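For what it's worth, one plausible mechanism (an assumption, not confirmed in this thread): Scrapy stores a request's callback by method name when a request is serialized, and resolves that name with `getattr` on whichever spider the crawler associates with the request on deserialization. If two spiders share one run and one scheduler, a request created by one spider can end up resolved against the other, which reproduces the exact error. A minimal sketch, with hypothetical spider classes and a simplified stand-in for Scrapy's lookup:

```python
# Sketch of the suspected failure mode; callback_from_name is a simplified
# stand-in for Scrapy's name-to-method resolution, not the real API.
class JobsSpider:
    def parse_job(self, response): ...


class StartupJobsSpider:
    def parse(self, response): ...


def callback_from_name(spider, name):
    # Raise ValueError when the named method is missing, as Scrapy does
    method = getattr(spider, name, None)
    if method is None:
        raise ValueError(f"Method {name!r} not found in: {spider}")
    return method


# A callback named by JobsSpider, but resolved against StartupJobsSpider:
callback_from_name(StartupJobsSpider(), 'parse_job')  # ValueError, as in the report
```

If that is indeed what happens, running one spider per process (the monorepo approach above) sidesteps the problem entirely, since callback names are only ever resolved against the spider that created them.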