Hi, I have to say, this is an amazing tool.
I am struggling to understand how to store the results in a JSON file for each start URL. Currently I am getting binary files for each URL within the domain, and I am having trouble retrieving the information I am looking for (domain URL, sub URL, status code, HTML content or plain text).
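To make this concrete, the kind of record I am hoping to end up with for each crawled page looks roughly like this (the field names and the example sub URL are just my own sketch, not anything undercrawler defines):

desired_record = {
    "domain_url": "https://www.bvrgroep.nl",       # the start URL I pass with -a url=...
    "sub_url": "https://www.bvrgroep.nl/contact",  # the page actually crawled (made-up example)
    "status_code": 200,
    "content": "<html>...</html>",                 # raw HTML, or extracted plain text
}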
I am running the following command:
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5
with
FILES_STORE = "\output_data"
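For comparison, with a plain Scrapy spider I would normally get readable items just by adding the standard feed export option, for example:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5 -o items.json

but I am not sure whether that is the intended way to get the page data out of undercrawler, or whether everything is supposed to go through the UndercrawlerMediaPipeline.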
What I actually get is several files without an extension at that path, so it is hard for me to get my head around the UndercrawlerMediaPipeline and how I can adjust it to store files in a readable format.
Also, I cannot find IMAGES_ENABLED in the settings file, which I wanted to use to stop downloading images.
PS: I have not activated Splash, as I do not have access to Docker on my laptop.
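In case it helps, this is roughly what my local settings look like right now; the last line is only my guess at how to disable image downloads, since I could not find IMAGES_ENABLED anywhere in settings.py:

FILES_STORE = "\output_data"
# Just my guess at the switch for image downloads; I am not sure this
# is the right setting or the right place for it.
IMAGES_ENABLED = False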
Could you please shed some light on this?