TeamHG-Memex / undercrawler

A generic crawler

How to store urls and html content to json format? #83

Open AlexPapas opened 3 years ago

AlexPapas commented 3 years ago

Hi, I have to say this is an amazing tool.

I am struggling to understand how I can store the results in a JSON file for each start URL. Currently I am getting binary files for each URL within the domain, and I have difficulty retrieving the information I am seeking (domain URL, sub-URL, status code, HTML content or plain text).

I am running the following command:

scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s CLOSESPIDER_PAGECOUNT=5

with FILES_STORE = "\output_data". This creates several files without extensions at that path, so it is hard for me to get my head around 'UndercrawlerMediaPipeline' and how I can adjust it to store files in a readable format.
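I assume Scrapy's built-in feed export would give me readable JSON of the crawled items (as opposed to the raw files the media pipeline writes), something like the following sketch, but I am not sure how it interacts with the media pipeline:

```shell
# Sketch, not verified against undercrawler: use Scrapy's feed export (-o)
# to write one JSON object per crawled item. "items.jl" is a filename I
# chose for illustration; CLOSESPIDER_PAGECOUNT limits the crawl as before.
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl \
    -s CLOSESPIDER_PAGECOUNT=5 \
    -o items.jl
```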

Also, I cannot find the IMAGES_ENABLED option in the settings file, which I want to use to stop downloading images.
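If IMAGES_ENABLED is read through Scrapy's normal settings machinery, I assume it could also be overridden per run on the command line with -s, without editing the settings file at all:

```shell
# Assumption: IMAGES_ENABLED goes through Scrapy's settings, so a -s
# override ("0" is parsed as false) should disable image downloads
# for this single run.
scrapy crawl undercrawler -a url=https://www.bvrgroep.nl -s IMAGES_ENABLED=0
```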

PS: I have not activated Splash, as I do not have access to Docker on my laptop.

Could you please shed some light on this?

AlexPapas commented 3 years ago

[screenshot attached]