datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Crawling multiple web pages #54

Open joeyzhou98 opened 4 years ago

joeyzhou98 commented 4 years ago

This might already be answered, but to my knowledge there is no way to crawl and download files from multiple subpages reachable from one main page. For example:

[screenshot: catalog page listing multiple datasets]

We can see there are multiple datasets I want to download, but there are no direct href download links on that page. I would need to click on a dataset I am interested in, and only then is there a download href link for the files:

[screenshot: individual dataset page with a download link]

Is there a way to define pipeline() so that it can crawl, starting from one main catalog page, into multiple subpages in order to download files?

kyleam commented 4 years ago

> Is there a way to define pipeline() so that it can crawl, starting from one main catalog page, into multiple subpages in order to download files?

I don't know datalad-crawler's internals well. Poking around in the repo, I'd guess the way to do this would be with a recurse node. pipelines/abstractsonline.py seems to provide the clearest example. But looking at modules like pipelines/{openfmri,crcns}.py, I'd guess the preferred design is to make the pipeline work at the individual dataset level and then define superdataset_pipeline.
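For illustration, a minimal sketch of that superdataset design, loosely modeled on the openfmri/crcns superdataset pipelines. The catalog URL, the href regex, and the "simple_with_archives" template choice are placeholder assumptions here, not something taken from the actual catalog being crawled:

```python
from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match


def superdataset_pipeline(url="https://example.org/catalog"):  # placeholder URL
    # no_annex=True: the superdataset itself only holds subdatasets
    annex = Annexificator(no_annex=True, allow_dirty=True)
    return [
        crawl_url(url),  # fetch the main catalog page
        # one record per dataset link; the named group "dataset" becomes a
        # data field that initiate_dataset below can use (regex is made up)
        a_href_match(r"(?P<url>.*/record/(?P<dataset>[^/]+))/?$"),
        annex.initiate_dataset(
            template="simple_with_archives",  # per-dataset pipeline template
            data_fields=["dataset"],
            existing="skip",  # don't re-create already-known subdatasets
        ),
    ]
```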

@yarikoptic will be able to give a more informed response.

Two comments not directly related to your question:

yarikoptic commented 4 years ago

Yep, you would probably want to first establish a pipeline that creates subdatasets (one per Zenodo dataset page), as @kyleam pointed out, and then have each dataset crawled independently.
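As a rough sketch of such a per-dataset pipeline, assuming each dataset page exposes a direct download link ending in .zip (the regex is illustrative, not Zenodo-specific):

```python
from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match


def pipeline(url):
    # create=False: the subdataset was already created by the super pipeline
    annex = Annexificator(create=False)
    return [
        crawl_url(url),             # fetch the single dataset page
        a_href_match(r".*\.zip$"),  # pick out the direct download link(s)
        annex,                      # download matched URLs into the annex
        annex.finalize(),           # commit what was added
    ]
```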

If you want/need to crawl into other pages, you can provide matchers to crawl_url, which could be used for crawling multiple pages. See e.g. https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/crcns.py#L141, a superdataset pipeline where we need to crawl multiple pages to identify all datasets.
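A hedged sketch of that matchers usage, patterned after the crcns superdataset pipeline; the pagination URL and regex are made-up placeholders:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match

# Every page whose href matches a matcher is fetched and scanned as well,
# so the crawl walks through the whole paginated listing, not just page 1.
crawl_listing = crawl_url(
    "https://example.org/catalog?page=1",
    matchers=[a_href_match(r".*/catalog\?page=[0-9]+$")],
)
```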