datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Crawling multiple web pages #54

Open joeyzhou98 opened 4 years ago

joeyzhou98 commented 4 years ago

This might already be answered, but to my knowledge there is no way to crawl and download files from multiple subpages reachable from one main page. For example:

[screenshot: catalog page listing multiple datasets]

We can see there are multiple datasets I want to download, but there are no direct href download links on that page. I would need to click on a dataset I am interested in, and only then is there a download href link for the files:

[screenshot: individual dataset page with a download link]

Is there a way to define pipeline() so that it can crawl, starting from one main catalog page, into multiple subpages in order to download files?

kyleam commented 4 years ago

> Is there a way to define pipeline() so that it can crawl, starting from one main catalog page, into multiple subpages in order to download files?

I don't know datalad-crawler's internals well. Poking around in the repo, I'd guess the way to do this would be with a recurse node. pipelines/abstractsonline.py seems to provide the clearest example. But looking at modules like pipelines/{openfmri,crcns}.py, I'd guess the preferred design is to make the pipeline work at the individual dataset level and then define superdataset_pipeline.
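For illustration, a minimal sketch of that superdataset design, loosely modeled on the openfmri/crcns superdataset pipelines. The catalog URL, the href regex, and the "simple_with_archives" template choice are placeholder assumptions here, not something taken from the actual catalog being crawled:

```python
from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match


def superdataset_pipeline(url="https://example.org/catalog"):  # placeholder URL
    # no_annex=True: the superdataset itself only holds subdatasets
    annex = Annexificator(no_annex=True, allow_dirty=True)
    return [
        crawl_url(url),  # fetch the main catalog page
        # one record per dataset link; the named group "dataset" becomes a
        # data field that initiate_dataset below can use (regex is made up)
        a_href_match(r"(?P<url>.*/record/(?P<dataset>[^/]+))/?$"),
        annex.initiate_dataset(
            template="simple_with_archives",  # per-dataset pipeline template
            data_fields=["dataset"],
            existing="skip",  # don't re-create already-known subdatasets
        ),
    ]
```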

@yarikoptic will be able to give a more informed response.

Two comments not directly related to your question:

yarikoptic commented 4 years ago

Yep, you would probably want to first establish a pipeline that creates subdatasets (one per Zenodo dataset page), as @kyleam pointed out, and then have each dataset crawled independently.
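As a rough sketch of such a per-dataset pipeline, assuming each dataset page exposes a direct download link ending in .zip (the regex is illustrative, not Zenodo-specific):

```python
from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match


def pipeline(url):
    # create=False: the subdataset was already created by the super pipeline
    annex = Annexificator(create=False)
    return [
        crawl_url(url),             # fetch the single dataset page
        a_href_match(r".*\.zip$"),  # pick out the direct download link(s)
        annex,                      # download matched URLs into the annex
        annex.finalize(),           # commit what was added
    ]
```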

If you want/need to crawl into other pages, you can provide matchers to crawl_url, which could be used for crawling multiple pages. See e.g. https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/crcns.py#L141, a superdataset pipeline where we need to crawl multiple pages to identify all datasets.
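A hedged sketch of that matchers usage, patterned after the crcns superdataset pipeline; the pagination URL and regex are made-up placeholders:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match

# Every page whose href matches a matcher is fetched and scanned as well,
# so the crawl walks through the whole paginated listing, not just page 1.
crawl_listing = crawl_url(
    "https://example.org/catalog?page=1",
    matchers=[a_href_match(r".*/catalog\?page=[0-9]+$")],
)
```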