joeyzhou98 opened 5 years ago
Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?
I don't know datalad-crawler's internals well. Poking around in the repo, I'd guess the way to do this would be with a recurse node. pipelines/abstractsonline.py seems to provide the clearest example. But looking at modules like pipelines/{openfmri,crcns}.py, I'd guess the preferred design is to make the pipeline work at the individual dataset level and then define superdataset_pipeline.

@yarikoptic will be able to give a more informed response.
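To make that concrete, here is a rough sketch of the two-level layout, loosely modeled on the openfmri/crcns modules. The URLs, regexes, and template name below are made-up placeholders, not anything taken from this repo:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.misc import assign
from datalad_crawler.nodes.annex import Annexificator


def superdataset_pipeline(url="https://example.org/catalog"):
    # Top level: visit the catalog page, match links to the individual
    # dataset pages, and initiate one subdataset per match.
    annex = Annexificator(no_annex=True, allow_dirty=True)
    return [
        crawl_url(url),
        a_href_match(".*/record/(?P<dataset>[0-9]+)$"),  # placeholder pattern
        assign({'dataset_name': '%(dataset)s'}, interpolate=True),
        annex.initiate_dataset(
            template="mypipeline",   # hypothetical module defining pipeline() below
            data_fields=['dataset'],
            existing='skip',
        ),
    ]


def pipeline(dataset):
    # Per-dataset level, run inside each subdataset: crawl that dataset's
    # own page and annex whatever the download links point to.
    annex = Annexificator(create=False, statusdb='json')
    return [
        crawl_url("https://example.org/record/%s" % dataset),
        a_href_match(".*/files/.*"),  # placeholder pattern for the download links
        annex,                        # downloads and annexes each matched URL
        annex.finalize(),
    ]
```

The point is just the split: the superdataset pipeline only discovers dataset pages and creates subdatasets, while each subdataset's own pipeline does the actual file crawling.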
Two comments not directly related to your question:
- yeap, probably you would like to first establish a pipeline to create subdatasets (one per each zenodo dataset page), as @kyleam has pointed out, and then have each dataset crawled independently.
- if you want/need to crawl into other pages, you can provide matchers to crawl_url, which could be used for crawling multiple pages (a sketch follows below). See e.g. the superdataset pipeline at https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/crcns.py#L141, where we need to crawl multiple pages to identify all datasets.
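For illustration, a minimal sketch of the matchers bit; the catalog URL and the pagination regex are placeholders I made up:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match

# Visit the catalog page and additionally crawl any "?page=N" links found
# there (and on the pages reached that way), so the downstream nodes see the
# content of every catalog page rather than just the first one.
crawl_catalog = crawl_url(
    "https://example.org/catalog",
    matchers=[a_href_match(r".*/catalog\?page=[0-9]+$")],
)
```

Such a node would simply replace a plain crawl_url(url) at the top of a pipeline, e.g. in a superdataset pipeline whose catalog spans several pages.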
It might already be answered, but to my knowledge I haven't found a way to crawl and download files from multiple sub pages starting from one main page. For example, here:
We can see there are multiple datasets I want to download; however, there are no direct href download links on the page. I would need to click on a dataset I am interested in, and only then is there a download href link for the files:
Is there a way to define pipeline() so that it is able to crawl, starting from one main catalog page, to multiple sub pages in order to download files?
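For reference, the kind of single pipeline I have in mind would look roughly like this. It assumes that a crawl_url node without a fixed url picks up the URL to visit from the incoming data (its input='url' argument), which I have not verified, and the URLs/regexes are placeholders:

```python
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match
from datalad_crawler.nodes.annex import Annexificator


def pipeline(url="https://example.org/catalog"):
    annex = Annexificator(create=False, statusdb='json')
    return [
        crawl_url(url),                      # the main catalog page
        a_href_match(".*/record/[0-9]+$"),   # placeholder: links to the dataset sub pages
        # Assumption (unverified): with no fixed url, crawl_url fetches the
        # 'url' field produced by the matcher above, i.e. each sub page in turn.
        crawl_url(),
        a_href_match(".*/files/.*"),         # placeholder: the download links on each sub page
        annex,                               # download and annex each file
        annex.finalize(),
    ]
```

This would put all files into a single dataset; whether that is preferable to the subdataset layout sketched earlier is a separate question, since the comments above suggest the per-dataset design is the intended one.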