datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Save results from crawler in subdirectories #64

Closed mathdugre closed 4 years ago

mathdugre commented 4 years ago

I'm writing a template to crawl the Loris candidate API and want my results to be stored in multiple sub-directories.

candidates/001/  # Datalad dataset
     |____> .datalad/
     |____> .git/
     |____> images/
     |          |___> img_1
     |          |___> img_2
     |____> instruments/
                |___> results_1
                |___> results_2
                |___> results_3

I haven't found a way to save the results in different sub-directories.

I tried using different Annexificator nodes and passing each one the path of the sub-directory in which to save that type of result; however, I get the following error:

    [ERROR  ] Running the pipeline function resulted in some_path/candidates/001/images [gitrepo.py:__init__:721]
    FYI this pipeline only takes the following args: ['url'] [crawl_init.py:__call__:96] (RuntimeError)

yarikoptic commented 4 years ago

Sorry for the lack of documentation. Do you have your own crawler "template" or do you use one of ours? Annexificator does care about the path key in the record you pass to its call, so you should be able to provide a path to store the filename under. E.g. see the pipeline for balsa: https://github.com/datalad/datalad-crawler/blob/master/datalad_crawler/pipelines/balsa.py#L171
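
Based on that description of the path key and the structure of the balsa pipeline, a template that fans results out into images/ and instruments/ might look roughly like the sketch below. This is an untested sketch: the match patterns and the use of a_href_match are placeholders (the actual Loris candidate API will likely need different matching or parsing nodes), and create=True is assumed here so that the missing sub-directories can be created.

    # Hypothetical crawler template; node names follow the balsa pipeline
    # referenced above, but the URL patterns below are placeholders.
    from datalad_crawler.nodes.annex import Annexificator
    from datalad_crawler.nodes.crawl_url import crawl_url
    from datalad_crawler.nodes.matches import a_href_match
    from datalad_crawler.nodes.misc import assign


    def pipeline(url):
        # create=True (assumed) lets the node create missing sub-directories
        annex = Annexificator(create=True)
        return [
            crawl_url(url),
            [   # branch: store matched image files under images/
                a_href_match('.*/images/.*'),        # placeholder pattern
                assign({'path': 'images'}),          # directory to store 'filename' under
                annex,
            ],
            [   # branch: store matched instrument results under instruments/
                a_href_match('.*/instruments/.*'),   # placeholder pattern
                assign({'path': 'instruments'}),
                annex,
            ],
            annex.finalize(),
        ]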

mathdugre commented 4 years ago

Hi @yarikoptic, I am creating my own template. I based it on another one and just realized that template creates its Annexificator with create=False as a parameter, so the sub-directories were not created. Thank you for your help!
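
For reference, the relevant change is just the constructor argument, roughly as follows (assuming create is the parameter in question):

    # allow the node to create missing sub-directories such as images/ and instruments/
    annex = Annexificator(create=True)   # the copied template had create=False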

yarikoptic commented 4 years ago

I will consider the issue resolved; let me know if it is not.