datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

NDA crawler fails with error #56

Open loj opened 5 years ago

loj commented 5 years ago

I'm attempting to use DataLad's NDA crawler for a dataset I'm trying to download, but I'm running into problems. Following the instructions in the datalad crawler docs, I ran the following:

$ datalad create -c text2git nda_crawler
[INFO   ] Creating a new annex repo at /data/BnB_USER/loj/downloads/nda_crawler 
[INFO   ] Running procedure cfg_text2git
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
create(ok): /data/BnB_USER/loj/downloads/nda_crawler (dataset)
$ datalad crawl-init --save --template nda collection=2274
[INFO   ] Creating a pipeline for the NDA bucket

However, the crawl fails. :-(

$ datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
[INFO   ] Creating a pipeline for the NDA bucket 
[INFO   ] Running pipeline [[assign(assignments=<<{'filename': 'collecti...>>, interpolate=False), <datalad_crawler.nodes.annex.Annexificator object at 0x7f5dafec9320>], [crawl_mindar_images03(collection='2274'), continue_if(negate=False, re=True, values=<<{'url': 's3://(?P<buck...>>), <datalad_crawler.nodes.annex.Annexificator object at 0x7f5dafec9320>]] 
[ERROR  ] Failed to create the collection: Prompt dismissed.. [SecretService.py:get_preferred_collection:58] (InitError) 

I'm running datalad version 0.12.0rc5 and the latest master of datalad crawler.
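
The `SecretService.py:get_preferred_collection` reference in the error suggests it is raised by the Python `keyring` library's SecretService backend (credential storage) rather than by the crawler pipeline itself. A minimal sketch for checking which keyring backend is active, assuming the `keyring` package is importable in the same environment as datalad; the helper name `describe_keyring_backend` is hypothetical:

```python
import importlib


def describe_keyring_backend():
    """Return a short description of the active keyring backend.

    Assumes the third-party `keyring` package. Returns a note instead of
    raising if the package is missing or the backend cannot be loaded.
    """
    try:
        keyring = importlib.import_module("keyring")
    except ImportError:
        return "keyring is not installed in this environment"
    try:
        backend = keyring.get_keyring()
    except Exception as exc:  # backend initialization can fail on some systems
        return f"could not load a keyring backend: {exc}"
    return f"{type(backend).__module__}.{type(backend).__name__}"


if __name__ == "__main__":
    print(describe_keyring_backend())
```

If this reports a SecretService backend on a headless session, the unlock prompt probably has nowhere to be displayed, which would match "Prompt dismissed". The `keyring` library lets you select a different backend through the `PYTHON_KEYRING_BACKEND` environment variable; whether that resolves this particular failure is untested here.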

One of my concerns is whether I'm using the correct information for the "collection". NDA requires that the user create a "package" for any downloads. So I've created my package to download this dataset, and I have the package identifier, but my understanding of this crawler is that it wants the dataset ID, not the package identifier (which I also tried, but it too failed with the same error)... The point is, I'm unsure if I'm doing the right thing here. Thoughts?

Thanks! --Laura

yarikoptic commented 5 years ago

Well, the NDA crawler was pretty much a prototype years back, and the "NDA ways" of delivering content have changed since then ... even the NDA authentication adapter is no longer working: https://github.com/datalad/datalad/issues/3674 . We had some initial dialog with @obenshaindw (and @agt24) on how datalad could (after future refactoring) interface with NDA, but so far nobody has had the juice/time or a pressing use case to move it forward. I feel like you have a use case? or it was just an example of no particular interest/need?

agt24 commented 5 years ago

It'd be good to revisit this. @yarikoptic do you have a record of the ticket number at https://ndar.zendesk.com ?

I can't find it for some reason

yarikoptic commented 5 years ago

I can't find any email of mine that relates to datalad on ndar.zendesk

loj commented 5 years ago

Thanks for the response. :-)

I feel like you have a use case? or it was just an example of no particular interest/need?

@yarikoptic Yeah, this is for a dataset I'm downloading at work. Over the next couple of months, I'll be downloading 2-4 datasets from the NDA. If you need more information about what we're doing, I can explain further.

Using the crawler to achieve this isn't critical; my fallback is to use NDAR/nda-tools to download the data.

yarikoptic commented 5 years ago

Ok, I guess just fallback for now

agt24 commented 5 years ago

I'll ask David about it next time I see him.

On Wed, Sep 25, 2019 at 7:15 AM Yaroslav Halchenko notifications@github.com wrote:

Ok, I guess just fallback for now

yarikoptic commented 4 years ago

@loj Did you establish some workflow to fetch datasets from NDA? One way (fixing up datalad and/or datalad-crawler) or another (a custom extension or set of scripts, like for ukbiobank), it would be nice to have it available to a wider audience.

loj commented 4 years ago

Unfortunately I haven't yet, but this is still on my to-do list. I hope to get to it soon, and will definitely share once I have something. :-)