In its inception the datalad crawler had some logic for crawling regular HTTP indexes that covered the majority of cases. The current design in crawl_url allows for greater flexibility by specifying an arbitrary set of matchers, but lacks a pre-crafted "universal" index traversal helper. Typically it is just a matter of following URLs which end with / and lead to subdirectories (e.g. a standard Apache index as on http://data.pymvpa.org/). I think we need that helper, and it should become smart(er) to deal with obscure indexes as well, such as https://afni.nimh.nih.gov/pub/dist/edu/data/, where subdirectories are "annotated" only by the text in the first column, but in reality each of those URLs redirects to the actual location:
so our helper could first request the headers for every file and analyze them before yielding or recursing down. Here is a shot at how it could be done with datalad:
In [20]: url = 'https://afni.nimh.nih.gov/pub/dist/edu/data/CD'
In [21]: from datalad.downloaders.providers import Providers
...: providers = Providers.from_config_files()
In [22]: downloader = providers.get_provider(url).get_downloader(url)
In [23]: try:
...: ses = downloader.get_downloader_session(url, allow_redirects=False)
...: except Exception as exc:
...: pass
...:
In [24]: exc.url
Out[24]: u'https://afni.nimh.nih.gov/pub/dist/edu/data/CD/'
so we got a new URL which already signals a subdirectory. But in general we could also query that URL to see whether it redirects yet again (i.e. we should probably follow all redirects to obtain the "final url"):
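The redirect check itself needs nothing datalad-specific; a minimal sketch with only the stdlib follows. The `fetch` parameter is an assumption of this sketch (injectable so the logic can be exercised without touching the network); the default path issues a HEAD request and lets urlopen's automatic redirect handling report the final location:

```python
from urllib.request import Request, urlopen

def final_url(url, fetch=None):
    """Follow all redirects and return the final URL.

    `fetch` maps a URL to its redirect-resolved URL; by default a HEAD
    request is issued, and urlopen (which follows redirects on its own)
    exposes the final location via .geturl().
    """
    if fetch is None:
        def fetch(u):
            with urlopen(Request(u, method='HEAD')) as resp:
                return resp.geturl()
    return fetch(url)

def looks_like_subdirectory(url, fetch=None):
    """Treat an index entry as a subdirectory iff its final URL ends with '/'."""
    return final_url(url, fetch).endswith('/')
```

With a fake resolver standing in for the server, an entry like the CD one above would classify as a subdirectory because its resolved form gains a trailing slash.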
In [25]: ses = downloader.get_downloader_session(exc.url, allow_redirects=False)
In [26]: ses.headers
Out[26]: {'Content-Length': '1029', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Keep-Alive': 'timeout=5, max=100', 'Server': 'Apache', 'Last-Modified': 'Mon, 27 Nov 2017 18:07:34 GMT', 'Connection': 'Keep-Alive', 'ETag': '"405-55efac6d6b498"', 'Date': 'Wed, 24 Jan 2018 16:16:58 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Url-Filename': u''}
actually altogether we could just use the returned url if we allow redirects ;)
In [30]: ses = downloader.get_downloader_session(url)
In [31]: ses.url
Out[31]: u'https://afni.nimh.nih.gov/pub/dist/edu/data/CD/'
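Putting the pieces together, the whole traversal helper could then look something like the sketch below. This is not the datalad API, just an illustration of the recursion under stated assumptions: `fetch(url)` returns the HTML of an index page and `resolve(url)` returns the redirect-resolved URL (both injectable, so the example runs against canned pages); a real helper would also need to skip Apache's `?C=...` sort links:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets of all <a> tags on an index page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def traverse_index(url, fetch, resolve, seen=None):
    """Yield ('dir'|'file', url) pairs for every entry under an HTTP index.

    Each link is first redirect-resolved; entries whose final URL ends
    with '/' are recursed into, everything else is yielded as a file.
    """
    seen = set() if seen is None else seen
    base = resolve(url)
    if base in seen:
        return
    seen.add(base)
    parser = LinkExtractor()
    parser.feed(fetch(base))
    for href in parser.links:
        target = resolve(urljoin(base, href))
        # stay inside the index; this also drops the parent-directory link
        if not target.startswith(base) or target == base:
            continue
        if target.endswith('/'):
            yield 'dir', target
            yield from traverse_index(target, fetch, resolve, seen)
        else:
            yield 'file', target
```

Fed canned pages where a bare `CD` entry resolves to `CD/`, the generator descends into the subdirectory exactly as the HEAD-based probing above suggests, without ever having to parse the "annotation" text in the index's first column.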