datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

crawl http index(es) helper #76

Open yarikoptic opened 6 years ago

yarikoptic commented 6 years ago

What is the problem?

In its inception, the datalad crawler had some logic for crawling regular HTTP indexes that covered the majority of cases. The current design of crawl_url allows for greater flexibility by specifying an arbitrary set of matchers, but it lacks a pre-crafted "universal" index traversal helper. Typically it is just a matter of following URLs which end with / and lead to subdirectories (e.g. a standard Apache index as on http://data.pymvpa.org/). I think we need such a helper, and it should be smart(er) so it can also deal with obscure indexes, such as https://afni.nimh.nih.gov/pub/dist/edu/data/, where subdirectories are "annotated" only by the text in the first column, but in reality each of those URLs redirects to the actual location:

$> wget --max-redirect=0 -S https://afni.nimh.nih.gov/pub/dist/edu/data/CD 
--2018-01-24 11:14:11--  https://afni.nimh.nih.gov/pub/dist/edu/data/CD
Resolving afni.nimh.nih.gov (afni.nimh.nih.gov)... 156.40.187.114, 2607:f220:419:4103::114
Connecting to afni.nimh.nih.gov (afni.nimh.nih.gov)|156.40.187.114|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 301 Moved Permanently
  Date: Wed, 24 Jan 2018 16:14:11 GMT
  Server: Apache
  Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
  Location: https://afni.nimh.nih.gov/pub/dist/edu/data/CD/
  Content-Length: 255
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/html; charset=iso-8859-1
Location: https://afni.nimh.nih.gov/pub/dist/edu/data/CD/ [following]
0 redirections exceeded.
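
In plain requests terms (purely to illustrate the idea, not what the helper would actually use), that check boils down to probing each entry without following redirects and looking at where the Location header points:

import requests

def resolve_entry(url):
    """Probe ``url`` without following redirects.

    Returns (final_url, is_subdir_hint): a 301/302 whose Location ends
    with '/' is a strong hint that the entry is really a subdirectory,
    as with the AFNI index above.
    """
    # HEAD is enough here -- we only care about the status and headers
    r = requests.head(url, allow_redirects=False)
    if r.is_redirect:
        location = r.headers['Location']
        return location, location.endswith('/')
    return url, url.endswith('/')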

So our helper could first request the headers for every entry and analyze them before yielding it or recursing down. Here is a shot at how it could be done with datalad:

In [20]: url = 'https://afni.nimh.nih.gov/pub/dist/edu/data/CD'

In [21]: from datalad.downloaders.providers import Providers
    ...: providers = Providers.from_config_files()

In [22]: downloader = providers.get_provider(url).get_downloader(url)

In [23]: try:
    ...:     ses = downloader.get_downloader_session(url, allow_redirects=False)
    ...: except Exception as exc:
    ...:     pass
    ...: 

In [24]: exc.url
Out[24]: u'https://afni.nimh.nih.gov/pub/dist/edu/data/CD/'

So we got a new URL which already signals a subdirectory. But in general we could also query that one to see whether it redirects yet again (i.e. we should probably follow all redirects to get the "final url"):

In [25]: ses = downloader.get_downloader_session(exc.url, allow_redirects=False)

In [26]: ses.headers
Out[26]: {'Content-Length': '1029', 'Accept-Ranges': 'bytes', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Keep-Alive': 'timeout=5, max=100', 'Server': 'Apache', 'Last-Modified': 'Mon, 27 Nov 2017 18:07:34 GMT', 'Connection': 'Keep-Alive', 'ETag': '"405-55efac6d6b498"', 'Date': 'Wed, 24 Jan 2018 16:16:58 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Url-Filename': u''}
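
Those headers offer another, weaker signal that could be cross-checked: the listing comes back as text/html, whereas actual files usually carry a more specific Content-Type. Just a heuristic sketch (plenty of legitimate files are text/html too, so this could only ever be a secondary check):

def looks_like_index(headers):
    # Heuristic only: Apache directory listings are served as text/html;
    # a more specific Content-Type (gzip, nifti, tar, ...) suggests a file
    return headers.get('Content-Type', '').startswith('text/html')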

Actually, we could altogether just use the returned url if we allow redirects ;)

In [30]: ses = downloader.get_downloader_session(url)

In [31]: ses.url
Out[31]: u'https://afni.nimh.nih.gov/pub/dist/edu/data/CD/'
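
Putting it together, the helper could just let the HTTP client follow redirects, treat any final URL ending in / as a subdirectory to recurse into, and yield everything else as a file. A rough sketch, using plain requests and a naive href regex purely for illustration (the real helper would of course go through datalad's providers/downloaders as above and plug into the crawl_url matcher machinery):

import re
from urllib.parse import urljoin

import requests

def crawl_index(url, visited=None):
    """Yield (file_url, headers) for everything reachable under an HTTP index.

    Follows redirects so that entries like .../CD resolve to .../CD/ first,
    recurses into anything whose final URL ends with '/', and yields the
    rest as files.  Purely a sketch -- no depth limits, no matcher hooks.
    """
    visited = visited if visited is not None else set()
    r = requests.get(url, allow_redirects=True)
    url = r.url                      # the "final url" after any redirects
    if url in visited:
        return
    visited.add(url)
    for href in re.findall(r'href="([^"?#]+)"', r.text):
        sub = urljoin(url, href)
        if not sub.startswith(url) or sub == url:
            continue                 # skip parent links, external links, etc.
        # resolve redirects for the entry itself before deciding what it is
        head = requests.head(sub, allow_redirects=True)
        if head.url.endswith('/'):
            for entry in crawl_index(head.url, visited):
                yield entry
        else:
            yield head.url, head.headers

For http://data.pymvpa.org/ this walks the plain Apache listing directly; for the AFNI index the allow_redirects=True round trip turns .../CD into .../CD/ before the trailing-slash test, which is exactly the behaviour shown in the session above.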