datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Generic framework for crawling data providers with versions #22

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

There are two aspects:

Versioning

ATM some crawling pipelines do care about versioning datasets, e.g.

Pipelines reuse

The simple_with_archives pipeline is already reused one way or another in other pipelines (simple_s3, stanford_lib) to provide a multi-branch setup for implementing archives processing (extraction, cleanup, etc.) via multiple branches:

addurls

Datalad "core" now has addurls which provides quite an extended/flexible implementation to populate a dataset (including git annex metadata) from a set of records, e.g. as typically provided in .tsv or .json file. But it doesn't provide crawler's functionality of being able to monitor remote urls (or those entire records) for changes

So in principle, based on those experiences, and having additional desires in mind (being able to run multiple pipelines in the same dataset, maybe in different branches), it seems worth producing some "ultimate" pipeline which would rely on obtaining records with source urls, versions, etc., and perform all the necessary steps to handle versioning. Specialized pipelines would then only implement provider-specific logic, feeding that pipeline with those records.
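
Purely to illustrate the idea (nothing below exists as such), the provider-specific part could boil down to emitting records along these lines, with the generic pipeline handling the versioning and addurl'ing:

    # Hypothetical shape of the records a provider-specific pipeline would
    # yield; all field names here are made up for illustration.
    records = [
        {
            "url": "https://example.com/data/sub-01_T1w.nii.gz",
            "path": "sub-01/anat/sub-01_T1w.nii.gz",
            "version-id": "abc123",                    # provider-specific version token
            "last-modified": "2019-01-01T00:00:00Z",   # used to detect updates
        },
        # ... one record per file/url ...
    ]
    # The generic pipeline would compare such records against the previously
    # crawled state, addurl new/changed entries, and store the version info.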

It might be worth approaching this after (or while working on) a solution for #20, which would provide yet another "versioned" provider, to see how such a pipeline could generalize across all openfmri/s3/figshare cases.

yarikoptic commented 4 years ago

I think an additional item for the list is handling of subdatasets, so I am dumping some "thinking out loud" in here.

Subdatasets

ATM crawlers such as openfmri, crcns, etc. rely on a dedicated function which returns a dedicated pipeline for the top level superdataset; that pipeline creates subdatasets while populating them with per-subdataset crawl configuration. The simple_s3 pipeline can be instructed to create a subdataset for each subdirectory it finds at "this level", which is what we would want e.g. to separate each subject into an independent subdataset for HCP, or what we already do for some crawled INDI datasets with a subdataset per site. But it is inflexible - we cannot prescribe generating subdatasets for some directories but not for others, or say to do that for up to X levels of subdirectories.

addurls in datalad-core also has functionality to establish subdataset boundaries by using // in the path specification. So in the case of HCP it would have been something like {subject}//.... That is quite nice in its flexibility - a single prescription establishes subdatasets across multiple levels - but it has no way to make them conditional either.
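
A hedged sketch of that // boundary via the Python API (again with a made-up table and columns):

    # Sketch of the '//' subdataset-boundary feature of addurls; 'urls.tsv'
    # and the {subject}/{filename} columns are placeholders.
    import datalad.api as dl

    dl.addurls(
        urlfile='urls.tsv',
        urlformat='{original_url}',
        filenameformat='{subject}//{filename}',  # '//' makes each subject its own subdataset
        dataset='.',
    )
    # The boundary applies uniformly to every record -- there is no way to make
    # it conditional (e.g. "subdataset only for some subjects").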

In general it seems that it would be nice to be able to specify more flexibly:

  1. whether a given subdirectory/path should become a subdataset - probably via a regular expression on the target path
  2. what to do with it - probably via a list of procedures to be run, and the crawling configuration (might be as easy as "inherit"?) to be saved.

E.g. for HCP, if we decide to split into subdatasets at the subject level, and then also make all subdirectories which do not match release-notes subdatasets, expressing the idea in YAML for now could look something like:

subdatasets:
 - path_regex: "^[0-9]{6}"
   crawler_config: inherit
   procedures: 
   - cfg_text2git
 - full_path_regex: ".*/[0-9]{6}/(?!release-notes)"
   crawler_config: inherit
   procedures: 
   - hcp_subject_data_dataset
   - name: cfg_metadatatypes
     args:
     - dicom
     - nifti1
     - xmp

to be specified at the top level dataset, so it could be inherited and used in subdatasets as is. But maybe that would be undesired, e.g. so that "^[0-9]{6}" doesn't match some data directory named that way within subdatasets? I introduce matching by full_path_regex since a subdataset wouldn't know its super's name (or we could introduce some superds_path to avoid matching on really full paths). Sure thing, we could also just rely on procedures to establish per-subdataset crawling configuration and then not inherit the same one from the top level, but I wonder if we could achieve "one crawl config sufficient to describe it all".
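
Roughly, the matching side of such a spec could boil down to something like this (a hypothetical sketch; none of these helpers exist in the crawler):

    import re

    # Hypothetical evaluation of the "subdatasets" spec above: decide whether a
    # crawled path should become a subdataset and with which entry/procedures.
    # superds_relpath is this dataset's path relative to the top-level
    # superdataset, so full_path_regex can work without the subdataset knowing
    # the super's name.
    def match_subdataset_spec(spec, relpath, superds_relpath=""):
        full_path = f"{superds_relpath}/{relpath}".lstrip("/")
        for entry in spec.get("subdatasets", []):
            if "path_regex" in entry and re.match(entry["path_regex"], relpath):
                return entry
            if "full_path_regex" in entry and re.match(entry["full_path_regex"], full_path):
                return entry
        return None  # not a subdataset -- treat as a regular path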

In the above, the hcp_subject_data_dataset procedure is just for demo purposes - not sure what custom steps should be done in there if we allow flexible specification of parametrized procedures to be run. But I just wanted to demonstrate that we should allow mixing procedures with and without parameters.

With such a setup we could also arrange for the preprocessed per-task folders in HCP to be subdatasets with something like

 - full_path_regex: ".*/[0-9]{6}/[^/]+/Results/[^/]+"
 ...

(potentially just mixing it into the regex for the parent dataset)

A somewhat alternative organization of the specification could be to orient it around "paths", with the default action being "addurl" (what happens now), while allowing for others ("subdataset", etc.):

paths:
 - path_regex: "^[0-9]{6}$"
   action: subdataset
   procedures:
   - inherit_crawler_config 
   - cfg_text2git
 - full_path_regex: ".*/[0-9]{6}/(?!release-notes)$"
   action: subdataset
   procedures: 
   - inherit_crawler_config
   - hcp_subject_data_dataset
   - name: cfg_metadatatypes
     args:
     - dicom
     - nifti1
     - xmp

so we could use the same specification to provide alternative actions, such as "skip" (which some pipelines currently allow for):

paths:
 - path: .xdlm$
   action: skip
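
A hypothetical sketch of how such a "paths" spec could be dispatched on (the first matching entry wins; "addurl" is the default):

    import re

    # Hypothetical dispatch over the "paths" specification: the first matching
    # entry decides the action ("addurl" by default, or "subdataset", "skip", ...).
    def action_for(spec, relpath, full_path):
        for entry in spec.get("paths", []):
            pattern = entry.get("full_path_regex") or entry.get("path_regex") or entry.get("path")
            target = full_path if "full_path_regex" in entry else relpath
            if pattern and re.search(pattern, target):
                return entry.get("action", "addurl"), entry.get("procedures", [])
        return "addurl", []
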
yarikoptic commented 4 years ago

attn @mih et al (@kyleam @bpoldrack @TobiasKadelka) who might be interested:

While trying to come up with a temporary hack for the current simple_s3 pipeline for more flexible decision making on creating subdatasets (for HCP), I realized that we have to decide between two possible ways to go:

  1. (current setup) subdatasets carry their own crawling configuration, but then superdataset crawling might uselessly traverse the entire tree just to have the paths which belong to subdatasets ignored (since crawling of subdatasets should be done within the subdatasets).

  2. the crawler is finally RFed to use DataLad's high level API, which seamlessly crosses dataset boundaries; the superdataset is the one which carries the actually used crawling configuration, and operations are done on the entire tree of subdatasets.

I am leaning toward 2. Now a bit more on each of those:

1

The benefit of 1. is the ability to later take a collection of subdatasets and recrawl them independently. Something yet to be attempted/used in the wild.

With the current simple_s3, there are two modes depending on the directory setting:

Cons: Unfortunately, in general, with the overall design of the "url producer -> [more nodes] -> annexificator (pretty much addurl)" pipeline, there is no easy way to tell the "url producer" (e.g. the S3 bucket listing procedure) not to go into some "folders", since within [more nodes] some path renames might happen, and thus early decision making based on paths to submodules (which might have been established in the initial run) wouldn't "generalize" upon rerun. We could provide some ad-hoc option "ignore paths within submodules", but IMHO that would not be a proper solution. So the only way to make it work is via an often expensive traversal of the entire tree while crawling the superdataset, while effectively ignoring all paths which lead into subdatasets.
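
A toy sketch (not the actual crawler node API) of why an early "do not descend" decision in the url producer is fragile - the final paths are only known after the renaming nodes have run:

    # Toy illustration only -- not datalad-crawler's node API.  Records flow
    # "url producer -> [more nodes] -> annexificator"; a rename in the middle
    # means the producer cannot reliably skip paths that end up inside submodules.
    def s3_listing():                            # "url producer"
        yield {"url": "s3://bucket/raw/100408/t1.nii.gz",
               "filename": "raw/100408/t1.nii.gz"}

    def strip_raw(record):                       # one of "[more nodes]"
        record["filename"] = record["filename"].replace("raw/", "", 1)
        return record

    def in_submodule(path):                      # only meaningful on the *final* path
        return path.startswith("100408/")

    for rec in s3_listing():
        rec = strip_raw(rec)
        if in_submodule(rec["filename"]):
            continue                             # crawled within the subdataset instead
        # ... the annexificator (addurl) would handle the record here ...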

Maybe there is some overall crawler pipelining solution possible (to instantiate crawlers within subdatasets where a path goes into a subdataset, and somehow feed them those records), but it would fall into the same trap as outlined below -- crawling individual subdatasets would potentially be different from crawling from the superdataset.

2

Going forward, I kinda like this way better since it would

Cons:

2 with config for 1 (mneh)

We could still populate crawler configuration within subdatasets, but it would lack "versioning" information (although maybe there is a workaround via storing versioning information updates in each subdataset along the path upon each new file being added/updated). Even if we populate all the pieces needed for recrawling, since it would not actually be the crawling configuration used originally, it would be fragile etc., and re-crawling subdatasets individually would probably fail and/or result in a full recrawl of the subdataset.

2 would still allow for 1

It should still be possible, where really desired, to not recurse and to stop at the subdirectory/subdataset boundary (with the simple_s3 pipeline, I mean).
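
For illustration of what 2. leans on: DataLad's high level Python API already crosses subdataset boundaries when driven from the superdataset. A minimal sketch (paths and file content are made up):

    import datalad.api as dl

    # Driving everything from the superdataset: changes that land inside a
    # subdataset get committed there, and the superdataset records the updated
    # subdataset state; the crawl configuration would live only in the super.
    superds = dl.create('hcp')
    subds = dl.create(path='100408', dataset=superds)  # created and registered as a subdataset

    # add content somewhere under the subdataset and save from the top
    (superds.pathobj / '100408' / 'release-notes.txt').write_text('...')
    superds.save(path='100408/release-notes.txt', recursive=True,
                 message='content added across the subdataset boundary')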

yarikoptic commented 4 years ago

FTR: a .csv file with two sample urls, http://www.onerussian.com/tmp/hcp-sample-urls.csv, for an invocation like datalad addurls ../testhcp/urls.csv '{original_url}' '{subject}//{preprocessing}//{filename}', where we add a // dataset boundary within a filename column field, thus providing the desired flexibility for splitting into datasets at arbitrary levels.

For @mih: when producing the table, add "last-modified" and then include the versionId in the url. That would later help to produce a "diff", so an "update" could be done with addurls by just providing a table with only the new/changed entries.
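
A sketch of what producing such a table might look like (column names mirror the sample .csv above; the last-modified value here is a placeholder, and the S3 listing itself is left out):

    import csv

    # Sketch: write an addurls table where the url already pins the S3
    # versionId and a "last-modified" column is kept, so a later run can diff
    # old vs. new tables and feed only new/changed rows to addurls as an "update".
    rows = [
        {
            "original_url": "s3://hcp-openaccess/HCP_900/100408/release-notes/"
                            "Diffusion_unproc.txt?versionId=QFYpcINyEZAQKM5atCIKqcZQ_LRLA607",
            "last-modified": "2015-01-01T00:00:00.000Z",   # placeholder timestamp
            "subject": "100408",
            "preprocessing": "release-notes",
            "filename": "Diffusion_unproc.txt",
        },
    ]
    with open("hcp-sample-urls.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)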

(git)smaug:…sets/datalad/crawl-misc/hcp/101006[hcp900-ontop].datalad/crawl/versions
$> less hcp500.json 
{
  "db_version": 1,
  "version": {
    "last-modified": "2015-01-26T05:01:24.000Z",
    "name": "HCP/101006/MNINonLinear/Results/rfMRI_REST2_LR/rfMRI_REST2_LR_hp2000_clean.nii.gz",
    "version-id": "M2h5DcwaHmJl8nJ08uzmFrB7OFDAM_.n"
  },
  "versions": []
}

$> datalad download-url s3://hcp-openaccess/HCP_900/100408/release-notes/Diffusion_unproc.txt?versionId=QFYpcINyEZAQKM5atCIKqcZQ_LRLA607