datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Generic framework for crawling data providers with versions #22

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

There are two aspects:

Versioning

ATM some crawling pipelines do care about versioning datasets, e.g.

Pipelines reuse

The simple_with_archives pipeline is already reused one way or another in other pipelines (simple_s3, stanford_lib) to provide a multi-branch setup for implementing archives processing (extraction, cleanup, etc.) via multiple branches:

addurls

Datalad "core" now has addurls which provides quite an extended/flexible implementation to populate a dataset (including git annex metadata) from a set of records, e.g. as typically provided in .tsv or .json file. But it doesn't provide crawler's functionality of being able to monitor remote urls (or those entire records) for changes

So in principle, based on those experiences, and having additional desires in mind (being able to run multiple pipelines in the same dataset, maybe in different branches), it seems worth producing some "ultimate" pipeline which would rely on obtaining records with source urls, versions, etc., and perform all the necessary steps to handle versioning. Specialized pipelines would then only implement provider-specific logic, feeding that pipeline with those records.
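
Purely to illustrate the idea (nothing below exists as such), the provider-specific part could boil down to emitting records along these lines, with the generic pipeline handling the versioning and addurl'ing:

    # Hypothetical shape of the records a provider-specific pipeline would
    # yield; all field names here are made up for illustration.
    records = [
        {
            "url": "https://example.com/data/sub-01_T1w.nii.gz",
            "path": "sub-01/anat/sub-01_T1w.nii.gz",
            "version-id": "abc123",                    # provider-specific version token
            "last-modified": "2019-01-01T00:00:00Z",   # used to detect updates
        },
        # ... one record per file/url ...
    ]
    # The generic pipeline would compare such records against the previously
    # crawled state, addurl new/changed entries, and store the version info.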

It might be worth approaching this after (or while working on) a solution for #20, which would provide yet another "versioned" provider, to see how such a pipeline could generalize across all openfmri/s3/figshare cases.

yarikoptic commented 4 years ago

I think an additional item for the list is handling of subdatasets, so I am dumping some "thinking out loud" in here.

Subdatasets

ATM crawlers such as openfmri, crcns, etc. rely on a dedicated function which returns a dedicated pipeline for the top level superdataset; that pipeline creates subdatasets while populating them with per-subdataset crawl configuration. The simple_s3 pipeline can be instructed to create a subdataset for each subdirectory it finds at "this level", which is what we would want e.g. to separate each subject into an independent subdataset for HCP, or what we already do for some crawled INDI datasets with a subdataset per site. But it is inflexible - we cannot prescribe generating subdatasets for some directories but not for others, or say to do that for up to X levels of subdirectories.

addurls in datalad-core also has functionality to establish subdataset boundaries by using // in the path specification. So in the case of HCP it would have been something like {subject}//.... That is quite nice in its flexibility - a single prescription establishes subdatasets across multiple levels - but it has no way to make them conditional either.
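
A hedged sketch of that // boundary via the Python API (again with a made-up table and columns):

    # Sketch of the '//' subdataset-boundary feature of addurls; 'urls.tsv'
    # and the {subject}/{filename} columns are placeholders.
    import datalad.api as dl

    dl.addurls(
        urlfile='urls.tsv',
        urlformat='{original_url}',
        filenameformat='{subject}//{filename}',  # '//' makes each subject its own subdataset
        dataset='.',
    )
    # The boundary applies uniformly to every record -- there is no way to make
    # it conditional (e.g. "subdataset only for some subjects").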

In general it seems that it would be nice to be able to specify more flexibly:

  1. whether a given subdirectory/path should become a subdataset - probably via a regular expression on the target path
  2. what to do with it - probably via a list of procedures to be run, and the crawling configuration (might be as easy as "inherit"?) to be saved.

E.g. for HCP, if we decide to split into subdatasets at the subject level, and then also make all subdirectories which do not match release-notes subdatasets, expressing the idea in YAML for now could look something like:

subdatasets:
 - path_regex: "^[0-9]{6}"
   crawler_config: inherit
   procedures: 
   - cfg_text2git
 - full_path_regex: ".*/[0-9]{6}/(?!release-notes)"
   crawler_config: inherit
   procedures: 
   - hcp_subject_data_dataset
   - name: cfg_metadatatypes
     args:
     - dicom
     - nifti1
     - xmp

to be specified at the top level dataset, so it could be inherited and used in subdatasets as is. But maybe that would be undesired, e.g. so that "^[0-9]{6}" doesn't match some data directory named that way within subdatasets? I introduce matching by full_path_regex since a subdataset wouldn't know its super's name (or we could introduce some superds_path to avoid matching on really full paths). Sure thing, we could also just rely on procedures to establish per-subdataset crawling configuration and then not inherit the same one from the top level, but I wonder if we could achieve "one crawl config sufficient to describe it all".
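
Roughly, the matching side of such a spec could boil down to something like this (a hypothetical sketch; none of these helpers exist in the crawler):

    import re

    # Hypothetical evaluation of the "subdatasets" spec above: decide whether a
    # crawled path should become a subdataset and with which entry/procedures.
    # superds_relpath is this dataset's path relative to the top-level
    # superdataset, so full_path_regex can work without the subdataset knowing
    # the super's name.
    def match_subdataset_spec(spec, relpath, superds_relpath=""):
        full_path = f"{superds_relpath}/{relpath}".lstrip("/")
        for entry in spec.get("subdatasets", []):
            if "path_regex" in entry and re.match(entry["path_regex"], relpath):
                return entry
            if "full_path_regex" in entry and re.match(entry["full_path_regex"], full_path):
                return entry
        return None  # not a subdataset -- treat as a regular path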

In the above, the hcp_subject_data_dataset procedure is just for demo purposes - not sure what custom steps should be done in there if we allow flexible specification of parametrized procedures to be run. But I just wanted to demonstrate that we should allow mixing procedures with and without parameters.

With such a setup we could also arrange for the preprocessed per-task folders in HCP to be subdatasets with something like

 - full_path_regex: ".*/[0-9]{6}/[^/]+/Results/[^/]+"
 ...

(potentially just mixing it into the regex for the parent dataset)

A somewhat alternative organization of the specification could be to orient it around "paths", with the default action being "addurl" (what happens now), while allowing for others ("subdataset", etc.):

paths:
 - path_regex: "^[0-9]{6}$"
   action: subdataset
   procedures:
   - inherit_crawler_config 
   - cfg_text2git
 - full_path_regex: ".*/[0-9]{6}/(?!release-notes)$"
   action: subdataset
   procedures: 
   - inherit_crawler_config
   - hcp_subject_data_dataset
   - name: cfg_metadatatypes
     args:
     - dicom
     - nifti1
     - xmp

so we could use the same specification to provide alternative actions, such as "skip" (which some pipelines currently allow for):

paths:
 - path: .xdlm$
   action: skip
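
A hypothetical sketch of how such a "paths" spec could be dispatched on (the first matching entry wins; "addurl" is the default):

    import re

    # Hypothetical dispatch over the "paths" specification: the first matching
    # entry decides the action ("addurl" by default, or "subdataset", "skip", ...).
    def action_for(spec, relpath, full_path):
        for entry in spec.get("paths", []):
            pattern = entry.get("full_path_regex") or entry.get("path_regex") or entry.get("path")
            target = full_path if "full_path_regex" in entry else relpath
            if pattern and re.search(pattern, target):
                return entry.get("action", "addurl"), entry.get("procedures", [])
        return "addurl", []
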
yarikoptic commented 4 years ago

attn @mih et al (@kyleam @bpoldrack @TobiasKadelka) who might be interested:

While trying to come up with a temporary hack for the current simple_s3 pipeline for more flexible decision making on creating subdatasets (for HCP), I realized that we have to decide between two possible ways to go:

  1. (current setup) subdatasets carry their own crawling configuration, but then superdataset crawling might uselessly traverse the entire tree just to have the paths which belong to subdatasets ignored (since crawling of subdatasets should be done within the subdatasets).

  2. the crawler is finally RFed to use DataLad's high level API, which seamlessly crosses dataset boundaries; the superdataset is the one which carries the actually used crawling configuration, and operations are done on the entire tree of subdatasets.

I am leaning toward 2. Now a bit more on each of those:

1

The benefit of 1. is the ability to later take a collection of subdatasets and recrawl them independently. Something yet to be attempted/used in the wild.

With the current simple_s3, there are two modes depending on the directory setting:

Cons: Unfortunately, in general, with the overall design of the "url producer -> [more nodes] -> annexificator (pretty much addurl)" pipeline, there is no easy way to tell the "url producer" (e.g. the S3 bucket listing procedure) not to go into some "folders", since within [more nodes] some path renames might happen, and thus early decision making based on paths to submodules (which might have been established in the initial run) wouldn't "generalize" upon rerun. We could provide some ad-hoc option "ignore paths within submodules", but IMHO that would not be a proper solution. So the only way to make it work is via an often expensive traversal of the entire tree while crawling the superdataset, while effectively ignoring all paths which lead into subdatasets.
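
A toy sketch (not the actual crawler node API) of why an early "do not descend" decision in the url producer is fragile - the final paths are only known after the renaming nodes have run:

    # Toy illustration only -- not datalad-crawler's node API.  Records flow
    # "url producer -> [more nodes] -> annexificator"; a rename in the middle
    # means the producer cannot reliably skip paths that end up inside submodules.
    def s3_listing():                            # "url producer"
        yield {"url": "s3://bucket/raw/100408/t1.nii.gz",
               "filename": "raw/100408/t1.nii.gz"}

    def strip_raw(record):                       # one of "[more nodes]"
        record["filename"] = record["filename"].replace("raw/", "", 1)
        return record

    def in_submodule(path):                      # only meaningful on the *final* path
        return path.startswith("100408/")

    for rec in s3_listing():
        rec = strip_raw(rec)
        if in_submodule(rec["filename"]):
            continue                             # crawled within the subdataset instead
        # ... the annexificator (addurl) would handle the record here ...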

Maybe there is some overall crawler pipelining solution possible (to instantiate crawlers within subdatasets where a path goes into a subdataset, and somehow feed them those records), but it would fall into the same trap as outlined below -- crawling individual subdatasets would potentially be different from crawling from the superdataset.

2

Going forward, I kinda like this way better since it would

Cons:

2 with config for 1 (mneh)

We could still populate crawler configuration within subdatasets, but it would lack "versioning" information (although maybe there is a workaround via storing versioning information updates in each subdataset along the path upon each new file being added/updated). Even if we populate all the pieces needed for recrawling, since it would not actually be the crawling configuration used originally, it would be fragile etc., and re-crawling subdatasets individually would probably fail and/or result in a full recrawl of the subdataset.

2 would still allow for 1

It should still be possible, where really desired, to not recurse and to stop at the subdirectory/subdataset boundary (with the simple_s3 pipeline, I mean).
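
For illustration of what 2. leans on: DataLad's high level Python API already crosses subdataset boundaries when driven from the superdataset. A minimal sketch (paths and file content are made up):

    import datalad.api as dl

    # Driving everything from the superdataset: changes that land inside a
    # subdataset get committed there, and the superdataset records the updated
    # subdataset state; the crawl configuration would live only in the super.
    superds = dl.create('hcp')
    subds = dl.create(path='100408', dataset=superds)  # created and registered as a subdataset

    # add content somewhere under the subdataset and save from the top
    (superds.pathobj / '100408' / 'release-notes.txt').write_text('...')
    superds.save(path='100408/release-notes.txt', recursive=True,
                 message='content added across the subdataset boundary')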

yarikoptic commented 4 years ago

FTR: a .csv file with two sample urls, http://www.onerussian.com/tmp/hcp-sample-urls.csv, for an invocation like datalad addurls ../testhcp/urls.csv '{original_url}' '{subject}//{preprocessing}//{filename}', where we add a // dataset boundary within a filename column field, thus providing the desired flexibility for splitting into datasets at arbitrary levels.

For @mih: when producing the table, add "last-modified" and then include the versionId in the url. That would later help to produce a "diff", so an "update" could be done with addurls by just providing a table with only the new/changed entries.
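
A sketch of what producing such a table might look like (column names mirror the sample .csv above; the last-modified value here is a placeholder, and the S3 listing itself is left out):

    import csv

    # Sketch: write an addurls table where the url already pins the S3
    # versionId and a "last-modified" column is kept, so a later run can diff
    # old vs. new tables and feed only new/changed rows to addurls as an "update".
    rows = [
        {
            "original_url": "s3://hcp-openaccess/HCP_900/100408/release-notes/"
                            "Diffusion_unproc.txt?versionId=QFYpcINyEZAQKM5atCIKqcZQ_LRLA607",
            "last-modified": "2015-01-01T00:00:00.000Z",   # placeholder timestamp
            "subject": "100408",
            "preprocessing": "release-notes",
            "filename": "Diffusion_unproc.txt",
        },
    ]
    with open("hcp-sample-urls.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)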

(git)smaug:…sets/datalad/crawl-misc/hcp/101006[hcp900-ontop].datalad/crawl/versions
$> less hcp500.json 
{
  "db_version": 1,
  "version": {
    "last-modified": "2015-01-26T05:01:24.000Z",
    "name": "HCP/101006/MNINonLinear/Results/rfMRI_REST2_LR/rfMRI_REST2_LR_hp2000_clean.nii.gz",
    "version-id": "M2h5DcwaHmJl8nJ08uzmFrB7OFDAM_.n"
  },
  "versions": []
}

$> datalad download-url s3://hcp-openaccess/HCP_900/100408/release-notes/Diffusion_unproc.txt?versionId=QFYpcINyEZAQKM5atCIKqcZQ_LRLA607