SciCrunch / sparc-curation

code and files for SPARC curation workflows
MIT License

scaffold ingest issue #27

Open tgbugs opened 4 years ago

tgbugs commented 4 years ago

From slack. @Tehsurfer

Jesse Khorasanee

  1. Scaffolds are contained in the derivatives folder of BIDS-structured datasets.
  2. Tom's Blackfynn scraper is told about the files through manifest.xlsx, which is contained in the derivatives folder.
  3. curation-export.ttl contains TEMP:hasScaffold ; and TEMP:hasChart ; drawn from manifest.xlsx
  4. Values of hasScaffold and hasChart are pulled from SciCrunch search API on portal backend.
  5. Using the above sourcepackageID we will retrieve the S3 URI ("uri": "s3://blackfynn-discover-use1/...") from the Blackfynn Discover API. This will be done using the Discover API endpoint: https://api.blackfynn.io/discover/search/files
  6. Files will be downloaded via a proxy using the portal's backend S3 account, similar to the function below:

tgbugs commented 4 years ago

I will add two predicates, TEMP:hasABIScaffold and TEMP:hasABIChart, with domain dataset and range sparc:Resource (N:collection:). I will maintain the internal structure of those folders, probably using TEMP:hasPart (this needs a bit more thought), so that the package ids can be retrieved and used to query the discover endpoint by id.

A potential query pattern in SPARQL.

SELECT ?package_id WHERE {
  ?dataset TEMP:hasABIScaffold ?collection .
  ?collection TEMP:hasPart* ?package .
  ?package TEMP:hasBfId ?package_id .
}

Another pattern in Cypher.

MATCH (dataset)
-[:TEMP:hasABIScaffold]->(collection)
-[:TEMP:hasPart*0..5]->(package)
RETURN package.`TEMP:hasBfId`

I'm not sure whether the final property query will work without trying it. If it does not, the id can be extracted from the URI by parsing the known BF API URI structure.
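
For the fallback, a minimal Python sketch of pulling the id out of a URI. The URI structure itself is not spelled out in this thread, so this assumes only that the id appears somewhere in the URI in the N:<type>:<uuid> form (as in N:collection: above), possibly percent-encoded. The name bf_id_from_uri and the example.org URI are made up for illustration.

```python
import re
from urllib.parse import unquote

# Assumption: Blackfynn ids have the form N:<type>:<uuid>.
BF_ID = re.compile(
    r"N:(?:package|collection|dataset):"
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def bf_id_from_uri(uri):
    """Return the first Blackfynn id embedded in uri, or None if absent."""
    # Decode %3A and friends first so encoded colons still match.
    match = BF_ID.search(unquote(uri))
    return match.group(0) if match else None


# Hypothetical URI, used only to exercise the pattern.
example = "https://example.org/packages/N:package:01234567-89ab-cdef-0123-456789abcdef/files"
package_id = bf_id_from_uri(example)
```

Searching for the id pattern rather than splitting on fixed path positions means the sketch does not depend on where in the URI the id sits.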

tgbugs commented 4 years ago

A thought on one way to do the type here: use a vendor mimetype (abusing the type just a bit by not registering it): inode/vnd.abi.scaffold+directory

tgbugs commented 4 years ago

I suggest putting the mimetype in the 'additional types' column of the manifest. This is what I suggested to MBF as well for providing data about their xml files. I also suggest putting the value directory in the 'file type' column. @ankap2

tgbugs commented 4 years ago

Need to lift the manifest information out of the manifest_records so that the paths show up in the list of paths included per dataset, and then lift the paths that have the inode/vnd.abi.scaffold+directory type to be in the scaffolds array.
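
A rough sketch of that lifting step in Python, not the actual export code. It assumes each manifest record is a dict keyed by manifest column header, that the path lives in a 'filename' column, and that several additional types may be comma-separated in one cell. The function name lift_manifest and the output keys are hypothetical.

```python
SCAFFOLD_TYPE = "inode/vnd.abi.scaffold+directory"


def lift_manifest(manifest_records):
    """Split manifest rows into all listed paths and the scaffold paths."""
    paths, scaffolds = [], []
    for record in manifest_records:
        path = record.get("filename")  # assumed column name
        if not path:
            continue  # rows without a path cannot be lifted
        paths.append(path)
        additional = record.get("additional types") or ""
        types = [t.strip() for t in additional.split(",")]
        if SCAFFOLD_TYPE in types:
            scaffolds.append(path)
    return {"paths": paths, "scaffolds": scaffolds}


# Made-up rows for illustration only.
records = [
    {"filename": "derivative/scaffold", "file type": "directory",
     "additional types": "inode/vnd.abi.scaffold+directory"},
    {"filename": "derivative/notes.txt", "additional types": None},
    {"additional types": "text/plain"},
]
lifted = lift_manifest(records)
```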

Also expecting a scaffold folder internal manifest file that will have additional information needed to flesh out a scaffold object with things like the thumbnail file.

Also lift manifest organ and species up to the dataset metadata level.
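
One way this could look, again as a sketch under assumptions: manifest records are dicts, the columns are literally named 'organ' and 'species', and the dataset metadata is a plain dict. lift_organ_species is a hypothetical name.

```python
def lift_organ_species(manifest_records, dataset_meta):
    """Return a copy of dataset_meta with distinct organ/species values added."""
    lifted = dict(dataset_meta)  # leave the caller's dict untouched
    for key in ("organ", "species"):
        values = sorted({r[key] for r in manifest_records if r.get(key)})
        if values:
            lifted[key] = values
    return lifted


# Made-up rows for illustration only.
rows = [
    {"filename": "derivative/scaffold", "organ": "heart", "species": "rat"},
    {"filename": "derivative/other", "organ": "heart"},
]
meta = lift_organ_species(rows, {"title": "example dataset"})
```

Keeping the values as sorted lists leaves room for a dataset whose manifests mention more than one organ or species.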