flatironinstitute / dendro-old

Analyze neuroscience data in the cloud
https://flatironinstitute.github.io/dendro-docs/
Apache License 2.0

api changes needed for using SI json files for processed recordings #130

Closed magland closed 7 months ago

magland commented 7 months ago

NOTE: Existing apps will not break, but newly built apps will use the new API, which differs somewhat in how processing jobs obtain download URLs from the central API.

Motivation:

As discussed with @luiztauffer and @alejoe91, we are moving toward using SpikeInterface (SI) JSON files as the outputs of preprocessing and motion correction and as the inputs to spike sorting. This requires some adjustments to the dendro API.

Some excerpts from the Slack conversation:

I do like Alessio's idea of having the output of preprocessing be a SI json file because it solves a lot of issues we were struggling with. In particular it means we don't need to store the very large preprocessed file anywhere. A downside is that the lazy preprocessing will need to take place within the spike sorting processor, but I think that's going to be short compared to the spike sorting itself. It's true we may run into issues with parallel reading of the remote file, but hopefully not, and we'll deal with that if we need to.

I'd prefer not to have a scratch space that processors can use because thus far dendro jobs are containerized without any access to data from other jobs. I think it's wise to keep it that way if possible. It also allows jobs to be run anywhere, not depending on a particular environment.

In order to make this work, the SI json file needs to have embedded in it a reference to the remote NWB file. A simple URL doesn't work here because the data may be embargoed. Or maybe it is on an S3 bucket that requires special authorization that only the central dendro system has credentials for. So I thought about this on some long walks through the city :slightly_smiling_face:, and I settled on having a URI of the form dendro:?file_id=[unique-id-of-file-on-dendro]... That gets embedded in the JSON file as a kwarg to the NWBRecordingExtractor. When the processor job needs a download URL for that file, it requests it from the dendro api, providing its job private key. The new capability is that we allow jobs to get read access to any file within the project, not just its direct inputs which was the way it was before. Thus the JSON file can contain references to other files within the project. And this way we don't need to pass the original NWB file as another input to spike sorting or any other downstream job.
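To make the URI scheme concrete, here is a minimal Python sketch of building and parsing such a URI and of finding these references inside a nested JSON structure. The function names, the `class` value, and the `file_path` kwarg in the example dictionary are hypothetical and not taken from the dendro or SpikeInterface code. Only the `file_id` parameter is handled explicitly, because the rest of the URI is elided above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

DENDRO_SCHEME = "dendro"


def make_dendro_uri(file_id: str) -> str:
    """Build a URI of the form dendro:?file_id=<id> (hypothetical helper)."""
    query = urlencode({"file_id": file_id})
    return f"{DENDRO_SCHEME}:?{query}"


def parse_dendro_uri(uri: str) -> dict:
    """Return the query parameters of a dendro URI as a flat dict."""
    parts = urlsplit(uri)
    if parts.scheme != DENDRO_SCHEME:
        raise ValueError(f"Not a dendro URI: {uri!r}")
    # parse_qs returns a list per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    if "file_id" not in params:
        raise ValueError(f"dendro URI has no file_id: {uri!r}")
    return params


def find_dendro_uris(obj) -> list:
    """Recursively collect every dendro URI string in a JSON-like object."""
    if isinstance(obj, str):
        return [obj] if obj.startswith(f"{DENDRO_SCHEME}:?") else []
    if isinstance(obj, dict):
        return [uri for value in obj.values() for uri in find_dendro_uris(value)]
    if isinstance(obj, list):
        return [uri for value in obj for uri in find_dendro_uris(value)]
    return []


# Illustrative stand-in for an SI JSON file; the layout is an assumption.
recording_json = {
    "class": "CustomNwbRecordingExtractor",  # hypothetical class name
    "kwargs": {"file_path": make_dendro_uri("abc123")},  # hypothetical kwarg
}
```

A job could call `find_dendro_uris(recording_json)` to list the project files it needs, then use `parse_dendro_uri` on each to get the `file_id` to send to the dendro API along with its job private key.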

Until we can add this support to SI (and we'll need to think about whether dendro-specific support is appropriate for SI), we'll need to use a custom NWB Recording Extractor that can handle this. It also needs to be able to auto-renew the download URL during processing because the job may take longer than an hour. So these are two additional capabilities the custom extractor needs to have.
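The auto-renewal requirement can be sketched as a small wrapper that caches a download URL and fetches a fresh one shortly before the cached one is expected to expire. This is only an illustration under stated assumptions: the class name, the one-hour lifetime, the five-minute safety margin, and the `fetch_url` callable (which in practice would request a URL from the dendro API using the job private key) are all hypothetical.

```python
import time
from typing import Callable, Optional


class RenewableDownloadUrl:
    """Cache a download URL and renew it before it is expected to expire."""

    def __init__(
        self,
        fetch_url: Callable[[], str],  # hypothetical: asks the central API for a URL
        lifetime_sec: float = 3600.0,  # assumed URL lifetime (one hour)
        margin_sec: float = 300.0,  # renew this long before the assumed expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_url = fetch_url
        self._lifetime_sec = lifetime_sec
        self._margin_sec = margin_sec
        self._clock = clock
        self._url: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return a URL that should still be valid, renewing it if needed."""
        now = self._clock()
        age = now - self._fetched_at
        if self._url is None or age >= self._lifetime_sec - self._margin_sec:
            self._url = self._fetch_url()
            self._fetched_at = now
        return self._url
```

A custom extractor could call `get()` immediately before each remote read, so a job that runs longer than the URL lifetime keeps working without the caller tracking expiry itself.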

The crucial new parts are the dendro URI of the form dendro:?file_id=[unique-id-of-file-on-dendro]... and the change in access described above: "When the processor job needs a download URL for that file, it requests it from the dendro api, providing its job private key. The new capability is that we allow jobs to get read access to any file within the project, not just its direct inputs which was the way it was before."

This PR makes the needed adjustments.