yarikoptic opened this issue 4 months ago

I remember that there was some initial prototype for annotation. I wonder what stage that development is in? And how/what could be a way to integrate with annotation of data on DANDI? (e.g. maybe as a GitHub app of some kind to deposit the actual annotations to GitHub per each user)
Hi @yarikoptic, I'm glad you mentioned this!
There are a bunch of possibilities floating around in my mind. I'll just toss a couple out there to see what you think.
Let's start with just adding text notes. I know that structured annotations (labels on units, etc) are more useful, but it's good to start with a simple case. From a user's perspective they should be able to click to add a note to any neurodata object within an NWB file that is loaded into Neurosift. That note could be visible to them and also to any other viewer of that particular NWB file. This could include top-level notes that would apply to the entire file.
It gets interesting when we start to think about (a) where those notes would get stored, and (b) how they get stored there. For (b), let's assume we have figured out the github authentication stuff so that the web app (neurosift) will have the ability to act on GitHub on behalf of the user. I can think of a few possibilities for (a)
Thoughts?
I was thinking of the 3rd option:
> Each user has their own repo for DANDI annotations. It could be called dandi_annotations, and everyone has their own. The annotations get stored as files in that repo.
so that the user has full control etc. over it; we could benefit from however they want to configure it (private, collaborators, etc.). neurosift could provide a GitHub app which gets registered against the account to manipulate that repo. Somewhere we would just store the URL for that repo... maybe even in a cookie, or maybe there is a way to store some settings (like a cookie) within the GitHub account itself for the app, so that as soon as the app is registered there is a way to discover which repo contains the annotations.
Sounds good. If it's a public repo... do we want other viewers of the NWB file in neurosift to be able to see that there are annotations - and perhaps click on them to view? Or do we want a user to only see their own annotations?
good question/point... in principle it should all be up to the user. I guess there might be some configuration file where the user could instruct one way or the other, and we could default (template) it to "announced" (or whatever best describes being shown to others).
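For example, a hypothetical config file in the root of the annotations repo could carry that setting (the file name and field are made up here; "announced" meaning visible to other viewers, vs. "private" for only the author):

```json
{
  "visibility": "announced"
}
```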
How should we name the files in the repo containing the annotations?
Like if I am annotating 000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb
then would there be a file in the repo like this:
dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb/annotations.jsonl
or what?
I was thinking of doing .jsonl (json lines) because then each action would get appended to the file. An action could be {type: "add-annotation", ...}
or {type: "remove-annotation", ...}
or whatever. So it would be an append-only log.
Or maybe there's a better scheme.
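For concreteness, a couple of hypothetical lines in such a log (the field names are purely illustrative, not a spec):

```jsonl
{"type": "add-annotation", "id": "a1", "path": "/units", "user": "someuser", "text": "unit 12 looks like a movement artifact"}
{"type": "remove-annotation", "id": "a1", "user": "someuser"}
```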
here comes a tricky part -- if we want to associate it with the content, it is better to use asset_id or blob_id, since the content under that path could change. But those are too cryptic etc. I see two principled ways:

1. proceed with the path as you propose, but probably also make annotations.json contain information about the asset and blob ids (or checksum), so that upon reload that information could be verified -- maybe the file is no longer the same, and the user could be prompted to verify that the annotations still apply, etc. (or just make it annotations.json to start -- not sure there would be any benefit from being able to just add a line, really)
2. an annotations-map.json which would map from dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb into blob_id-based tree storage of the actual annotations.jsonl. This way, whenever a blob/asset is reused across dandisets (e.g. someone mixes files from different dandisets), we could still quickly find annotations for it among the other dandisets and add to the annotations-map.json.
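To illustrate the second way, a hypothetical annotations-map.json entry (the placeholder blob id and the blobs/ tree layout are made up):

```json
{
  "dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb": {
    "blob_id": "<blob-id>",
    "annotations": "blobs/<blob-id>/annotations.jsonl"
  }
}
```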
Makes sense. How about
dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb/2b9e441b-56bc-4be2-893e-0e02d22d239d/annotations.jsonl
so you can see that the path has the asset ID embedded in it.
And now I'm thinking of doing json-lines (jsonl) for a different reason. Instead of an append-only log, each annotation would be on a separate line. That way you can add-action, delete-action, replace-action, etc., and all these operations would be (a) small deltas and (b) the commit changes would be human-readable. This would be more difficult with a .json file.
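A minimal sketch of that idea (not the actual neurosift code; types and helpers are made up for illustration):

```typescript
// One annotation per line: each operation touches exactly one line of the
// file, so every commit shows up as a small, human-readable diff.
type Annotation = { id: string; [key: string]: unknown };

function upsertAnnotation(jsonl: string, ann: Annotation): string {
  const lines = jsonl.split("\n").filter((l) => l.trim().length > 0);
  const i = lines.findIndex((l) => (JSON.parse(l) as Annotation).id === ann.id);
  if (i >= 0) lines[i] = JSON.stringify(ann); // replace-action: one changed line
  else lines.push(JSON.stringify(ann)); // add-action: one appended line
  return lines.join("\n") + "\n";
}

function deleteAnnotation(jsonl: string, id: string): string {
  const kept = jsonl
    .split("\n")
    .filter((l) => l.trim().length > 0)
    .filter((l) => (JSON.parse(l) as Annotation).id !== id); // delete-action: one removed line
  return kept.length > 0 ? kept.join("\n") + "\n" : "";
}
```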
@yarikoptic
Could you try this out when you have a chance? https://github.com/flatironinstitute/neurosift/blob/main/doc/neurosift_annotations.md
> That way you can add-action, delete-action, replace-action, etc., and all these operations would be (a) small deltas and (b) the commit changes would be human-readable. This would be more difficult with a .json file.
given that lists (and now even dicts) are ordered, I still do not see how jsonl would be beneficial, but that is ok -- you are the doer here, so do it whichever way you see fits best.
> Could you try this out when you have a chance? https://github.com/flatironinstitute/neurosift/blob/main/doc/neurosift_annotations.md
NB dang chatgpt can't spell correctly... need to send a PR
@yarikoptic We'll need to think about how to share these annotations. What if I do some annotations (unit labels, or whatever), and then I want you to see them. What would I do?
yeap... note: right now it is under the dandisets/ prefix, but that might limit it to only the main dandi archive. We also have staging etc., so maybe there could be one more leading level -- an "instance name"; e.g. the main one we call dandi ATM. Our "registry" is here: https://github.com/dandi/dandi-cli/blob/master/dandi/consts.py#L119
Some thoughts:

(... "user":"unknown")

But it would be nice if people could benefit from discovery of the annotations present for any given dandiset; for that we need to either discover annotation repos automatically or have them registered explicitly somewhere. For the main (dandi) instance we have https://github.com/dandisets, e.g. https://github.com/dandisets/000248. neurosift would provide an option/helper to "register" by filing such an issue or PR. If a PR -- it could come from the repository with the annotations, from some special branch.

BTW, the question is not entirely unlike our discussion on the association of notebooks with dandisets with @bendichter and @waxlamp
> yeap... note: right now it is under the dandisets/ prefix, but that might limit it to only the main dandi archive. We also have staging etc., so maybe there could be one more leading level -- an "instance name"; e.g. the main one we call dandi ATM. Our "registry" is here: https://github.com/dandi/dandi-cli/blob/master/dandi/consts.py#L119
So it would be dandisets/dandi/000001/... ?
Or dandi/dandisets/.... ?
Or dandi/dandi/... ?
One consideration. GitHub apps can only make 5000 API requests per hour (per user). So if we had 10 separate repos contributing annotations to a particular NWB file, then whenever the page gets reloaded we might end up with around 20-30 API requests.

What I propose is to have a neurosift-annotations database that mirrors the annotation contents of the github repos. This serves two purposes: (1) it avoids excessive github API calls from the app -- data is retrieved from the database instead, with documents expiring every 60 seconds or so in case the contents get modified by some method other than the neurosift UI; (2) it allows retrieving annotations across all repos that have ever been loaded in neurosift. Note that the github repos would still be the source of truth for all annotations.
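A rough sketch of that read path, under those assumptions (all names here are hypothetical, not the actual implementation):

```typescript
// Serve annotations from a mirror document, refreshing from the GitHub repo
// once the document is older than ~60 seconds, so a page load does not fan
// out into dozens of GitHub API calls. GitHub remains the source of truth.
type MirrorDoc = { content: string; fetchedAt: number };
const TTL_MS = 60_000;

async function getAnnotations(
  db: Map<string, MirrorDoc>, // stand-in for the real database
  key: string, // e.g. "someuser/dandi_annotations:.../annotations.jsonl"
  fetchFromGitHub: (key: string) => Promise<string>
): Promise<string> {
  const doc = db.get(key);
  if (doc && Date.now() - doc.fetchedAt < TTL_MS) {
    return doc.content; // fresh enough, no GitHub call needed
  }
  const content = await fetchFromGitHub(key); // refresh the mirror
  db.set(key, { content, fetchedAt: Date.now() });
  return content;
}
```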
I thought of

> Or dandi/dandisets/.... ?

An aggregating DB or some other way indeed would probably eventually be needed, but I wonder if we should get close to hitting the limits first, which might give better grounds for optimizations? ;-)
> An aggregating DB or some other way indeed would probably eventually be needed, but I wonder if we should get close to hitting the limits first, which might give better grounds for optimizations? ;-)
But I don't know how else to retrieve annotations across all repos. I feel like this is an important step... and we still have the repos as source of truth...
My point is that ATM nobody yet even has the capability to link multiple repositories, and I do not see people getting that many linked. Maybe we will hit the limits even without that somehow, and would need to abandon the "(ab)use github" idea entirely. So why not make it available, advocate for it, and monitor if/when we get close to hitting those limits? FWIW in the cron jobs of https://github.com/con/tinuous we also include the output of

```sh
curl -fsSL -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
```

just to see, if we do error out, how close we were to hitting the limits. I wonder if it is worth querying it once in a while and reacting somehow when getting close, so we would get feedback on that.
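For example, a sketch of what such a periodic check could look like from the web app (the 10% threshold is arbitrary; per GitHub's documentation, requests to /rate_limit do not themselves count against the limit):

```typescript
// Query the GitHub rate-limit endpoint and warn when the remaining budget
// of core API requests is getting low.
async function checkRateLimit(token: string): Promise<void> {
  const res = await fetch("https://api.github.com/rate_limit", {
    headers: { Authorization: `token ${token}` },
  });
  const { resources } = await res.json();
  const { remaining, limit } = resources.core;
  if (remaining < limit * 0.1) {
    console.warn(`GitHub API budget low: ${remaining}/${limit} requests remaining`);
  }
}
```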
But I'm not just talking about the rate limit issue. See my reason (2) above. Without an aggregation database I don't know how to discover other annotations from other repos.
if there is a unique filename/path (e.g. some neurosift-annotations.json with some config or whatnot in the root of the repo) -- we can discover them automatically (per my post above), and then maybe even aggregate them within that "aggregation" repo. That would not be "online", though, but rather a "cron job". That is why I ask: why not make it just an explicit addition of other annotation repos that people could do for the repo?
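A sketch of what that automatic discovery could look like via the GitHub code-search API (the file name follows the suggestion above; the exact query syntax may need tuning, and code search only indexes the default branches of public repos):

```typescript
// Search public GitHub repos for the well-known config file and collect the
// repositories that contain it.
async function findAnnotationRepos(token: string): Promise<string[]> {
  const q = encodeURIComponent("filename:neurosift-annotations.json");
  const res = await fetch(`https://api.github.com/search/code?q=${q}`, {
    headers: { Authorization: `token ${token}` },
  });
  const data = await res.json();
  // each search hit records the repository it was found in
  return data.items.map(
    (item: { repository: { full_name: string } }) => item.repository.full_name
  );
}
```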
Okay I see. That lookup tool looks pretty useful. Does it do a github search for that key file through the github api? I suppose this would only work for public repos.
I think I will move forward with the aggregated database approach since that is going to be easier for me. But I will make sure that the gh repos stay as the source of truth, so we could transition away from this if needed, to use a pure gh solution. I really would like an efficient way of querying all annotations for a particular nwb file... and if it is going to be done in a single request, it requires a query-able database.
> Does it do a github search for that key file through the github api?
yes. here: https://github.com/datalad/datalad-usage-dashboard/blob/master/find_datalad_repos/github.py#L91
NB the results of those discoveries are then used to populate/update http://registry.datalad.org/ which you could navigate/query more interactively
> I suppose this would only work for public repos.
I think so too.
> and if it is going to be done in a single request, it requires a query-able database.
agreed, that would require a single aggregated source. But IMHO the "DB" could be a (JSON or YAML) file for this purpose. But sure thing -- proceed as you see fit and with whatever is easier for you. If anything -- it could be redone later. Cheers and Thanks!
if DB: if there were an API or a way to check/get annotations (and maybe their number) per each dandiset and/or path within it -- we should then look into adding that to the dandiarchive web UI. It would be great to have one more target use case in addition to notebooks, so we finally come up with some "generic way" to integrate with such external resources, hopefully similarly flexible to what we do for external services linkage for individual files.
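e.g., a hypothetical response shape such an API could return (nothing here exists yet; purely to illustrate what the dandiarchive UI could consume):

```json
{
  "dandiset": "000582",
  "total_annotations": 12,
  "by_path": {
    "sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb": 12
  }
}
```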
> if DB: if there were an API or a way to check/get annotations (and maybe their number) per each dandiset and/or path within it -- we should then look into adding that to the dandiarchive web UI. It would be great to have one more target use case in addition to notebooks, so we finally come up with some "generic way" to integrate with such external resources, hopefully similarly flexible to what we do for external services linkage for individual files.
Sounds great!
@yarikoptic I have something working
Here's an example where I have put a top-level note in a public repo. You should be able to see it if you log in to neurosift-annotations.
This has the following properties: