yarikoptic opened this issue 4 months ago

I remember that there was some initial prototype for annotation. I wonder what stage that development is in? And how/what could be a way to integrate with annotation of data on DANDI? (e.g. maybe as a GitHub app of some kind to deposit the actual annotations to GitHub per each user)
Hi @yarikoptic, I'm glad you mentioned this!
There are a bunch of possibilities floating around in my mind. I'll just toss a couple out there to see what you think.
Let's start with just adding text notes. I know that structured annotations (labels on units, etc) are more useful, but it's good to start with a simple case. From a user's perspective they should be able to click to add a note to any neurodata object within an NWB file that is loaded into Neurosift. That note could be visible to them and also to any other viewer of that particular NWB file. This could include top-level notes that would apply to the entire file.
It gets interesting when we start to think about (a) where those notes would get stored, and (b) how they get stored there. For (b), let's assume we have figured out the github authentication stuff so that the web app (neurosift) will have the ability to act on GitHub on behalf of the user. I can think of a few possibilities for (a)
Thoughts?
I was thinking of the 3rd option:
> Each user has their own repo for DANDI annotations. It could be called dandi_annotations, and everyone has their own. The annotations get stored as files in that repo.
so that the user has full control etc. over it; we could benefit from however they want to configure it (private, collaborators, etc.). neurosift could provide a GitHub app which gets registered against the account to manipulate that repo. Somewhere we would just store the URL for that repo... maybe even in a cookie, or maybe there is a way to store some settings (like a cookie) within the GitHub account itself for the app, so that as soon as the app is registered there is a way to discover which repo contains the annotations.
Sounds good. If it's a public repo... do we want other viewers of the NWB file in neurosift to be able to see that there are annotations - and perhaps click on them to view? Or do we want a user to only see their own annotations?
good question/point... in principle it should all be up to the user. I guess there might be some configuration file where the user could instruct one way or the other, and we could default (template) it to "announced" (or whatever best describes being shown to others).
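For example, a hypothetical config file in the root of the annotations repo could carry that setting (the file name and field are made up here; "announced" meaning visible to other viewers, vs. "private" for only the author):

```json
{
  "visibility": "announced"
}
```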
How should we name the files in the repo containing the annotations?
Like if I am annotating 000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb
then would there be a file in the repo like this:
dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb/annotations.jsonl
or what?
I was thinking of doing .jsonl (json lines) because then each action would get appended to the file. An action could be {type: "add-annotation", ...}
or {type: "remove-annotation", ...}
or whatever. So it would be an append-only log.
Or maybe there's a better scheme.
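For concreteness, a couple of hypothetical lines in such a log (the field names are purely illustrative, not a spec):

```jsonl
{"type": "add-annotation", "id": "a1", "path": "/units", "user": "someuser", "text": "unit 12 looks like a movement artifact"}
{"type": "remove-annotation", "id": "a1", "user": "someuser"}
```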
here comes a tricky part -- if we want to associate it with the content, it is better to use asset_id or blob_id, since the content under that path could change. But those are too cryptic etc. I see two principled ways:

1. proceed with the path as you propose, but probably also make annotations.json contain information about the asset and blob ids (or checksum), so that upon reload that information could be verified -- maybe the file is no longer the same, and the user could be prompted to verify that the annotations still apply, etc. (or just make it annotations.json to start -- not sure there would be any benefit from being able to just add a line, really)
2. an annotations-map.json which would map from dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb into blob_id-based tree storage of the actual annotations.jsonl. This way, whenever a blob/asset is reused across dandisets (e.g. someone mixes files from different dandisets), we could still quickly find annotations for it among the other dandisets and add to the annotations-map.json.
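To illustrate the second way, a hypothetical annotations-map.json entry (the placeholder blob id and the blobs/ tree layout are made up):

```json
{
  "dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb": {
    "blob_id": "<blob-id>",
    "annotations": "blobs/<blob-id>/annotations.jsonl"
  }
}
```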
Makes sense. How about
dandisets/000582/sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb/2b9e441b-56bc-4be2-893e-0e02d22d239d/annotations.jsonl
so you can see that the path has the asset ID embedded in it.
And now I'm thinking of doing json-lines (jsonl) for a different reason. Instead of an append-only log, each annotation would be on a separate line. That way you can add-action, delete-action, replace-action, etc., and all these operations would be (a) small deltas and (b) the commit changes would be human-readable. This would be more difficult with a .json file.
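A minimal sketch of that idea (not the actual neurosift code; types and helpers are made up for illustration):

```typescript
// One annotation per line: each operation touches exactly one line of the
// file, so every commit shows up as a small, human-readable diff.
type Annotation = { id: string; [key: string]: unknown };

function upsertAnnotation(jsonl: string, ann: Annotation): string {
  const lines = jsonl.split("\n").filter((l) => l.trim().length > 0);
  const i = lines.findIndex((l) => (JSON.parse(l) as Annotation).id === ann.id);
  if (i >= 0) lines[i] = JSON.stringify(ann); // replace-action: one changed line
  else lines.push(JSON.stringify(ann)); // add-action: one appended line
  return lines.join("\n") + "\n";
}

function deleteAnnotation(jsonl: string, id: string): string {
  const kept = jsonl
    .split("\n")
    .filter((l) => l.trim().length > 0)
    .filter((l) => (JSON.parse(l) as Annotation).id !== id); // delete-action: one removed line
  return kept.length > 0 ? kept.join("\n") + "\n" : "";
}
```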
@yarikoptic
Could you try this out when you have a chance? https://github.com/flatironinstitute/neurosift/blob/main/doc/neurosift_annotations.md
> That way you can add-action, delete-action, replace-action, etc., and all these operations would be (a) small deltas and (b) the commit changes would be human-readable. This would be more difficult with a .json file.
given that lists (and now even dicts) are ordered, I still do not see how jsonl would be beneficial, but that is ok -- you are the doer here, so do it whichever way you see fits best.
> Could you try this out when you have a chance? https://github.com/flatironinstitute/neurosift/blob/main/doc/neurosift_annotations.md
NB dang chatgpt can't spell correctly... need to send a PR
@yarikoptic We'll need to think about how to share these annotations. What if I do some annotations (unit labels, or whatever), and then I want you to see them. What would I do?
yeap... note: right now it is under the dandisets/ prefix, but that might limit it to only the main dandi archive. We also have staging etc., so maybe there could be one more leading level -- an "instance name"; e.g. the main one we call dandi ATM. Our "registry" is here: https://github.com/dandi/dandi-cli/blob/master/dandi/consts.py#L119
Some thoughts:

(... "user":"unknown")

But it would be nice if people could benefit from discovery of the annotations present for any given dandiset; for that we need to either discover annotation repos automatically or have them registered explicitly somewhere. For the main (dandi) instance we have https://github.com/dandisets, e.g. https://github.com/dandisets/000248. neurosift would provide an option/helper to "register" by filing such an issue or PR. If a PR -- it could come from the repository with the annotations, from some special branch.

BTW, the question is not entirely unlike our discussion on the association of notebooks with dandisets with @bendichter and @waxlamp
> yeap... note: right now it is under the dandisets/ prefix, but that might limit it to only the main dandi archive. We also have staging etc., so maybe there could be one more leading level -- an "instance name"; e.g. the main one we call dandi ATM. Our "registry" is here: https://github.com/dandi/dandi-cli/blob/master/dandi/consts.py#L119
So it would be dandisets/dandi/000001/... ?
Or dandi/dandisets/.... ?
Or dandi/dandi/... ?
One consideration. GitHub apps can only make 5000 API requests per hour (per user). So if we had 10 separate repos contributing annotations to a particular NWB file, then whenever the page gets reloaded we might end up with around 20-30 API requests.

What I propose is to have a neurosift-annotations database that mirrors the annotation contents of the github repos. This serves two purposes: (1) it avoids excessive github API calls from the app -- data is retrieved from the database instead, with documents expiring every 60 seconds or so in case the contents get modified by some method other than the neurosift UI; (2) it allows retrieving annotations across all repos that have ever been loaded in neurosift. Note that the github repos would still be the source of truth for all annotations.
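A rough sketch of that read path, under those assumptions (all names here are hypothetical, not the actual implementation):

```typescript
// Serve annotations from a mirror document, refreshing from the GitHub repo
// once the document is older than ~60 seconds, so a page load does not fan
// out into dozens of GitHub API calls. GitHub remains the source of truth.
type MirrorDoc = { content: string; fetchedAt: number };
const TTL_MS = 60_000;

async function getAnnotations(
  db: Map<string, MirrorDoc>, // stand-in for the real database
  key: string, // e.g. "someuser/dandi_annotations:.../annotations.jsonl"
  fetchFromGitHub: (key: string) => Promise<string>
): Promise<string> {
  const doc = db.get(key);
  if (doc && Date.now() - doc.fetchedAt < TTL_MS) {
    return doc.content; // fresh enough, no GitHub call needed
  }
  const content = await fetchFromGitHub(key); // refresh the mirror
  db.set(key, { content, fetchedAt: Date.now() });
  return content;
}
```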
I thought of

> Or dandi/dandisets/.... ?

An aggregating DB or some other way indeed would probably eventually be needed, but I wonder if we should get close to hitting the limits first, which might give better grounds for optimizations? ;-)
> An aggregating DB or some other way indeed would probably eventually be needed, but I wonder if we should get close to hitting the limits first, which might give better grounds for optimizations? ;-)
But I don't know how else to retrieve annotations across all repos. I feel like this is an important step... and we still have the repos as source of truth...
My point is that ATM nobody yet even has the capability to link multiple repositories, and I do not see people getting that many linked. Maybe we will hit the limits even without that somehow, and would need to abandon the "(ab)use github" idea entirely. So why not make it available, advocate for it, and monitor if/when we get close to hitting those limits? FWIW in the cron jobs of https://github.com/con/tinuous we also include the output of

```sh
curl -fsSL -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/rate_limit
```

just to see, if we do error out, how close we were to hitting the limits. I wonder if it is worth querying it once in a while and reacting somehow when getting close, so we would get feedback on that.
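For example, a sketch of what such a periodic check could look like from the web app (the 10% threshold is arbitrary; per GitHub's documentation, requests to /rate_limit do not themselves count against the limit):

```typescript
// Query the GitHub rate-limit endpoint and warn when the remaining budget
// of core API requests is getting low.
async function checkRateLimit(token: string): Promise<void> {
  const res = await fetch("https://api.github.com/rate_limit", {
    headers: { Authorization: `token ${token}` },
  });
  const { resources } = await res.json();
  const { remaining, limit } = resources.core;
  if (remaining < limit * 0.1) {
    console.warn(`GitHub API budget low: ${remaining}/${limit} requests remaining`);
  }
}
```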
But I'm not just talking about the rate limit issue. See my reason (2) above. Without an aggregation database I don't know how to discover other annotations from other repos.
if there is a unique filename/path (e.g. some neurosift-annotations.json with some config or whatnot in the root of the repo) -- we can discover them automatically (per my post above), and then maybe even aggregate them within that "aggregation" repo. That would not be "online", though, but rather a "cron job". That is why I ask: why not make it just an explicit addition of other annotation repos that people could do for the repo?
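A sketch of what that automatic discovery could look like via the GitHub code-search API (the file name follows the suggestion above; the exact query syntax may need tuning, and code search only indexes the default branches of public repos):

```typescript
// Search public GitHub repos for the well-known config file and collect the
// repositories that contain it.
async function findAnnotationRepos(token: string): Promise<string[]> {
  const q = encodeURIComponent("filename:neurosift-annotations.json");
  const res = await fetch(`https://api.github.com/search/code?q=${q}`, {
    headers: { Authorization: `token ${token}` },
  });
  const data = await res.json();
  // each search hit records the repository it was found in
  return data.items.map(
    (item: { repository: { full_name: string } }) => item.repository.full_name
  );
}
```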
Okay I see. That lookup tool looks pretty useful. Does it do a github search for that key file through the github api? I suppose this would only work for public repos.
I think I will move forward with the aggregated database approach since that is going to be easier for me. But I will make sure that the gh repos stay as the source of truth, so we could transition away from this if needed, to use a pure gh solution. I really would like an efficient way of querying all annotations for a particular nwb file... and if it is going to be done in a single request, it requires a query-able database.
> Does it do a github search for that key file through the github api?
yes. here: https://github.com/datalad/datalad-usage-dashboard/blob/master/find_datalad_repos/github.py#L91
NB the results of those discoveries are then used to populate/update http://registry.datalad.org/ which you could navigate/query more interactively
> I suppose this would only work for public repos.
I think so too.
> and if it is going to be done in a single request, it requires a query-able database.
agreed, that would require a single aggregated source. But IMHO the "DB" could be a (JSON or YAML) file for this purpose. But sure thing -- proceed as you see fit and with whatever is easier for you. If anything -- it could be redone later. Cheers and Thanks!
if DB: if there were an API or a way to check/get annotations (and maybe their number) per each dandiset and/or path within it -- we should then look into adding that to the dandiarchive web UI. It would be great to have one more target use case in addition to notebooks, so we finally come up with some "generic way" to integrate with such external resources, hopefully similarly flexible to what we do for external services linkage for individual files.
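e.g., a hypothetical response shape such an API could return (nothing here exists yet; purely to illustrate what the dandiarchive UI could consume):

```json
{
  "dandiset": "000582",
  "total_annotations": 12,
  "by_path": {
    "sub-10073/sub-10073_ses-17010302_behavior+ecephys.nwb": 12
  }
}
```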
> if DB: if there were an API or a way to check/get annotations (and maybe their number) per each dandiset and/or path within it -- we should then look into adding that to the dandiarchive web UI. It would be great to have one more target use case in addition to notebooks, so we finally come up with some "generic way" to integrate with such external resources, hopefully similarly flexible to what we do for external services linkage for individual files.
Sounds great!
@yarikoptic I have something working
Here's an example where I have put a top-level note in a public repo. You should be able to see it if you log in to neurosift-annotations.
This has the following properties: