bendichter opened 3 months ago
quick one: I would feel great if such information (a "Cited By ??" badge leading to a listing) was displayed on the DLP. How we make it happen is indeed worth thinking through, and would depend on what service could provide us the "discovery". OpenNeuro has a similar problem and pretty much did "manual labor" to figure out such citations and the reason for them (actual use vs. just referencing for some other reason). I wonder if there is a way to make Google Scholar "index" datasets? I was hoping that Google Dataset Search might collect that info, but looking at a sample OpenNeuro dataset I see no citations. And we can't easily (ab)use the "Cited by" banner of some other service, since a single dandiset could have multiple DOIs for different versions, and something needs to aggregate their citations.
This would be great to include on the DLP as yet another way of demonstrating DANDI usage (in addition to the work-in-progress access stats).
Also great for reporting purposes, I'd imagine.
Could someone check further whether https://support.datacite.org/docs/consuming-citations-and-references could be the one to go after? All our DOIs are minted by DataCite (through the Dartmouth library subscription). I could not resist, so here is some crude script where I used their REST API on a list of our dandisets' most recent versions (so not all versions per dandiset -- to be tuned!).
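Not the original script (which wasn't posted), but a minimal sketch of the shape such a gatherer could take, using the same DataCite events endpoint that appears in the results below; the hard-coded `dandisets` list here is a hypothetical sample:

```python
import json
import os

import requests

os.makedirs("/tmp/citations", exist_ok=True)
# hypothetical input: (identifier, most_recent_version) pairs, e.g. from the DANDI API
dandisets = [("000055", "0.220127.0436"), ("000231", "0.220904.1554")]

for identifier, version in dandisets:
    doi = f"10.48324/dandi.{identifier}/{version}"
    r = requests.get("https://api.datacite.org/events", params={"doi": doi})
    r.raise_for_status()
    events = r.json().get("data", [])
    if events:  # only keep dandisets with at least one citation event
        with open(f"/tmp/citations/{identifier}-{version}.json", "w") as f:
            json.dump(events, f, indent=2)
```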
Looking at the results which are not empty:
```
❯ /usr/bin/find /tmp/citations -size '+1b' -ls
176439 0 drwx------ 2 yoh yoh 2640 Mar 20 13:38 /tmp/citations
176469 4 -rw------- 1 yoh yoh 1575 Mar 20 13:37 /tmp/citations/000055-0.220127.0436.json
176501 4 -rw------- 1 yoh yoh 1575 Mar 20 13:37 /tmp/citations/000231-0.220904.1554.json
176503 4 -rw------- 1 yoh yoh 1346 Mar 20 13:37 /tmp/citations/000235-0.230316.1600.json
176504 4 -rw------- 1 yoh yoh 1346 Mar 20 13:37 /tmp/citations/000236-0.230316.2031.json
176505 4 -rw------- 1 yoh yoh 1587 Mar 20 13:37 /tmp/citations/000237-0.230316.1655.json
176506 4 -rw------- 1 yoh yoh 1346 Mar 20 13:37 /tmp/citations/000238-0.230316.1519.json
176509 4 -rw------- 1 yoh yoh 4005 Mar 20 13:37 /tmp/citations/000252-0.230408.2207.json
176513 4 -rw------- 1 yoh yoh 1336 Mar 20 13:37 /tmp/citations/000301-0.230806.0034.json
176525 8 -rw------- 1 yoh yoh 5199 Mar 20 13:37 /tmp/citations/000469-0.240123.1806.json
176549 4 -rw------- 1 yoh yoh 1354 Mar 20 13:38 /tmp/citations/000623-0.240227.2023.json
176552 4 -rw------- 1 yoh yoh 1330 Mar 20 13:38 /tmp/citations/000630-0.230915.2257.json
176559 12 -rw------- 1 yoh yoh 8375 Mar 20 13:38 /tmp/citations/000673-0.240118.2135.json
176560 4 -rw------- 1 yoh yoh 1614 Mar 20 13:38 /tmp/citations/000678-0.231004.2146.json
176569 4 -rw------- 1 yoh yoh 1455 Mar 20 13:38 /tmp/citations/000934-0.240315.1754.json
```
We get some! 000458 is not in the list :-/ But looking inside at the different relation types, the interesting ones seem to be:
```
❯ for f in *json; do jq . $f | grep -E 'relation-type-id.*(references|is-supplement)' && echo $f; done
"relation-type-id": "references",
000055-0.220127.0436.json
"relation-type-id": "references",
000231-0.220904.1554.json
"relation-type-id": "is-supplemented-by",
000235-0.230316.1600.json
"relation-type-id": "is-supplemented-by",
000236-0.230316.2031.json
"relation-type-id": "is-supplemented-by",
000237-0.230316.1655.json
"relation-type-id": "is-supplemented-by",
000238-0.230316.1519.json
"relation-type-id": "references",
000301-0.230806.0034.json
"relation-type-id": "is-supplement-to",
"relation-type-id": "is-supplemented-by",
000469-0.240123.1806.json
"relation-type-id": "is-supplemented-by",
000623-0.240227.2023.json
"relation-type-id": "references",
000630-0.230915.2257.json
"relation-type-id": "references",
000678-0.231004.2146.json
```
e.g.
```json
{
  "id": "bbb655d0-5d76-481e-b6f1-b2cb2b457380",
  "type": "events",
  "attributes": {
    "subj-id": "https://doi.org/10.1038/s41597-022-01280-y",
    "obj-id": "https://doi.org/10.48324/dandi.000055/0.220127.0436",
    "source-id": "crossref",
    "relation-type-id": "references",
    "total": 1,
    "message-action": "add",
    "source-token": "36c35e23-8757-4a9d-aacf-345e9b7eb50d",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "occurred-at": "2022-04-21T10:45:13.000Z",
    "timestamp": "2022-04-23T03:38:18.173Z"
  },
```
so it points to https://www.nature.com/articles/s41597-022-01280-y, which is the paper announcing that the data was shared on DANDI.
So I think for now we could fairly easily provide a basic "citations gatherer" service to run on cron, e.g. weekly, and produce badges for each dandiset. The only question would be how to integrate it with the archive -- I do not think it should modify the metadata record, since that could later be changed by the author(s).
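For the badge itself, one low-coupling option (just a sketch, not a committed design) would be for the cron job to publish a static JSON file per dandiset in the schema that shields.io endpoint badges consume, so nothing in the archive's metadata records needs to change:

```python
import json
import os

def write_badge(dandiset_id: str, n_citations: int, out_dir: str = "badges") -> None:
    """Write a shields.io endpoint-badge JSON file for one dandiset."""
    os.makedirs(out_dir, exist_ok=True)
    badge = {
        "schemaVersion": 1,  # fixed by the shields.io endpoint schema
        "label": "cited by",
        "message": str(n_citations),
        "color": "blue" if n_citations else "lightgrey",
    }
    with open(os.path.join(out_dir, f"{dandiset_id}.json"), "w") as f:
        json.dump(badge, f)

# the DLP could then embed something like (hypothetical hosting URL):
# https://img.shields.io/endpoint?url=https://example.org/badges/000055.json
```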
Note that this also loosely relates to @magland's annotations -- we might want to post a banner pointing to a list of annotations for the NWBs in the dandiset. It also relates to notebooks etc. -- i.e., how should we build services that provide extra linkages which we do not want to become part of the metadata records?
This is great, @yarikoptic! It looks like this could work well for automatically gathering citation information.
I very much support this idea. This feature would allow us to notify dataset owners when their data is reused, create a data reuse score for researchers like an h-index that can be used in performance evaluations / career advancement, show funders that standards and archives can generate new science and methods, and generally foster a culture of data sharing and reuse.
we might want to be able to manually add citation information for examples like this where high-profile papers use Dandisets but do not cite them in a way that our system will be able to detect.
I found many examples of this when searching for data reuse examples of dandisets (ad hoc listing here). Data are often cited not in the References section but in the Data Availability section, and I think DataCite/CrossRef does not pick those up. (Editors need to do better at addressing this!) I also found DataCite to be more effective than CrossRef at finding examples.
Some general heuristics I used were to search for "dandi", "nwb", "dandiarchive.org", "neurophysiology data available", and related terms on Google Scholar.
I think LLMs are well-suited to help solve this problem, assuming papers can be scraped from PubMed/bioRxiv/elsewhere (maybe using NeuroQuery?). The LLM could 1) detect that a DANDI dataset has been used and 2) distinguish between primary use, secondary use, and mere referencing (maybe it could give a general score that a human can go in and review afterward). A rough sketch is below.
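A rough sketch of step 1 plus a prompt for step 2 (the regexes are heuristics, not an official ID grammar, and the actual model/API call is left open since that choice isn't settled):

```python
import re

# heuristic patterns catching forms like "DANDI:000055", "dandiset 000055",
# "dandi.000055" (DOI suffix), and dandiarchive.org/dandiset/000055 URLs
DANDISET_RE = re.compile(r"(?:DANDI:\s*|dandiset[\s#/]*|dandi\.)(\d{6})", re.IGNORECASE)

def find_candidate_dandisets(fulltext: str) -> set:
    """Step 1: detect dandiset IDs mentioned anywhere in a paper's full text."""
    return set(DANDISET_RE.findall(fulltext))

def classification_prompt(fulltext: str, dandiset_id: str) -> str:
    """Step 2 (sketch): a prompt asking an LLM to score the kind of usage
    for later human review."""
    return (
        f"The following paper mentions DANDI dataset {dandiset_id}. "
        "Classify the relationship as one of: primary use (the authors deposited "
        "the data), secondary use (reanalysis of the data), or mere reference, "
        "and give a confidence score from 0 to 1.\n\n" + fulltext
    )
```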
Some related efforts:
Great info @rly -- thanks!!
Joining efforts on this sounds great! @effigies and @poldrack, are you aware of any work along these lines?
This is on our roadmap, but I do not believe we have started on this. I briefly looked into the DataCite API, but I didn't get as far as Yarik did. @nellh or @rwblair may have, so pinging them.
We have previously tasked @jbwexler with finding reuses and citations. I believe this was mostly scraping search engine results, but he might have thoughts here.
Agreed this would be a great feature to add for both DANDI and ON. I unfortunately don't have too much to add. My approach was basically a semi-automated version of:
1) Search Google Scholar for 'OpenNeuro'.
2) Within the text of the results, find any word matching 'ds' followed by 6 numbers.
3) Read a few sentences before and after each match to see whether the word actually refers to an ON dataset, and whether it was actually reused or just mentioned for some other reason. Occasionally it was necessary to skim the paper as a whole to get the context.
The first two steps could of course be easily automated (see the sketch below). If we skip the third to avoid the labor cost, that would leave us with a list of "papers that might mention this dataset". That seems potentially useful, but probably leaves too much room for error for something akin to an h-index.
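A sketch of automating step 2 for the ON case (scraping Google Scholar itself is glossed over here since programmatic access to it is notoriously restricted; assume the paper text has already been retrieved):

```python
import re

DS_RE = re.compile(r"\bds\d{6}\b", re.IGNORECASE)

def candidate_mentions(text: str, context_chars: int = 300) -> list:
    """Find ds###### tokens and keep surrounding context for step-3 review."""
    mentions = []
    for m in DS_RE.finditer(text):
        start = max(0, m.start() - context_chars)
        end = min(len(text), m.end() + context_chars)
        mentions.append((m.group(0).lower(), text[start:end]))
    return mentions
```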
I like the LLM idea for doing step 3. It would be fun to try to get that working.
+1 for working with DataCite. There is emerging work happening related to the Global Data Citation Corpus that could be helpful here, and your use case for Dandisets might be an interesting test of what they already have. Because only citations that appear in the References section of articles are counted in the Crossref/DataCite shared EventsDB, they are working with CZI on applying AI (named entity recognition) to the scholarly record (PubMed?) to pull unstructured mentions into something that makes sense. There is also separate work happening within the RRID ecosystem that you might want to consider.

Following citation to its logical conclusion, it's only valuable if we can find out later who cited what, etc., so there are two knowledge graph projects to check for fit: the DataCite PID Graph and OpenAlex.

So, two things: 1) if you want to cite data in your publications correctly, please put it in the References section of your paper :) 2) because very few people actually do that, the Global Data Citation Corpus (DataCite) is probably our best shot right now. (Contact: Iratxe Puebla)
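For anyone wanting to kick the tires on OpenAlex, a minimal sketch of looking up who cites a given DOI; whether dataset DOIs like ours are resolvable there is exactly what would need testing (an unindexed DOI will 404 here):

```python
import requests

def openalex_citers(doi: str) -> list:
    """Look up a DOI in OpenAlex and return the DOIs of works that cite it."""
    work = requests.get(f"https://api.openalex.org/works/https://doi.org/{doi}")
    work.raise_for_status()
    openalex_id = work.json()["id"].rsplit("/", 1)[-1]  # e.g. "W2741809807"
    citers = requests.get(
        "https://api.openalex.org/works", params={"filter": f"cites:{openalex_id}"}
    )
    citers.raise_for_status()
    return [w.get("doi") for w in citers.json()["results"]]
```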
I suppose I can take up the baton here. In Python, for mere mortals:
```python
import pandas as pd
import requests
from dateutil import parser
from tqdm import tqdm

# get all published dandiset IDs
dandi_api_url = "https://api.dandiarchive.org/api/dandisets/"
params = {
    "page_size": 1000,
    "empty": "false",
    "draft": "false",
    "embargoed": "false",
}
headers = {"accept": "application/json"}

# fetch the list of published dandisets
response = requests.get(dandi_api_url, headers=headers, params=params)
response.raise_for_status()  # check for HTTP request errors
published_dandisets = response.json()
published_dandisets_ids = [x["identifier"] for x in published_dandisets["results"]]

# get all published (non-draft) versions of each dandiset
all_versions = {}
for id_ in tqdm(published_dandisets_ids, desc="get dandiset versions"):
    versions_url = f"https://api.dandiarchive.org/api/dandisets/{id_}/versions"
    response = requests.get(versions_url, headers=headers, params={"page_size": 1000})
    response.raise_for_status()
    versions = response.json()
    all_versions[id_] = [x["version"] for x in versions["results"] if x["version"] != "draft"]

# iterate over each version of each dandiset and fetch citation events from DataCite
# (note: only the first page of events per DOI is fetched here)
results = []
for identifier, versions in tqdm(all_versions.items(), desc="get citations"):
    for version in versions:
        datacite_url = f"https://api.datacite.org/events?doi=10.48324/dandi.{identifier}/{version}"
        citation_response = requests.get(datacite_url)
        citation_response.raise_for_status()
        citation_data = citation_response.json()
        for x in citation_data["data"]:
            if "dandi" in x["attributes"]["subj-id"]:
                continue  # exclude citations coming from other dandisets
            results.append(
                dict(
                    dandiset_id=identifier,
                    doi=x["attributes"]["subj-id"],
                    timestamp=parser.parse(x["attributes"]["timestamp"]),
                )
            )

df = pd.DataFrame(results)
df
```
|   | dandiset_id | doi | timestamp |
|---|---|---|---|
| 0 | 000055 | https://doi.org/10.1038/s41597-022-01280-y | 2022-04-23 03:38:18.173000+00:00 |
| 1 | 000207 | https://doi.org/10.7554/elife.85786.3 | 2023-10-27 08:55:39.876000+00:00 |
| 2 | 000231 | https://doi.org/10.1038/s41597-022-01728-1 | 2022-10-14 08:55:30.912000+00:00 |
| 3 | 000235 | https://doi.org/10.7554/elife.83289 | 2023-10-26 08:55:31.168000+00:00 |
| 4 | 000236 | https://doi.org/10.7554/elife.83289 | 2023-10-26 08:55:31.206000+00:00 |
| 5 | 000237 | https://doi.org/10.7554/elife.83289 | 2023-10-27 08:55:07.770000+00:00 |
| 6 | 000238 | https://doi.org/10.7554/elife.83289 | 2023-10-26 08:55:31.253000+00:00 |
| 7 | 000301 | https://doi.org/10.1038/s41467-023-41755-z | 2023-10-09 08:55:20.707000+00:00 |
| 8 | 000630 | https://doi.org/10.1126/science.adf0805 | 2023-10-27 08:55:18.773000+00:00 |
| 9 | 000678 | https://doi.org/10.5281/zenodo.8408660 | 2023-12-08 22:01:36.955000+00:00 |
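From here, a per-dandiset "Cited by" count for a badge is a one-liner (a sketch; `nunique` so that a paper citing several versions of the same dandiset is counted once):

```python
# per-dandiset "Cited by" counts, deduplicating a paper that cites several versions
cited_by = df.groupby("dandiset_id")["doi"].nunique().sort_values(ascending=False)
print(cited_by)
```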
This is great! Though many people don't actually cite the DOI, which is why @jbwexler had to resort to the manual approach.
Yes, I also see a lot of references to unpublished dandisets that don't have DOIs so they don't show up here. Still, it's nice to get what we can from the fully automated approach. This might work better for ON.
definitely!
A systematic (forward-looking) solution IMHO would be to provide DOIs for draft dandisets too. Related:
I found a paper (https://doi.org/10.1016/j.neuron.2023.08.005) that cites Dandiset 000458 (https://doi.org/10.48324/dandi.000458/0.230317.0039). When I went to the Dandiset landing page, I found that there are some papers associated with this Dandiset, but not the paper that I found. This is because the paper is a secondary use of this Dandiset and did not exist when the Dandiset was published.
I think we are missing a huge opportunity here. If we want to influence the behavior of scientists to reuse data, one of the best ways to do that is to educate them about others that are already doing this behavior. In doing so, we will establish that this is a high-quality dataset worth analyzing, demonstrate that you can achieve publications through reuse of data, and advance social norms around using data. All the better if the publications are from high-impact journals like Neuron. Therefore, I think in some way indicating papers that use and cite a Dandiset should be a high priority. While GitHub-like stars, page views, and download stats are all very important, IMO this metric is even more important than all of those.
I think this should really go on the DLP, and should not be under the control of the Dandiset owner. Ideally, it would follow UX patterns the user is already familiar with. For example, every scientist is familiar with the Google Scholar "Cited By [x]" link.
I think the most straightforward UX solution would be to add a button to the DLP that says "Cited by [#]". That button would lead to a modal window containing a list of papers that cite this Dandiset, formatted similarly to how this is done in Google Scholar.
This may not be ideal because it does not make the citation metrics as prominent as I would like, but it would be a massive improvement over not having this metric on the DLP at all.
Then the question is: how do we gather this information? It looks like this can be done with Crossref (https://www.crossref.org/documentation/cited-by/retrieve-citations/), which would require credentials, and I don't know whether Crossref even tracks citations of DANDI DOIs.
OpenCitations provides a service for this that works on Science papers, e.g.

http://opencitations.net/index/coci/api/v1/citations/10.1126/science.abf4588

but not on Dandisets:

http://opencitations.net/index/coci/api/v1/citations/10.48324/dandi.000458/0.230317.0039

returns an empty list. It is possible the citations have just not been indexed yet. This is hard to test because a lot of publications like https://www.nature.com/articles/s41586-023-06031-6 do not properly cite the Dandiset DOI. This is another issue: we might want to be able to manually add citation information for examples like this, where high-profile papers use Dandisets but do not cite them in a way that our system will be able to detect.

Once we have the DOIs of the citing papers, I can confirm that Crossref is a great tool for gathering information about a specific publication. https://api.crossref.org/works/{doi} returns all the information we would need, e.g.
https://api.crossref.org/works/10.1126/science.abf4588
```python
{'DOI': '10.1126/science.abf4588',
 'ISSN': ['0036-8075', '1095-9203'],
 'URL': 'http://dx.doi.org/10.1126/science.abf4588',
 'abstract': '...',
 ...}
```
Beyond putting this on the DLP, this is a very important metric for us to track. Looking at publications over the last year or so, I am seeing examples of high-profile papers that use Dandisets that we don't even know about, and this is quickly getting to a point where we need automated tools to track this.
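Returning to the Crossref lookup above, a sketch of extracting just the fields a "cited by" modal would need to display (key names follow the Crossref works response shown above):

```python
import requests

def citing_paper_info(doi: str) -> dict:
    """Fetch the display fields for one citing paper from the Crossref works API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}")
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": (msg.get("title") or [""])[0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in msg.get("author", [])
        ],
        "journal": (msg.get("container-title") or [""])[0],
        "year": msg.get("issued", {}).get("date-parts", [[None]])[0][0],
    }

print(citing_paper_info("10.1126/science.abf4588"))
```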