bio-guoda / preston

a biodiversity dataset tracker
MIT License

integrate with DataVerse #269

Closed jhpoelen closed 7 months ago

jhpoelen commented 7 months ago

related to https://github.com/IQSS/dataverse/issues/3436

jhpoelen commented 7 months ago

example -

https://dataverse.harvard.edu/dataverse/harvard?q=fileMd5:48a76222cf5c06cb4f2d8f75cc0caa63

jhpoelen commented 7 months ago

see also https://github.com/IQSS/dataverse/issues/2038#issuecomment-145140785

jhpoelen commented 7 months ago

curl -L "https://dataverse.harvard.edu/api/search?q=fileMd5:48a76222cf5c06cb4f2d8f75cc0caa63"\
 | jq .

yields:

{
  "status": "OK",
  "data": {
    "q": "fileMd5:48a76222cf5c06cb4f2d8f75cc0caa63",
    "total_count": 1,
    "start": 0,
    "spelling_alternatives": {},
    "items": [
      {
        "name": "Auter Fine PB Replication Code.txt",
        "type": "file",
        "url": "https://dataverse.harvard.edu/api/access/datafile/2829688",
        "file_id": "2829688",
        "published_at": "2016-05-18T17:57:24Z",
        "file_type": "Plain Text",
        "file_content_type": "text/plain",
        "size_in_bytes": 2065,
        "md5": "48a76222cf5c06cb4f2d8f75cc0caa63",
        "checksum": {
          "type": "MD5",
          "value": "48a76222cf5c06cb4f2d8f75cc0caa63"
        },
        "file_persistent_id": "doi:10.7910/DVN/TGKZ2T/Y1ZZXT",
        "dataset_name": "Replication Data for: \"Negative Campaigning in the Social Media Age: Attack Advertising on Facebook\"",
        "dataset_id": "2829686",
        "dataset_persistent_id": "doi:10.7910/DVN/TGKZ2T",
        "dataset_citation": "Auter, Zachary, 2016, \"Replication Data for: \"Negative Campaigning in the Social Media Age: Attack Advertising on Facebook\"\", https://doi.org/10.7910/DVN/TGKZ2T, Harvard Dataverse, V1, UNF:6:LSx44nECMNQun46yUutUuA== [fileUNF]"
      }
    ],
    "count_in_response": 1
  }
}
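
To go from this response to the file itself, the `url` of the matching item can be pulled out with jq. A minimal sketch, assuming the response shape shown above; `matching_urls` is a hypothetical helper name, not part of Dataverse or Preston:

```shell
#!/usr/bin/env bash
# matching_urls MD5: read a Dataverse search response (JSON) on stdin and
# print the download URL of every file item whose md5 field equals MD5.
# The field names (.data.items[], .type, .md5, .url) are the ones seen in
# the response above; other installations or versions may differ.
matching_urls() {
  local md5="$1"
  jq --raw-output --arg md5 "$md5" \
    '.data.items[] | select(.type == "file" and .md5 == $md5) | .url'
}

# Usage (needs network access):
#   curl -sL "https://dataverse.harvard.edu/api/search?q=fileMd5:48a76222cf5c06cb4f2d8f75cc0caa63" \
#     | matching_urls 48a76222cf5c06cb4f2d8f75cc0caa63
```

Filtering on the `md5` field again, rather than trusting the search hit, guards against a query that matches more loosely than expected.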

jhpoelen commented 7 months ago

with a list of endpoints available via -

curl "https://iqss.github.io/dataverse-installations/data/data.json"\
 | jq --raw-output .installations[].hostname

yielding

abacus.library.ubc.ca
dataverse.theacss.org
dataverse.ada.edu.au
dadosdepesquisa.fiocruz.br
dataverse.asu.edu
data.aussda.at
bonndata.uni-bonn.de
borealisdata.ca
dataverse.bhp.org.bw
data.brin.go.id
dataverse.cbpf.br
opendata.cesa.edu.co
dataverse.cidacs.org
data.cifor.org
data.cimmyt.org
dataverse.cirad.fr
science-data.hu
dataverse.csuc.cat
datasets.coronawhy.org
data.crossda.hr
researchdata.cuhk.edu.hk
dados.ipb.pt
archaeology.datastations.nl
ssh.datastations.nl
dare.uol.de
dataverse.dartmouth.edu
darus.uni-stuttgart.de
dataverse.ird.fr
data.sciencespo.fr
datarepositorium.sdum.uminho.pt
dataspace.ust.hk
edatos.consorciomadrono.es
dataverse.nl
dataverse.no
dataverse.rhi.hi.is
dorel.univ-lorraine.fr
researchdata.ntu.edu.sg
dunas.ua.pt
edmond.mpdl.mpg.de
dataverse.fgv.br
dataverse.fiu.edu
dvn.fudan.edu.cn
dataverse.orc.gmu.edu
data.univ-gustave-eiffel.fr
data.goettingen-research-online.de
dataverse.harvard.edu
heidata.uni-heidelberg.de
repositoriopesquisas.ibict.br
dataverse.icrisat.org
dataverse.mpi-sws.org
dataverse.ifdc.org
datasets.iisg.amsterdam
indata.cedia.edu.ec
dataverse.pushdom.ru
data.cipotato.org
dataverse.ipgp.fr
dataverse.iit.it
archive.data.jhu.edu
dataverse.jpl.nasa.gov
data.fz-juelich.de
keen.zih.tu-dresden.de
rdr.kuleuven.be
dataverse.lib.virginia.edu
lida.dataverse.lt
dataverse.acg.maine.edu/dvn
data.mel.cgiar.org
researchdata.nie.edu.sg
dataverse.nioz.nl
dataverse.lib.nycu.edu.tw
portal.odissei.nl
dataverse.uclouvain.be
dataverse.openforestdata.pl
osnadata.ub.uni-osnabrueck.de
papyrus-datos.co
opendata.pku.edu.cn
datos.pucp.edu.pe
data.qdr.syr.edu
entrepot.recherche.data.gouv.fr
redape.dados.embrapa.br
dataverse.unr.edu.ar
datos.uchile.cl
datos.unlp.edu.ar
research-data.urosario.edu.co
datav.udec.cl
repositoriodedados.unifesp.br
dataverse.ufabc.edu.br
dataverse.ileel.ufu.br
repositorio.polen.fccn.pt
dadosabertos.rnp.br
solo.mapbiomas.org
rodbuk.pl
agh.rodbuk.pl
pk.rodbuk.pl
uek.rodbuk.pl
uj.rodbuk.pl
dataverse.rsu.lv
data.scielo.org
sodha.be
datahub.tec.mx
dataverse.tdl.org
planetary-data-portal.org
dataverse.ucla.edu
dataverse.lib.unb.ca
dataverse.unc.edu
dataverse.lib.umanitoba.ca
dataverse.unimi.it
dataverse.vtti.vt.edu
data.worldagroforestry.org
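
Combining this host list with the search query from the Harvard example gives one candidate URL per installation. A rough sketch, assuming every listed host accepts the same `/api/search?q=fileMd5:` pattern (not verified here, and entries that carry a path, such as the Maine one, may not); `search_urls` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# search_urls MD5: read Dataverse hostnames on stdin (one per line) and
# print the search URL that asks each host for files with checksum MD5.
# The URL pattern is copied from the Harvard example; it is an assumption
# that every listed installation accepts it.
search_urls() {
  local md5="$1" host
  while IFS= read -r host; do
    [ -n "$host" ] || continue   # skip blank lines
    printf 'https://%s/api/search?q=fileMd5:%s\n' "$host" "$md5"
  done
}

# Usage (needs network access):
#   curl -s "https://iqss.github.io/dataverse-installations/data/data.json" \
#     | jq --raw-output '.installations[].hostname' \
#     | search_urls 48a76222cf5c06cb4f2d8f75cc0caa63
```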

jhpoelen commented 7 months ago

where the installation info comes from a crowd-sourced Google Sheet at - https://docs.google.com/spreadsheets/d/1bfsw7gnHlHerLXuk7YprUT68liHfcaMxs1rFciA-mEo/edit#gid=0

see also https://github.com/IQSS/dataverse-installations

jhpoelen commented 7 months ago

A first pass at integrating with the "DataVerse" should be available in the next Preston release.

Example 1 - query against specific DataVerse endpoint

time preston cat --remote https://dataverse.harvard.edu hash://md5/48a76222cf5c06cb4f2d8f75cc0caa63 | head 

yielded

** This file contains replication code for "Negative Campaigning in the Social Media Age: Attack Advertising on Facebook"
** Note that there are separate data files for Tables 1 and 2.  The data for the Online Appendix are found in the Table 1 dataset.

**Use Auter Fine PB Replication Data - Table 1 and Online Appendix.dta for the following models**
*Table 1 - Baseline model
nbreg negative revweek relativepos racecompetitiveness female challenger democrat teaparty oppneglag postrate if class2010==1, cluster(name)
*Table 1 - Interaction model
nbreg negative revweek relativepos racecompetitiveness relativeposXweek female challenger democrat teaparty oppneglag postrate if class2010==1, cluster(name)
*Online Appendix - Table A1 - Baseline model
reg neg_pct_avg revweek relativepos racecompetitiveness female challenger democrat teaparty oppneglag if class2010==1, cluster(name)

and took

real    0m3.031s
user    0m3.209s
sys 0m0.211s

Example 2: query against all registered dataverse endpoints

Using the "magic" host dataverse.org, Preston will try to find all registered Dataverse endpoints and ask each of them for the content.

time preston cat --remote https://dataverse.org hash://md5/48a76222cf5c06cb4f2d8f75cc0caa63 | head 

yields

** This file contains replication code for "Negative Campaigning in the Social Media Age: Attack Advertising on Facebook"
** Note that there are separate data files for Tables 1 and 2.  The data for the Online Appendix are found in the Table 1 dataset.

**Use Auter Fine PB Replication Data - Table 1 and Online Appendix.dta for the following models**
*Table 1 - Baseline model
nbreg negative revweek relativepos racecompetitiveness female challenger democrat teaparty oppneglag postrate if class2010==1, cluster(name)
*Table 1 - Interaction model
nbreg negative revweek relativepos racecompetitiveness relativeposXweek female challenger democrat teaparty oppneglag postrate if class2010==1, cluster(name)
*Online Appendix - Table A1 - Baseline model
reg neg_pct_avg revweek relativepos racecompetitiveness female challenger democrat teaparty oppneglag if class2010==1, cluster(name)

and took

real    1m41.738s
user    0m8.203s
sys 0m0.479s

Note that Example 2 may take a while to complete, because Preston works down a list of about 100 servers, querying each in turn until one of them claims to have the content. Optimization may help to reduce the response time if needed.
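
One possible optimization, sketched outside of Preston and not describing how Preston works internally, is to probe several hosts at the same time instead of one after another. `probe_in_parallel` is a hypothetical helper built on `xargs -P`; the probe command is whatever you pass in:

```shell
#!/usr/bin/env bash
# probe_in_parallel JOBS CMD [ARGS...]: run CMD ARGS <host> once for every
# hostname read from stdin, with up to JOBS probes in flight at a time.
# Hypothetical helper for illustration; not part of Preston.
probe_in_parallel() {
  local jobs="$1"
  shift
  xargs -P "$jobs" -n 1 "$@"
}

# Usage (needs network access): print the hosts that report a match.
#   curl -s "https://iqss.github.io/dataverse-installations/data/data.json" \
#     | jq --raw-output '.installations[].hostname' \
#     | probe_in_parallel 8 sh -c '
#         curl -s --max-time 10 "https://$1/api/search?q=fileMd5:48a76222cf5c06cb4f2d8f75cc0caa63" \
#           | jq -e ".data.total_count > 0" >/dev/null && echo "$1"' _
```

Results arrive in completion order rather than list order, so a caller that wants a single answer would take the first line printed.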

@mbjones @mielliott @cboettig

cboettig commented 7 months ago

Really cool! is dataverse only md5 based?

jhpoelen commented 7 months ago

> Really cool! is dataverse only md5 based?

Not sure, but DataVerse sure looks like an MD5-verse all over, and see https://github.com/IQSS/dataverse/issues/3354 and https://github.com/gdcc/dataverse-kubernetes/issues/68#issuecomment-543111780

jhpoelen commented 7 months ago

Turns out that there are cats in the DataVerse too . . .

preston cat --remote "https://dataverse.org" hash://md5/7d62417b5b689ed91dcd25f10c9c2132\
 > cat.jpg

who knew?

[image: cat]

jhpoelen commented 7 months ago

After fixing #270, the following screenshot was created for content rendered via:

https://linker.bio/hash://md5/7d62417b5b689ed91dcd25f10c9c2132

[image: Screenshot from 2023-12-14 13-25-00]

jhpoelen commented 7 months ago

fyi @pdurbin et al. - great to see that DataVerse supports querying content by its content id (or content hash). Thanks for making this possible. You can find examples of usage in this issue https://github.com/bio-guoda/preston/issues/269 .

pdurbin commented 7 months ago

@jhpoelen fun! I just started a thread in our chat about Preston. Please feel free to join in.

If you'd like to present at a community call or record something for DataverseTV, please let me know!

Oh, in ea7f9b5 I see you noticed the Dataverse installation in Maine is behaving differently API-wise. This is because it's running an old version of Dataverse (pre-4.x).

jhpoelen commented 6 months ago

@pdurbin happy to present at a community call. Please let me know when, and I'll try and make room in my schedule.

pdurbin commented 6 months ago

@jhpoelen great! For now I added you to our planning doc for Feb 6. Thanks!

pdurbin commented 5 months ago

@jhpoelen Happy New Year! Are you still interested in presenting at the Dataverse community call on Feb 6th? It's at 10am eastern time.

jhpoelen commented 5 months ago

@pdurbin presenting to your Dataverse community on 2024-02-06 at 10am eastern sounds like fun! Anything in particular you are interested in? Do you need an abstract / bio for the announcement?

pdurbin commented 5 months ago

@jhpoelen we aren't very formal. I just updated https://dataverse.org/community-calls to say that you'll talk about how Preston was recently integrated with Dataverse. We often record these talks and put them on DataverseTV, but it's up to you. How much time would you like? 20 minutes? Plus time for Q&A? Thanks for your interest in talking about this integration!

jhpoelen commented 5 months ago

> How much time would you like? 20 minutes? Plus time for Q&A? Thanks for your interest in talking about this integration!

20 minutes plus time for Q&A sounds great! Looking forward to our discussions.

jhpoelen commented 5 months ago

@pdurbin Thanks for having me at the DataVerse Community meeting today.

You can find the slides at:

https://jhpoelen.nl/dataverse-talk-2024-02-06/#/title-slide

and

https://github.com/jhpoelen/dataverse-talk-2024-02-06

pdurbin commented 5 months ago

@jhpoelen thanks for a great presentation! I just announced that your talk is now on DataverseTV.

For now I added a placeholder description but please feel free to suggest something better here or in the spreadsheet.

[image: Screenshot 2024-02-06 at 12 17 20 PM]

jhpoelen commented 5 months ago

@pdurbin Thanks again for the engaging conversation. Great to hear the different perspectives!

For future reference, I've packaged (and signed) the slides, recording etc. in:

Poelen, J. H. (2024, February 6). A DataVerse Beyond the Internet hash://md5/e34b50213fc407892d0810dabd742b1f. Zenodo. https://doi.org/10.5281/zenodo.10626561

Can you please include this citation in the DataVerseTV page?

jhpoelen commented 5 months ago

Also @pdurbin how can I best cite DataVerse and the DataVerse Community call?

jhpoelen commented 5 months ago

Also, I noticed that the DOI in the recommended citation for:

Joshua Carp, 2014, “cat.jpg”, CarpTest, https://doi.org/10.7910/DVN/24358/N4FCVS, Harvard Dataverse, V1

no longer resolved soon after the presentation.

curl --silent -IL https://doi.org/10.7910/DVN/24358/N4FCVS\
 | tail -n8

yielded:

HTTP/2 404 
date: Tue, 06 Feb 2024 21:47:41 GMT
content-type: application/xhtml+xml;charset=UTF-8
set-cookie: AWSALB=hSbuo7C/tdgx45oOdVIxHJy34jm9VAKJRgqCUH5g6ghlNrzS95nZPPmX0uafDUtE2WGHX1umL616aHHwL/iXIC1zDE28TyyfXCgTl4sbBJv//h8MpOq7kK1rd4+4; Expires=Tue, 13 Feb 2024 21:47:41 GMT; Path=/
set-cookie: AWSALBCORS=hSbuo7C/tdgx45oOdVIxHJy34jm9VAKJRgqCUH5g6ghlNrzS95nZPPmX0uafDUtE2WGHX1umL616aHHwL/iXIC1zDE28TyyfXCgTl4sbBJv//h8MpOq7kK1rd4+4; Expires=Tue, 13 Feb 2024 21:47:41 GMT; Path=/; SameSite=None; Secure
server: Apache
set-cookie: JSESSIONID=0641e5df9563852ed001fe67bf32; Path=/; Secure
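
To keep an eye on whether the DOI starts resolving again, the final status code can be pulled out of the same header output. A small sketch; `final_status` is a hypothetical helper name, and it assumes the `HTTP/<version> <code>` status-line format seen above:

```shell
#!/usr/bin/env bash
# final_status: read the header output of `curl -IL` on stdin and print the
# status code of the last response in the redirect chain (nothing is
# printed if no status line is found).
final_status() {
  awk '{ sub(/\r$/, "") }              # drop the CR that ends HTTP header lines
       /^HTTP\// { code = $2 }         # remember the most recent status code
       END { if (code != "") print code }'
}

# Usage (needs network access):
#   curl --silent -IL https://doi.org/10.7910/DVN/24358/N4FCVS | final_status
```

curl's own `--write-out '%{http_code}'` option reports the same number; parsing the headers just keeps the check close to the command used above.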

Luckily, the content id in the signed citation:

Joshua Carp, 2014, “cat.jpg”, CarpTest, https://doi.org/10.7910/DVN/24358/N4FCVS, Harvard Dataverse, V1 hash://md5/7d62417b5b689ed91dcd25f10c9c2132

still yields some results via non-dataverse sources like Zenodo and linker.bio (see below).

Great to have such a good example of the dynamic internet in action. Also, I wonder what the cat did to get ejected from the dataverse . . .

e.g.,

preston cat --remote https://zenodo.org hash://md5/7d62417b5b689ed91dcd25f10c9c2132 | md5sum
[https://zenodo.org/api/r...32%22&all_versions=true] 100.0% of 17 kB at 0.11 MB/s completed in < 1 minute
[https://zenodo.org/api/r...dcd25f10c9c2132/content] 100.0% of 4 MB at 0.52 MB/s completed in < 1 minute
7d62417b5b689ed91dcd25f10c9c2132  -

and

curl --silent -L https://linker.bio/hash://md5/7d62417b5b689ed91dcd25f10c9c2132\
 | md5sum
7d62417b5b689ed91dcd25f10c9c2132  -
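
The comparison done by eye in the two checks above can also be scripted. A minimal sketch, assuming GNU `md5sum` and a `hash://md5/<hex>` URI with a lowercase digest; `verify_md5` is a hypothetical helper name, not a Preston command:

```shell
#!/usr/bin/env bash
# verify_md5 HASH_URI: read content on stdin and succeed (exit status 0)
# only if its MD5 digest equals the hex digest in a hash://md5/<hex> URI.
verify_md5() {
  local expected="${1#hash://md5/}" actual   # strip the URI prefix
  actual="$(md5sum | cut -d ' ' -f 1)"       # md5sum prints "<hex>  -"
  [ "$actual" = "$expected" ]
}

# Usage (needs network access):
#   curl --silent -L https://linker.bio/hash://md5/7d62417b5b689ed91dcd25f10c9c2132 \
#     | verify_md5 hash://md5/7d62417b5b689ed91dcd25f10c9c2132 && echo verified
```

Because the check depends only on the bytes received, it works the same way whichever source served the content.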

pdurbin commented 5 months ago

@jhpoelen hi! I fixed up the DataverseTV description. For the rest, it looks like you also posted to https://groups.google.com/g/dataverse-community/c/-n0mXap9qjg/m/XslA_jpOAAAJ

I'd rather not reply in two places. Is it ok if I pick one? 😄

Yeah, we need a better way to cite Dataverse itself. There's discussion about that here:

In short, please use this:

Gary King. 2007. “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing.” Sociological Methods and Research, 36, Pp. 173–199.

And sorry, there's no way to cite the community call. I guess I'd suggest linking to the notes: https://docs.google.com/document/d/1t0eY4mh2f2aH6yhnzfyXF9J05yUgr8A5aMDIMyuae80/edit?usp=sharing

jhpoelen commented 5 months ago

@pdurbin thanks for your update and for sharing the links. I've used your information to update the description of:

Poelen, J. H. (2024, February 6). A DataVerse Beyond the Internet hash://md5/e34b50213fc407892d0810dabd742b1f. Zenodo. https://doi.org/10.5281/zenodo.10626561

Happy to take suggestions on how to better represent and cite the great work that you and your colleagues are doing . . .

jhpoelen commented 5 months ago

Just to document the 404 page generated by the Harvard Data Verse URL

https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/24358/N4FCVS

on 12 Feb 2024

[image: screenshot of the 404 page]

jhpoelen commented 5 months ago

see also https://groups.google.com/g/dataverse-community/c/-n0mXap9qjg/m/pfuk0IudAAAJ and cross-posted text below -


Hi Data-nauts, Dataversians, (What do you call folks inhabiting the DataVerse?)

Julian asked:

You wrote that this supports some of the claims you made in your talk. Could you write more about this?

In my published slides and recorded talk of the 6 Feb 2024 dataverse community call:

Poelen, J. H. (2024, February 6). A DataVerse Beyond the Internet hash://md5/e34b50213fc407892d0810dabd742b1f. Zenodo. https://doi.org/10.5281/zenodo.10626561

, I asked the questions (see also https://jhpoelen.nl/dataverse-talk-2024-02-06/#/guiding-questions):

How do you cite data? How do you look up cited data now? How do you look up cited data 40 years from now?

and proceeded to take the Harvard Kitty citation as suggested by Harvard Data Verse (HDV):

Joshua Carp, 2014, “cat.jpg”, CarpTest, https://doi.org/10.7910/DVN/24358/N4FCVS, Harvard Dataverse, V1

And less than a week later (not 40/50 years later), the (aspirationally) "Persistent Identifier" (aPID) doi:10.7910/DVN/24358/N4FCVS minted by the HDV no longer resolves (see attached screenshot) as if the kitty never existed.

https://doi.org/10.7910/DVN/24358/N4FCVS redirected to https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/24358/N4FCVS which caused a 404

I know that this is a sample size of N=1, but it does support my claim made later in the presentation (also see https://jhpoelen.nl/dataverse-talk-2024-02-06/#/how-to-retrieve-this-cat-picture-50-years-from-now):

How To Retrieve This Cat Picture 50 Years From Now?

Joshua Carp, 2014, “cat.jpg”, CarpTest, https://doi.org/10.7910/DVN/24358/N4FCVS, Harvard Dataverse, V1

Likely will not work due to intricate network of dependencies.

Also, note that the signed citation (as proposed in my presentation):

Joshua Carp, 2014, “cat.jpg”, CarpTest, https://doi.org/10.7910/DVN/24358/N4FCVS, Harvard Dataverse, V1 hash://md5/7d62417b5b689ed91dcd25f10c9c2132

allows for retrieving the cat picture via its digital fingerprint hash://md5/7d62417b5b689ed91dcd25f10c9c2132 :

https://linker.bio/hash://md5/7d62417b5b689ed91dcd25f10c9c2132

preston cat --remote https://linker.bio,https://dataverse.org hash://md5/7d62417b5b689ed91dcd25f10c9c2132

while leaving open other known, or as of yet unknown, methods to retrieve published digital data via their signature.

I hope this message shows that the case of the lost Harvard Kitty provides evidence for my claim that our current way of citing (and resolving) digital datasets may need a little work beyond including aPIDs to help carry our digital knowledge into the future.

Curious to hear your thoughts,

-jorrit https://jhpoelen.nl

PS. I've attached a copy of the Harvard Kitty just to have another place to be able to retrieve the cute 4.5MB cat picture (harvard-kitty.jpg).