IQSS / dataverse

Open source research data repository software
http://dataverse.org

What are the allowed search fields for the Search API q parameter? #2558

Closed — leeper closed this issue 3 weeks ago

leeper commented 8 years ago

I'm looking at the Search API Docs. What are the allowed fields for the q parameter? It appears to include the list of Dataverse DB Elements mentioned in the metadata crosswalk but it also appears to include other fields not listed there. Is there a complete list? And can the documentation be updated accordingly?

pdurbin commented 8 years ago

@leeper it depends! :) @markwilkinson asked about this too, as I mentioned in #2291 .

At the very least, I could document the fact that the fields supported by an installation of Dataverse 4 depend on which domain-specific metadata schemas (metadata blocks) have been enabled. http://guides.dataverse.org/en/4.1/user/appendix.html#metadata-references contains a list as of 4.1 but there are other site-specific ("custom") metadata blocks used only by Harvard as of this writing. All metadata blocks are stored as TSV files and then loaded into the system at installation time: https://github.com/IQSS/dataverse/tree/v4.1/scripts/api/data/metadatablocks . When we update these TSV files, we add them to the list of data-driven fields we index into Solr: https://github.com/IQSS/dataverse/blob/v4.1/conf/solr/4.6.0/schema.xml#L328 . You'll see references to the "custom" Harvard-specific blocks like GSD and PSI in that Solr schema config.

Parsing those TSV files is a little rough (#2551) and I wouldn't wish it on any API user so perhaps we should allow API users to interrogate a running Dataverse installation for a list of supported metadata fields. I can imagine this being part of the Search API itself. Maybe you call into /api/search/fields or something...

I recently stumbled upon the fact that I can go to https://dataverse.harvard.edu/api/metadatablocks to find a list of metadata blocks as documented at http://guides.dataverse.org/en/4.1/api/native-api.html#metadata-blocks but I didn't quickly find how to list the fields within each metadata block. I did add an "admin-only" API endpoint which I mentioned at https://github.com/IQSS/dataverse/issues/2357#issuecomment-121677178 that lets me list all the fields from http://localhost:8080/api/admin/datasetfield but the output needs a lot of work. Also, that API endpoint only shows the data-driven fields, not the static ones in SearchFields.java I mentioned in #2291. (At some point we'll probably want to change these static fields to be fed from the database for #2039 .)

Oh, and some sensitive fields such as for email addresses aren't indexed for privacy reasons per #759 .

Going to an Advanced Search Page such as https://dataverse.harvard.edu/dataverse/harvard/search for the root dataverse can be a help in figuring out which fields are searchable but as #2353 notes right now you can't see the domain-specific metadata blocks at the root. I mention this because different blocks can be enabled at different dataverses within the tree of dataverses in a single Dataverse installation. So maybe when you ask the Search API for a list of supported fields you could supply the dataverse of interest and it will tell you which metadata blocks are enabled. Or rather, it would tell you the search fields that are available based on the metadata blocks enabled from that dataverse (i.e. social science vs. astronomy).

@leeper I'm sure this is way more information than you wanted! Thanks for opening this issue. :)

To sum up, I can at least improve the Search API documentation a bit. I should probably add something to the Search API so that API users can simply get a list of fields they can search on, perhaps with respect to where in the tree of dataverses they are searching (the root dataverse vs. a subdataverse).

pdurbin commented 8 years ago

@leeper I looked at the code and played around with the already existing "GET http://$SERVER/api/metadatablocks/$identifier" endpoint documented at http://guides.dataverse.org/en/4.1/api/native-api.html#metadata-blocks

Perhaps you and @markwilkinson and anyone else interested in knowing which fields are supported could play around with this metadatablocks API endpoint and give us feedback on it. It looks like it was developed by @michbarsinai and it seems quite useful. Here's how I can imagine it being used:

Get a list of metadata blocks that are enabled

curl -s https://apitest.dataverse.org/api/metadatablocks | jq .data[].name -r

citation
geospatial
socialscience
astrophysics
biomedical
journal

For each of the metadata blocks, show the fields

curl -s https://apitest.dataverse.org/api/metadatablocks/citation | jq . | head -20

{
  "status": "OK",
  "data": {
    "id": 1,
    "name": "citation",
    "displayName": "Citation Metadata",
    "fields": {
      "title": {
        "name": "title",
        "displayName": "Title",
        "title": "Title",
        "type": "TEXT",
        "watermark": "Enter title...",
        "description": "Full title by which the Dataset is known."
      },
      "subtitle": {
        "name": "subtitle",
        "displayName": "Subtitle",
        "title": "Subtitle",
        "type": "TEXT",
...

In the output above, the field to search on is listed under "name", such as "title" or "subtitle".
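For scripting, a jq one-liner can pull out just those machine-readable field names (a sketch; it assumes the response shape shown above and the same apitest.dataverse.org test server):

```shell
# List the searchable field names in the "citation" metadata block.
# The response shape (.data.fields.<field>.name) matches the JSON above.
curl -s https://apitest.dataverse.org/api/metadatablocks/citation \
  | jq -r '.data.fields[].name'
```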

Of course, these are only the data-driven fields at the dataset level, not the static fields in SearchFields.java I mentioned, but some of those fields aren't searchable by design (though we recently made more of them searchable as part of #2038).

markwilkinson commented 8 years ago

Thanks for the update! :-)

Mark

leeper commented 8 years ago

@pdurbin Excellent! This response is a lot to parse! I'll take a look and see what I can do. I guess the minimum solution is to provide a flexible interface and then I can build on features that help tailor use of the API when there are known metadata schemes. Being able to query what those are for any particular installation would definitely be a helpful feature of the search API.

pdurbin commented 7 years ago

Issue #1510 is related in the sense that people don't know what subjects are allowed when creating a dataset (and it's a required field).

pdurbin commented 5 years ago

In pull request #6107 I at least linked back to this issue so API users can get a sense of how they can know what the allowed search fields are. Here's the commit: d3a5b2f

If anyone wants to help with an actual solution to this issue, I'm happy to mentor them. I'm thinking that for now we could just list the "out of the box" fields in the API Guide.

Jerry-Ma commented 2 years ago

Hi, I was trying to find a reference for the query string for searching particular files. However, it seems the above discussion is more about searching dataverse/dataset metadata. Could anyone point me to the place that shows the keys we can use for searching files?

The use case for me is that I am creating a script that uploads files to a dataset via the API, and I would like to check whether a particular file named "foo" (with filepath foo.txt) already exists in a certain dataset of global_id="doi:10.5072/FK2/J9EK29", which is within the dataverse identified by id="bar"

So far the furthest I've gotten is the following:

api/search?q=fileName:foo&type=file&subtree=bar&sort=date&order=desc.

There are a couple of issues with this:

qqmyers commented 2 years ago

The DVUploader (https://github.com/GlobalDataverseCommunityConsortium/dataverse-uploader) uses the /api/datasets/:persistentId/versions/:latest/files call to get the list of files in a dataset. You might also want to look at pyDataverse (https://github.com/gdcc/pyDataverse). Both of these tools might be things you could use to upload files, but they also would show you the API calls you might want to make.
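For example, a minimal sketch of the existence check using that files listing endpoint (the server URL and the DOI are placeholders from the question above, and the .data[].dataFile.filename path assumes the usual shape of that endpoint's JSON response):

```shell
# Check whether a file named "foo.txt" already exists in a dataset,
# using the dataset files listing instead of the Search API.
SERVER="https://demo.dataverse.org"        # placeholder installation
PID="doi:10.5072/FK2/J9EK29"               # placeholder dataset DOI
curl -s "$SERVER/api/datasets/:persistentId/versions/:latest/files?persistentId=$PID" \
  | jq -r '.data[].dataFile.filename' \
  | grep -qx 'foo.txt' && echo "already exists" || echo "not found"
```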


Jerry-Ma commented 2 years ago

@qqmyers

Thank you for the links. I'll take a look at DVUploader in detail. The direct upload with a storage identifier is also going to be useful for our use case, because we have our own storage service (not Amazon S3).

I am already using pyDataverse for creating datasets and uploading datafiles. It works great so far, but it lacks certain logic (like checking whether a datafile already exists) that I need to implement on my own. The repo for my mentioned workflow is here: https://github.com/toltec-astro/dvpipe.

Just a bit of background: this effort is part of the software infrastructure that we are building for the Large Millimeter Telescope. We have set up a Dataverse instance at https://dp.lmtgtm.org and plan to use it as the main channel to distribute the data products produced by the software pipelines that reduce the data taken by various instruments on the LMT. dvpipe is to be the automation pipeline that packages the data reduction pipeline outputs and sends them to the Dataverse server.

qqmyers commented 2 years ago

Nice! (I was involved with the Dark Energy Survey telescope data management project a few years ago.) W.r.t. pyDataverse, @skasberger is open to pull requests, so if there is logic you think should go there, please consider adding to it. (In particular, it would be great to get the direct upload capabilities in there.)

pdurbin commented 2 years ago

I would like to check if a particular file named "foo" (with filepath foo.txt) already exists in a certain dataset of global_id="doi:10.5072/FK2/J9EK29", which is within the dataverse identified by id="bar"

The approach suggested by @qqmyers to download the list of files is probably the most reliable but I thought I'd chime in specifically about the Search API question above.

@Jerry-Ma you can search against the parentIdentifier field with the DOI of the dataset like this:

https://dataverse.harvard.edu/api/search?q=name:2019-02-25.tab&fq=parentIdentifier:doi\:10.7910/DVN/TJCLKP
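A sketch of the same query built with curl, so the colon inside the DOI gets backslash-escaped for Solr and the whole parameter is URL-encoded (the escaping is inferred from the URL above, not from official documentation):

```shell
# Search for a file by name within a specific dataset.
# Solr needs the colon inside the DOI escaped as "\:"; curl's
# --data-urlencode (with -G) takes care of the URL encoding.
DOI='doi:10.7910/DVN/TJCLKP'
curl -s -G 'https://dataverse.harvard.edu/api/search' \
  --data-urlencode 'q=name:2019-02-25.tab' \
  --data-urlencode "fq=parentIdentifier:${DOI//:/\\:}" \
  | jq '.data.total_count'
```

Note that the `${DOI//:/\\:}` substitution is a bash-ism; in other shells you would escape the colon by hand.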

Please note:

Also, if you'd like to include your installation on our map, please feel free to open an issue at https://github.com/IQSS/dataverse-installations !

cmbz commented 3 weeks ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.