Searching on DwC field datasetName

rondlg commented 4 years ago

Hi, how do I search for occurrences with a particular value in the datasetName field?

I can see the data in the record page but don't know which field it is in the advanced search list.

If that field isn't available to search on it would be really helpful if it could be added

Thanks

Sharon

rukayaj commented 2 years ago

I guess this is similar to the request made in https://github.com/gbif/portal-feedback/issues/3026

Here's some more info about our particular use case: We need to group some records in multiple datasets together, and then download them. If I do a free text search like https://www.gbif.org/occurrence/search?q=Artsprosjekt_55-12_PolyNor for "Artsprosjekt_55-12_PolyNor" (one of the project names) then I get the correct records, but no way to download them. We've been publishing the project name for each record under 'datasetName' (e.g. https://www.gbif.org/occurrence/3436305215).
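A minimal sketch of the same free-text search against the public occurrence search API, followed by a client-side filter on datasetName. It assumes the API accepts the same `q` parameter as the portal link above and that each returned record carries a `datasetName` key. The helper names `build_search_url` and `with_dataset_name` are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public GBIF occurrence search API.
API_BASE = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(text, limit=100, offset=0):
    """Return a free-text occurrence search URL for one page of results."""
    params = {"q": text, "limit": limit, "offset": offset}
    return API_BASE + "?" + urlencode(params)


def with_dataset_name(records, name):
    """Keep only the records whose datasetName equals `name` exactly."""
    return [r for r in records if r.get("datasetName") == name]
```

The filter matters because a free-text match can hit any field, so records that merely mention the project name elsewhere would otherwise slip through.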

This functionality is necessary for our researchers. If there's no planned work on this maybe @MortenHofft you have another idea for an additional field we can add the datasetName to which is searchable, as a work-around?

MortenHofft commented 2 years ago

I cannot really think of any beyond those mentioned in the referenced issue.

That essentially means using publisher, institution, eventIds and collection (when appropriate, of course: they shouldn't be misused just to group records that aren't in fact a collection), or splitting into multiple datasets. I cannot think of any other way to group records across or within datasets.

Perhaps others can think of another approach? @ManonGros? If not, and there is a strong request, then the danger is that we will see bad data that misuses e.g. collectionCode as a hack to achieve what is needed. That would be a shame.

rukayaj commented 2 years ago

Is projectID the projectID from the EML? Pity it's not on a record level... I suppose one could argue that the grouped datasets are all part of a 'collection', kind of? Is collectionCode that bad a hack do you think? I see in the definition it says 'identifying the collection or data set from which the record was derived'.

ManonGros commented 2 years ago

I cannot think of any alternative (other than the ones listed in the other issue). The collection code hack isn't ideal, especially in the context of specimen records (for observations it would make more sense).

Yes the projectID is from the EML so it is for all the records in a given dataset. This is the same problem as the networks (they include whole datasets). I suppose we could:

investigate whether projectID or networks could be at the record level (although this wasn't their intended purpose and it might be difficult to do)
or consider making the datasetName field searchable (that might be better)
or have/use a new term (I am not sure about that).

@ahahn-gbif do you have any input on the topic? (the question is how to aggregate/download records that are part of several datasets)

MortenHofft commented 2 years ago

Should it be possible to be part of multiple "projects/datasets"?

ahahn-gbif commented 2 years ago

ProjectID in GBIF (and EML metadata) is presently given preference for projects run by or through GBIF (BID, BIFA, CESP and friends). The term is not (to my knowledge) defined again at record level in Darwin Core, so the limitation is, as recognized, that a) a projectID is applied at dataset level, and b) not more than one projectID can be assigned to a dataset. In that sense, I would advise against that choice.

Overloading any DwC term to find a work-around for some practical need is not a good idea. https://dwc.tdwg.org/terms/#dwc:datasetName is defined as "The name identifying the data set from which the record was derived." If that is factually correct in the data, then we would not want to encourage using other terms against their actual definition.

If there is a recognized need in the community to be able to search this term through the user interface, this may be a change request. It is quite possibly not a widespread user demand, so my question would be how often it is used (yearly reporting? regularly?), and by which kind of "customers". Would an API access option possibly be enough to satisfy this need?
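If API access turned out to be enough, retrieving every matching record would mean paging through search results. A hedged sketch, assuming each page is a dict with a `results` list and an `endOfRecords` flag; `fetch_page` is a hypothetical callable supplied by the caller (for example a thin wrapper around an HTTP GET), not part of any GBIF client.

```python
def iter_records(fetch_page, limit=100):
    """Yield records page by page until the service reports the last page.

    `fetch_page(offset, limit)` is a caller-supplied function (hypothetical)
    that returns one page as a dict with "results" and "endOfRecords".
    """
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page["results"]
        # Stop if the flag is missing, so a malformed page cannot loop forever.
        if page.get("endOfRecords", True):
            break
        offset += limit
```

Passing the fetcher in keeps the paging logic independent of any particular HTTP library and easy to check against canned pages.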

rukayaj commented 2 years ago

I would actually think this is quite a common scenario, and that there are many field projects which go out on yearly collection trips, taking specimens which go into several collections. And then of course it's necessary for the individual projects to be able to see only their specimens.

rondlg commented 2 years ago

My 2-penneth is that it's definitely common at my institution to want to do this kind of thing and it's not easy to do right now.

There are a few things that we use the datasetName field for. Usually it is something with funding but not always:

The name of a digitization project
An expedition
A Research Project
A Lab
etc. etc.

Users have asked us how to retrieve the data associated with one or more of the above. Sometimes it's to show funders that a goal was achieved, either in a single institution or across multiple institutions, or we would like to be able to include/reference GBIF datasets for a particular datasetName on our web properties.

The example I give here is our Rapid Inventories project, which has been going for decades. They would like to be able to retrieve everything from a given expedition, and the records cut not only across institutions but also across taxa.

Maybe this is tied up with events, I dunno, but if it is, we still need something simple for users and providers to work with.

I'll show my ignorance, but is there a place to mint IDs for projects/expeditions? If there is, great; if not, we are stuck with datasetName.

Our CMS allows us to record multiple projects per occurrence.
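A rough sketch of how several project names per occurrence might be handled on the consuming side, assuming they are published in a single datasetName string separated by a vertical bar. That delimiter is an assumption for illustration, not something this thread has agreed on, and the helper names are hypothetical.

```python
def split_names(value, sep="|"):
    """Split a delimited datasetName string into a list of trimmed names."""
    if not value:
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def in_project(record, project):
    """Return True if `project` is one of the record's datasetName values."""
    return project in split_names(record.get("datasetName"))
```

An exact match on each split value avoids the false positives a substring search would give when one project name contains another.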

ManonGros commented 2 years ago

I think @albenson-usgs also mentioned the need for aggregating specific occurrences across datasets. If I remember correctly, the collectionCode was/is used for that purpose.

rondlg commented 2 years ago

collectionCode is a problem for us to use in this regard because it is used at a much higher level, for example to distinguish between the "Bird" collection and the "Fossil Herps" collection. These values are also unitary.

rukayaj commented 2 years ago

Should it be possible to be part of multiple "projects/datasets"?

We've just had a request for this: https://github.com/gbif-norway/helpdesk/issues/90

timrobertson100 commented 2 years ago

Please see https://github.com/gbif/pipelines/issues/662 where we intend to implement multivalue dataset ID and name search capabilities shortly.

gbif / portal-feedback

Searching on DwC field datasetName #3006