IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Create metadata blocks for CAFE's collection of climate and geospatial data #232

Closed jggautier closed 9 months ago

jggautier commented 11 months ago

This GitHub issue is being used to track progress of the creation of a metadata block or metadata blocks I'm helping design for a Dataverse collection that the BUSPH-HSPH Climate Change and Health Research Coordinating Center (CAFE) will be managing on Harvard Dataverse. Their unpublished collection is at https://dataverse.harvard.edu/dataverse/cafe.

In this repo at https://github.com/IQSS/dataverse.harvard.edu/tree/master/metadatablocks, I've added the .tsv and .properties files that define the metadata fields, and I'll continue updating those files as the CAFE folks review and improve the metadata fields.

This screenshot shows the metadata block we're planning to add, as of 2023-11-07, so that depositors can describe the geospatial data:

[Screenshot: cafedatalocation]

This screenshot shows the metadata block we're planning to add, as of 2023-11-07, so that depositors can describe the source datasets of the dataset being deposited:

[Screenshot: cafedatasources]

jggautier commented 11 months ago

I'd also like to use this GitHub issue to record the concerns/risks with this effort, similar to how, in other GitHub issues, we've noted that metadata fields in metadata blocks created for other collections in HDV serve purposes that overlap with fields that are already available, such as fields in the Citation metadata block, and that they facilitate describing data that others in the community have expressed interest in, like the metadata block for 3D Data discussed in https://github.com/IQSS/dataverse.harvard.edu/issues/144.

Metadata added in custom metadata blocks won't be in most metadata exports

I spoke with the CAFE collection administrators about how the metadata added in these new metadata blocks won't be included in most metadata exports and won't be used to make the datasets more discoverable in other systems, such as search engines. This is the case with all "custom" metadata blocks we've added for collections in Harvard Dataverse.
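If it helps to check this concretely later, pulling a couple of export formats for a test dataset through the export API shows the difference; the DOI below is just a placeholder, dataverse_json is the native export (which does carry custom-block fields), and schema.org is one of the standards-based exports (which doesn't):

curl "https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/XXXXXX"
curl "https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/XXXXXX"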

Showing or hiding fields based on what's entered in other fields so that depositors see only relevant fields

We talked about how Dataverse has no way to show or hide fields based on what's entered in other fields, which is what they wanted to do for the first field in both metadata blocks so that depositors see only relevant fields.

Those first two fields are dropdown menus where the options are "Yes" and "No". So if a depositor chooses "No" for the "Geospatial File Type" field, they shouldn't enter metadata in the other fields that describe a geospatial file, since there isn't one. Since Dataverse will always show all of the fields, the CAFE folks plan to address this with instructions in a dataset template and/or training.
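As a rough illustration of how such a field is defined in the block's TSV (the database name and column values below are illustrative rather than copied from customCAFEDataLocation.tsv, and most columns are omitted): a Yes/No dropdown is just a text field with allowControlledVocabulary set to TRUE in the #datasetField section, plus one row per term in the #controlledVocabulary section.

#datasetField          name                    title                 fieldType  allowControlledVocabulary  ...
                       cafeGeospatialFileType  Geospatial File Type  text       TRUE                       ...

#controlledVocabulary  DatasetField            Value  identifier  displayOrder
                       cafeGeospatialFileType  Yes                0
                       cafeGeospatialFileType  No                 1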

Letting depositors type in and enter a term in a field that uses a vocabulary

We talked about how if depositors want to enter their own term for fields that include a vocabulary, such as the "Spatial File Type" field, they'll need to choose the dropdown menu's "Other" option, and type their term in the "Other Spatial File Type" field, which is always shown whether or not the depositor chooses "Other" in the first field. We've used this pattern for a field in the Life Sciences metadata block and in other custom metadata blocks in HDV.

The external controlled vocabulary mechanism handles this in a more common and arguably better way by using a UI component that lets depositors choose a term from a vocabulary and also enter their own terms in the same field. But this mechanism works only for vocabularies hosted externally and not for vocabularies that are defined in metadata block TSV files.

Custom metadata block about data location versus the geospatial metadata block that ships with Dataverse

The collection's administrators wanted to add fields to the geospatial metadata block that ships with Dataverse. Because that would take more time than they have, we agreed to create this new metadata block for the CAFE collection instead. They're interested in joining the Dataverse community's discussions about improving how depositors describe geospatial data, and I'll need to connect them with @pdurbin and others who've worked on this.

Describing geospatial files in the dataset-level metadata

Collection administrators expect that each deposit will include either no geospatial file or only one geospatial file, which these metadata fields will describe. @cmbz has included this use case with others being collected to support the need for improving Dataverse's ability to record file-level metadata.

Overlap among fields in the "Metadata Block About Data Sources" and fields in the Citation metadata block

We talked about how the fields in the "Metadata Block About Data Sources" overlap with the "Related Dataset" and "Data Source" fields. They planned to hide those "Related Dataset" and "Data Source" fields so that depositors aren't confused, and because they expect depositors to need to use only the fields in the custom metadata block to describe a source dataset that they used when producing their deposit.

I also mentioned that once Dataverse can send metadata about related resources to DataCite (https://github.com/IQSS/dataverse/issues/5277), we'll need to think about whether and how to include the related datasets described in their custom metadata block.

Automatic layout of child fields might make it hard for depositors to fill in fields the way we expect

We talked about how the automatic layout of the child fields might confuse depositors. For example, depositors need to understand the relationship between the "Type" and "Other Type" fields in the "Metadata Block About Data Sources", since they're asked to use the "Other Type" field to add a term that isn't in the "Type" field's dropdown list. But in the UI, there's no visual indication that these fields rely on each other, other than their names.

[Screenshot: the "Type" and "Other Type" child fields as laid out in the deposit form]

We've seen and talked about how this design also confuses depositors who use other compound fields like the Related Publication fields in the Citation metadata block. There's related discussion in https://github.com/IQSS/dataverse/issues/5277.

Metadata in "Metadata Block About Data Sources" is hard to read when viewing metadata on dataset page We talked about how when the metadata is displayed on the dataset page, it's hard to read. This is discussed more in https://github.com/IQSS/dataverse/issues/6589.

cmbz commented 11 months ago

2023/11/13

landreev commented 10 months ago

@jggautier Just to confirm - am I installing both customCAFEDataLocation.tsv and customCAFEDataSources.tsv in prod.?

jggautier commented 10 months ago

Ah yes, the CAFE collection's managers would like both of those metadata blocks available for the collection. I'll update this issue's title.

landreev commented 10 months ago

I looked into this briefly, and I'm wondering if it would be better for new blocks to go through more of a QA process, like we do with everything else, before deploying them in prod.

My biggest questions/concerns were with the GeoSpatialResolution fields in these blocks, since we just had to spend so much effort addressing issues with similar fields in the Geospatial block.

There may be some similar issues with validation for the values in this block. Namely, the values are defined as floats, so it is impossible to enter anything that does not parse as a decimal number. This is the right behavior when "Decimal degrees" is selected in the "Unit" pulldown, but you can also select "Degrees-minutes-seconds" in the same pulldown - and it is then impossible to enter such a value:

[Screenshot: validation error when entering a degrees-minutes-seconds value]

(There are other notations for formatting "degrees-minutes-seconds" values, of course, but none of them will parse as a valid decimal number.)

I feel like if we want this field to support all the notations listed, the only way to achieve that would be to switch it back to text, and add custom validation methods, like we did with the Geospatial block fields.
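To make that concrete, here's a simplified sketch of the relevant part of the field's row in the TSV (most columns omitted; only the fieldType column differs between the two variants). The fieldType column is the only validation the TSV format itself provides, so supporting the other notations would mean loosening it to text and doing any numeric checking in custom code:

As defined now (anything that doesn't parse as a number is rejected):
cafeSpatialResolutionValue   Value   ...   float   ...

If all of the listed notations had to be accepted (no numeric validation at all):
cafeSpatialResolutionValue   Value   ...   text    ...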

landreev commented 10 months ago

I was also told that there were some technical issues with bringing up a test instance for the researchers involved to experiment with. But I feel like that part must be something we can figure out.

jggautier commented 10 months ago

@sbarbosadataverse, it was agreed to continue testing this and the other metadata block after they were added to Harvard Dataverse.

But can we bring up this concern, about validation, with the collection's manager Keith, and ask if user testing can be done before these metadata blocks are added?

landreev commented 10 months ago

We had a quick chat about this on Slack. It sounded like I should clarify what I said above:

like we do with everything else, before deploying them in prod.

By "like we do with everything else" I didn't meant literally the same process as how we QA dev. issues - deploying it on dataverse-internal, have the same QA person test it, etc. etc. I meant more like the same idea of testing and confirming that everything works properly before trying it in prod. It sounds like this should be a somewhat different process for custom blocks - focused more on letting the researchers who requested the block do the testing and confirming that everything works the way they like.

landreev commented 10 months ago

@sbarbosadataverse, it was agreed to continue testing this and the other metadata block after they were added to Harvard Dataverse.

But can we bring up this concern, about validation, with the collection's manager Keith, and ask if user testing can be done before these metadata blocks are added?

Basically, rather than using production for testing these blocks, let's let the collection admin(s) experiment with them on a test instance.

jggautier commented 10 months ago

A Dataverse instance was spun up with the metadata blocks (thanks @landreev!) and I gave the collection manager the URL so that he can review the fields and the concerns that Leonid brought up, and potentially review them with others who will be helping manage or deposit data in the collection.

There's a question about whether it's possible to create the instance with HTTPS so that things entered in the site are secure. It's possible they thought they'd have to log in or create an account using the same credentials they use on Harvard Dataverse. I'm waiting to learn more about why it's needed and whether it's possible or worth the trouble.

We'll probably pick this up next week.

jggautier commented 9 months ago

Never mind about creating the test instance with HTTPS instead of HTTP. The repository manager was able to create an account and review the new fields.

We're emailing this week about changes to a few of the fields and I'll describe those as the work ramps back up after the holiday break.

landreev commented 9 months ago

OK, please let me know if you need anything else from me and/or if the block is ready to be installed in prod.

jggautier commented 9 months ago

Thanks. I just updated the customCAFE files in the metadatablocks directory of this repo.

I haven't been able to see if the changes I've made to the TSV and .properties files result in what the collection admin and we expect to see in the UI (since I haven't been able to spin up branches with custom metadata blocks on AWS and more recently started seeing error messages that prevent me from creating a local instance using Docker).

@landreev, could you install both metadata blocks on Harvard Dataverse when you have time? Then I'd let the collection admin know so they can make sure everything is working as expected before they start creating datasets.

Summary of most recent changes and related feedback about how Dataverse collects metadata

So that there's a record of the changes we've made since this last review with the collection admin and a record of what we learned about how Dataverse collects metadata, I'm writing about them here, too:

We adjusted the Spatial Resolution fields:

We also rearranged most of the fields in the "Metadata About Data Sources" metadata block. Here's what it used to look like:

[Screenshot: the original compound field with its 16 child fields]

Each of that compound field's 16 child fields has been made into a primitive field instead, so that the metadata is easier to read when it's displayed on the dataset page. https://github.com/IQSS/dataverse.harvard.edu/issues/166 describes a similar change we made to another metadata block.

Here's a mockup of what those 16 primitive fields should look like:

[Mockup: the 16 primitive fields]

These design constraints with the number of child fields in a compound field, described more in my earlier comment in this GitHub issue and at https://github.com/IQSS/dataverse/issues/6589, keep coming up. Until we're able to seriously think about how to address them, I'm leaning toward recommending that we don't create compound fields with more than four child fields. I was asked to host a webinar this year about creating metadata blocks, and I'll probably include this recommendation in it.

This change from a compound field to primitive fields also means that people depositing into the CAFE collection can include only one "source" dataset. The collection admin let us know that most depositors will need to include only one, so they hope it's not an issue, and that they can use the other fields in the Citation metadata block to describe other related datasets.

We're also no longer able to make some of these fields required if the depositor does have a source dataset to describe. When the fields were part of a compound field, we used the "conditionally required" functionality to make some fields required. But now, depositors can choose any or none of these 16 fields if they indicate that their deposit was derived from another dataset.
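For the record, the "conditionally required" behavior we were relying on comes from the required column in the TSV: a child field marked required under a compound field that is itself optional is only enforced when the depositor fills in at least one of the compound's fields. A simplified sketch with illustrative field names (most columns omitted):

#datasetField  name                    title                 required  parent
               cafeDataSource          Data Source           FALSE
               cafeSourceDatasetTitle  Source Dataset Title  TRUE      cafeDataSource

Once those children become top-level primitive fields with no parent, a TRUE in the required column would make them required for every deposit, which is why we had to drop that behavior.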

However, before we ditched the compound field in the "Metadata About Data Sources" metadata block in favor of primitive fields, the collection admin wrote the following about their confusion with the messaging in the UI associated with the compound field's conditionally required fields:

many of the question marks include the text "This field will become required if you choose to enter values in one or more of the optional fields." I didn't understand what this meant until I submitted the test metadata and it flagged the blank fields as incomplete. Could we rephrase this to make it clearer? Ideally these fields would get red asterisks if the user selects "yes" to the first question [from the "Derived from Another Dataset" field]. But alternatively, it could just say "this field is required if you selected 'yes' to 'derived from another dataset'" (or similar).

I let them know that:

And lastly, we learned more about why the collection admin is asking depositors to enter so much information about the source dataset. In an earlier comment in this GitHub issue, I wrote about the overlap between this "Metadata About Data Sources" metadata block and other fields in the Citation metadata block, and about the Dataverse community's eagerness to improve how repositories collect and distribute information about research objects related to what's being deposited, something also being discussed with the other repositories that make up the NIH GREI group. We usually prioritize the other research object's persistent ID or URL and how that object is related to what's being deposited.

The collection admin explained that "Users may upload something like county-level estimates of daily air temperature, which could have been derived from a number of different sources. Some users will prefer only to work with data derived from particular datasets (or types of datasets), so we need that information in the metadata so that it will appear if users search for it."

So while some metadata about research objects, like titles, authors, versions, and dates, is relatively reliably accessible in other systems - for example, DataCite maintains and gives access to metadata about the resources it's registered DOIs for - Dataverse needs to be able to include this information in searches, so that if someone searches for the title of a source dataset, the datasets that use that source dataset are returned in the search results. This is related to the discussions I sometimes hear about how systems might take advantage of an awareness of relationships among research objects, and of metadata that's available in other systems, to improve search results.
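To make that concrete, the query the collection admin has in mind is an ordinary keyword search scoped to their collection, along the lines of the Search API call below (the quoted phrase is just an illustrative source dataset title). Once the custom-block fields are indexed, a deposit whose source dataset title field contains that phrase should come back in the results:

curl "https://dataverse.harvard.edu/api/search?q=%22daily+air+temperature%22&subtree=cafe&type=dataset"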

landreev commented 9 months ago

I'll install the blocks and confirm. I haven't read everything above, yet (sorry), but I noticed this:

I haven't been able to see if the changes I've made to the TSV and .properties files result in ... since I haven't been able to spin up branches with custom metadata blocks on AWS

Is the test instance that we configured earlier still running the old version of the block? All you need to do to update the block is log in there and update the block on the command line. 3 commands max., if the solr schema update is needed. I could do that too.

If you ever need to test another block, that should be the easiest way to go - just spin up the master or default branch with the default settings, then log in and install the block once it's running.

jggautier commented 9 months ago

Thanks, yeah, a while back I tried to follow the steps in the guides about updating metadata blocks but failed, and I spoke a bit with Don Sizemore. He tried to help, but I wasn't able to update a metadata block on the test instance on AWS; I wasn't sure why, and I mostly ran out of time troubleshooting. I can try again next time. And if I run into trouble again, hopefully I can ask how you did it?

landreev commented 9 months ago

if I run into trouble again, hopefully I can ask how you did it?

Sure, of course.

landreev commented 9 months ago

The updated blocks have been installed. Please let me know if there are any problems.

jggautier commented 9 months ago

Thanks @landreev!

Looking at the fields, I realize I made mistakes updating the TSV and .properties files and I need to correct those. I can work on that today.

I asked the collection admin to review the new fields but not to save any datasets that use the fields, since that might complicate correcting them, which includes changing the database names of a couple of fields.

landreev commented 9 months ago

OK. Also, let me know if you need me to update the blocks on the test instance.

jggautier commented 9 months ago

Thanks, yeah I agree it would be better to be able to do as much reviewing as possible on the test instance.

After I heard in standup yesterday about improvements to the AWS instances, I thought I'd try to update the metadata fields on that test instance today.

Mind if I try again today?

I could let you know if I run into issues, and then it'll probably be easier if you just do it.

jggautier commented 9 months ago

Hey @landreev. I'm remembering more about why it was tough for me to try to update metadata blocks in AWS instances. I edited one of the .properties files as well as both TSV files, and the steps at https://guides.dataverse.org/en/6.1/admin/metadatacustomization.html#reloading-a-metadata-block don't mention the .properties files, so I'm not sure what to do for that.

Could you update the CAFE metadata blocks in the test instance? I updated both TSV files and one of the .properties files (customCAFEDataLocation.properties).

landreev commented 9 months ago

Adding the update commands here just in case:

scp customCAFEData* to /tmp (for example) on the destination system; scp this script there as well: https://raw.githubusercontent.com/IQSS/dataverse/master/conf/solr/9.3.0/update-fields.sh

There, on the command line:

cd /tmp
curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file customCAFEDataLocation.tsv
(repeat for the other .tsv file)
sudo cp /tmp/customCAFEData*.properties /usr/local/payara6/glassfish/domains/domain1/applications/dataverse/WEB-INF/classes/propertyFiles/

Solr schema update (idk if the changes you've made actually require solr schema update - probably not? - but this would need to be done when installing a new block)

sudo service solr stop
curl "http://localhost:8080/api/admin/index/solr/schema" | sudo /tmp/update-fields.sh /usr/local/solr/server/solr/collection1/conf/schema.xml
sudo service solr start

jggautier commented 9 months ago

Oh great, thanks for the commands @landreev. After reviewing, there's another fix I need to make. I'd like to try these commands to update customCAFEDataSources.tsv. But I see that a few of the changes I made the last time aren't reflected on the test instance, and I'm not sure how to avoid this if I try to make these changes.

For example, the "Geospatial File Type" field should have been replaced by "Includes a Geospatial File," but both files exist on the test instance. And I meant to remove validation from the Spatial Resolution Value field, but it still expects a number. Here's a screenshot showing what I mean.

[Screenshot: the old and new fields both appearing on the test instance, with the Spatial Resolution Value field still expecting a number]

Something similar happened with the customCAFEDataSources metadata block.

I changed the database names of some fields in both metadata blocks. Could that be part of the issue?

[Screenshot: the customCAFEDataSources metadata block on the test instance]

And many of the fields I meant to rename in this customCAFEDataSources metadata block still have the same names. For example, "Title" should be "Source Dataset Title". I prepended "Source Dataset" to many of these fields' labels.

Lastly and unfortunately, in the CAFE collection on Harvard Dataverse where we added the metadata blocks, two datasets were published today that use the new fields, and some of those fields are ones that need to be fixed. I disabled the two metadata blocks, told the collection admin that I disabled them, and let the admin know that the fields shouldn't be used until we can update them. The two published datasets are at https://doi.org/10.7910/DVN/Y1WNU7 and https://doi.org/10.7910/DVN/HYNJSZ.

So after we're done reviewing the metadata blocks and have updated the metadata blocks that are on Harvard Dataverse, will something need to be done to fix the metadata of these two published datasets?

landreev commented 9 months ago

I didn't restart the application on the test instance after updating the block; I thought it wasn't necessary, but let me do that now. That said, the cafeSpatialResolutionValue field in your updated customCAFEDataLocation.tsv is still a float:

cafeSpatialResolutionValue   Value   The number value indicating the horizontal grid spacing of the raster data (e.g. "30" for 30 meter x 30 meter gridded data)   float   7   #NAME: #VALUE   TRUE   FALSE   FALSE   TRUE   TRUE   FALSE   cafeSpatialResolution   customCAFEDataLocation

so this is the correct behavior - it's still validated as a number. (I assumed that was what you wanted, though, since "degrees-minutes-seconds" was removed from the list of supported units, and all the other units do have to be numbers.)

landreev commented 9 months ago

Please try again now?

landreev commented 9 months ago

... still looks wrong - ?

jggautier commented 9 months ago

I tried again, refreshing the page and trying different browsers, but yeah it does look wrong. Maybe we could talk with some screen sharing later today (on a Slack huddle or over Zoom)?

landreev commented 9 months ago

Did you say you changed some of the database names of the fields? (as opposed to the descriptive labels shown in the UI)

jggautier commented 9 months ago

Yeah, I changed the database names of some of the fields.

landreev commented 9 months ago

That could be why these fields weren't updated - the update mechanism is based on matching the fields by name(?). Let me try erasing the blocks and reinstalling them, instead of updating.

landreev commented 9 months ago

Please take a look now?

jggautier commented 9 months ago

Thanks! That worked!

Here are the remaining issues:

  1. In the "Metadata About Data Location" metadata block, the Spatial Resolution Units field is missing because I used the wrong database name when I updated customCAFEDataLocation.tsv. This is the field that should have the controlled vocab with the dropdown menu of units (where we removed "degrees-minutes-seconds").
  2. The label for the cafeSpatialReferenceSystem field should be renamed from "Spatial Reference System Name" to "Spatial Reference System".
  3. About the field validation: for the cafeSpatialResolutionValue field I meant to use the "text" field type, but like you wrote it's still float. A couple of weeks ago I wrote that we'd remove the validation altogether, but since you think the validation makes sense because we removed "degrees-minutes-seconds" from the options in the cafeSourceDataSpatialResolutionUnit field, I'll change the Spatial Resolution Value field in the other metadata block (customCAFEDataSources.tsv) so that it expects a float, too, and I'll let the collection admin know that we'll make sure the Spatial Resolution Value fields in both metadata blocks expect floats.

I updated the TSV files and the customCAFEDataLocation.properties in this repo's metadatablocks directory, and I'll email the collection admin now about the float field validation.

If he's okay with the validation, too, then:

landreev commented 9 months ago

The block update commands above should work, but apparently only if no database names are changed. There's no clean way to uninstall a block. I did it with database queries; going forward, the recommended way should probably be to junk the instance and spin up a new one. I don't mind erasing and reinstalling this block on this instance though, just let me know when.
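For future reference, the queries involved look roughly like the sketch below. The table and column names are from my reading of the Dataverse schema, so double-check them before running anything, and any datasetfield rows referencing these field types - plus the dataverse_metadatablock rows linking the block to collections - have to be cleaned up first:

-- illustrative only; assumes no datasets or templates are still using the block's fields
DELETE FROM controlledvocabularyvalue
 WHERE datasetfieldtype_id IN (SELECT id FROM datasetfieldtype
  WHERE metadatablock_id = (SELECT id FROM metadatablock WHERE name = 'customCAFEDataLocation'));
DELETE FROM datasetfieldtype
 WHERE metadatablock_id = (SELECT id FROM metadatablock WHERE name = 'customCAFEDataLocation');
DELETE FROM metadatablock WHERE name = 'customCAFEDataLocation';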

landreev commented 9 months ago

Trying to think of how to deal with the already published dataset(s) in prod. using these blocks. I don't think there's any other way than to erase all the existing fields associated with them; then have them re-enter the fields once the updated blocks are reinstalled.

jggautier commented 9 months ago

Okay, I'm waiting to hear back from the collection admin about adding validation to those Spatial Resolution fields so that depositors must enter floats. Although we removed "degrees-minutes-seconds" from the options in the Unit dropdown menus, one of the options is still "Other", where they would enter any unit of measurement, so I'm not sure if depositors will always add floats.

jggautier commented 9 months ago

Hey @landreev. The collection admin said the values should always be floats, so the validation is fine.

I just updated the TSV and .properties files in this repo's metadatablocks directory.

Could you erase and reinstall the two metadata blocks on the test instance today?

landreev commented 9 months ago

OK, will do.

landreev commented 9 months ago

Should be done now.

jggautier commented 9 months ago

I see it. Thanks!

cmbz commented 9 months ago

2024/01/17: resized at 10 during kickoff.

jggautier commented 9 months ago

I made more small changes (field labels and display formats) to both metadata blocks. The TSV and .properties files in this repo's metadatablocks directory are updated again, and I think they're ready to be added to the CAFE collection in Harvard Dataverse.

About the two published datasets that are using fields in the metadata blocks, you wrote:

Trying to think of how to deal with the already published dataset(s) in prod. using these blocks. I don't think there's any other way than to erase all the existing fields associated with [the metadata blocks]; then have [the depositors or curators] re-enter the fields once the updated blocks are reinstalled.

Does that mean that the datasets would be taken back to an unpublished state? I'm trying to think about what we should let the collection admin know.

landreev commented 9 months ago

Does that mean that the datasets would be taken back to an unpublished state?

I was thinking of removing all the CAFE* fields from the 2 datasets, but leaving them published with the other metadata values intact. Then, once the fixed blocks are installed, they would edit and republish them as needed.

If it's necessary to revert them to unpublished drafts, that can be done as well. We are already doing so many hacky/custom things for this effort, might as well. 

But it is necessary to remove ALL the existing CAFE fields, not just the ones that have been modified, because we need to uninstall the blocks before we can install the fixed versions.

jggautier commented 9 months ago

Ah, I get it. I can't imagine that the collection admin would object to having to create a new minor version for each dataset, so hopefully we don't have to revert the published datasets back to unpublished drafts.

I'll email the collection admin now to ask.

landreev commented 9 months ago

These latest block changes, do they need to be applied on the test ec2 instance?

qqmyers commented 9 months ago

@jggautier FWIW: Update-current-version could also be done to avoid a new version (with appropriate permissions).

jggautier commented 9 months ago

Yes, could you apply the latest block changes to the test ec2 instance @landreev?

jggautier commented 9 months ago

Thanks @landreev for applying the changes to the ec2 test instance! I see the changes and everything's working as expected.

I'll let the collection admin know that:

Does that all sound good? I'll wait to hear back from you before I email the collection admin.

landreev commented 9 months ago

That sounds perfect. I'll wait for you to confirm that they are ok w/ all of this, before proceeding to delete any fields.

jggautier commented 9 months ago

Okay, they wrote that they're okay with it.

landreev commented 9 months ago

OK, I'll look into the 2 datasets with the populated fields next. Will report.