cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Use of keywords by SPs - formats #425

Open cessda-bitbucket-importer opened 2 years ago

cessda-bitbucket-importer commented 2 years ago

Original report on BitBucket by John Shepherdson (GitHub: john-shepherdson).


Carsten wants to know how SPs use keywords. For each Publisher, can you get examples of the keyword declarations? For example (FSD3244):

"vocab": "ELSST",
            "vocabUri": "https://elsst.cessda.eu/id",
            "id": "media_literacy",
            "term": "media literacy"

No need to run exhaustive checks, but be aware that some Publishers may use 2 or more different formats/sources for their keywords. The most important variants to look out for are references to ELSST.

Run some queries against ElasticSearch to get the details, and put the examples here:

https://docs.google.com/spreadsheets/d/17LjEhQwQ0fv2CBUBN8trc8SuyShDRmpUMYaGQc8yXNU/edit?usp=sharing

See also issue #424

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Used Elasticvue Chrome plugin to search CDC ES staging indices cmmstudy_*

Plain text queries used: "*elsst" (5,194 results) and "\"vocab\":\"ELSST" (5,208 results).

Hence examples in linked document are non-exhaustive.

cessda-bitbucket-importer commented 2 years ago

Original comment by Carsten Thiel (GitHub: schildwaechter).


Can we have some reporting mechanism that the Service Owner can use when consulting SPs about improvements? Would that be useful, @TainaFSD?

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Usual approach is to create an issue per SP in the MDO issue tracker

cessda-bitbucket-importer commented 2 years ago

Original comment by Carsten Thiel (GitHub: schildwaechter).


Yes, but where do the issues come from?

The above seem to be some manually collected examples, you say non-exhaustive, so I am wondering what form of reporting/overview Taina would need to start creating these tickets.

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


We would need statistics, by SP, on the use of the keyword element and its attributes. Looking at John’s examples, I’m thinking it might be hard to get the statistics to portray the situation correctly. For instance, FSD uses ELSST for metadata in English but uses the national thesaurus for Finnish metadata. ELSST terms can be introduced to the Finnish metadata by machine, if needed.

It might be easiest if we just ask SPs some very specific questions about keyword use, including what they have in the vocab and vocabUri attributes in their endpoint and whether the legacy metadata has been updated in this respect. Metadata contacts are generally pretty willing to answer specific questions if they are put in an online Google Spreadsheet. I will be talking about this with Sharon and her team.

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


As discussed last week

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Not on critical path for release of v3.0.0

cessda-bitbucket-importer commented 2 years ago

Original comment by Kostas Papagiannopoulos (GitHub: kpapag).


As in #424, aggregating by keywords.vocab, we get the following. (The first doc_count is the number of keywords; the second, inside keywords_in_studies, is the number of studies.)

                    {
                        "key": "HASSET",
                        "doc_count": 393204,
                        "keywords_in_studies": {
                            "doc_count": 8622
                        }
                    },
                    {
                        "key": "YEAR",
                        "doc_count": 47049,
                        "keywords_in_studies": {
                            "doc_count": 6411
                        }
                    },
                    {
                        "key": "ELSST",
                        "doc_count": 40017,
                        "keywords_in_studies": {
                            "doc_count": 5859
                        }
                    },
                    {
                        "key": "",
                        "doc_count": 32908,
                        "keywords_in_studies": {
                            "doc_count": 5025
                        }
                    },
                    {
                        "key": "GEO",
                        "doc_count": 10246,
                        "keywords_in_studies": {
                            "doc_count": 1127
                        }
                    },
                    {
                        "key": "ELSST - The European Language Social Science Thesaurus",
                        "doc_count": 8128,
                        "keywords_in_studies": {
                            "doc_count": 1598
                        }
                    },
                    {
                        "key": "TheSoz",
                        "doc_count": 2528,
                        "keywords_in_studies": {
                            "doc_count": 528
                        }
                    },
                    {
                        "key": "MeSH",
                        "doc_count": 586,
                        "keywords_in_studies": {
                            "doc_count": 74
                        }
                    },
                    {
                        "key": "UniData Category System",
                        "doc_count": 574,
                        "keywords_in_studies": {
                            "doc_count": 124
                        }
                    },
                    {
                        "key": "ELSS Thesaurus",
                        "doc_count": 198,
                        "keywords_in_studies": {
                            "doc_count": 50
                        }
                    },
                    {
                        "key": "none",
                        "doc_count": 178,
                        "keywords_in_studies": {
                            "doc_count": 57
                        }
                    },
                    {
                        "key": "ELSST Thesaurus",
                        "doc_count": 67,
                        "keywords_in_studies": {
                            "doc_count": 16
                        }
                    },
                    {
                        "key": "European Language Social Science Thesaurus (ELSST)",
                        "doc_count": 60,
                        "keywords_in_studies": {
                            "doc_count": 5
                        }
                    },
                    {
                        "key": "European Language Social Science Thesaurus",
                        "doc_count": 55,
                        "keywords_in_studies": {
                            "doc_count": 6
                        }
                    },
                    {
                        "key": "GEMET",
                        "doc_count": 33,
                        "keywords_in_studies": {
                            "doc_count": 21
                        }
                    },
                    {
                        "key": "ALLFO",
                        "doc_count": 32,
                        "keywords_in_studies": {
                            "doc_count": 26
                        }
                    },
                    {
                        "key": "INSPIRE Spatial Data Themes",
                        "doc_count": 22,
                        "keywords_in_studies": {
                            "doc_count": 11
                        }
                    },
                    {
                        "key": "ELSST ",
                        "doc_count": 12,
                        "keywords_in_studies": {
                            "doc_count": 4
                        }
                    },
                    {
                        "key": "EnvThes",
                        "doc_count": 8,
                        "keywords_in_studies": {
                            "doc_count": 2
                        }
                    },
                    {
                        "key": "GCMD",
                        "doc_count": 6,
                        "keywords_in_studies": {
                            "doc_count": 5
                        }
                    },
                    {
                        "key": "ELLST",
                        "doc_count": 2,
                        "keywords_in_studies": {
                            "doc_count": 2
                        }
                    },
                    {
                        "key": "N/A",
                        "doc_count": 2,
                        "keywords_in_studies": {
                            "doc_count": 2
                        }
                    },
                    {
                        "key": "YSO",
                        "doc_count": 2,
                        "keywords_in_studies": {
                            "doc_count": 1
                        }
                    },
                    {
                        "key": "ELSSt Thesaurus",
                        "doc_count": 1,
                        "keywords_in_studies": {
                            "doc_count": 1
                        }
                    },
                    {
                        "key": "ICD-10",
                        "doc_count": 1,
                        "keywords_in_studies": {
                            "doc_count": 1
                        }
                    },
                    {
                        "key": "Social protection expenditure contain: social benefits, which consist of transfers, in cash or in kind, to households and individuals to relieve them from the burden of a defined set of risks or needs",
                        "doc_count": 1,
                        "keywords_in_studies": {
                            "doc_count": 1
                        }
                    },
                    {
                        "key": "The indicator is defined as the percentage of population with an enforced lack of at least three out of nine material deprivation items in the 'economic strain and durables' dimension.",
                        "doc_count": 1,
                        "keywords_in_studies": {
                            "doc_count": 1
                        }
                    }
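To see how much keyword volume the ELSST name variations account for, the buckets above can be merged under one label. The following is a minimal Python sketch, not part of the CDC tooling. The alias list is an assumption that treats the spellings seen in this aggregation (including "ELSS Thesaurus" and "ELLST") as meaning ELSST, and the function names are hypothetical. Only keyword counts are summed, because a study that uses two variants would be counted twice if the study counts were added.

```python
from collections import Counter

# Assumption: these spellings, taken from the aggregation output in this
# thread, all refer to ELSST. Compared after strip() and casefold().
ELSST_ALIASES = {
    "elsst",
    "elsst - the european language social science thesaurus",
    "elss thesaurus",
    "elsst thesaurus",
    "european language social science thesaurus (elsst)",
    "european language social science thesaurus",
    "ellst",
}


def canonical_vocab(label):
    """Map a raw vocab value to 'ELSST' if it is a known variant, else trim it."""
    cleaned = label.strip()
    return "ELSST" if cleaned.casefold() in ELSST_ALIASES else cleaned


def merge_keyword_counts(buckets):
    """Sum keyword-level doc_count per canonical vocab name."""
    totals = Counter()
    for bucket in buckets:
        totals[canonical_vocab(bucket["key"])] += bucket["doc_count"]
    return totals


# Keyword counts copied from the buckets listed in this thread.
buckets = [
    {"key": "HASSET", "doc_count": 393204},
    {"key": "ELSST", "doc_count": 40017},
    {"key": "ELSST - The European Language Social Science Thesaurus", "doc_count": 8128},
    {"key": "TheSoz", "doc_count": 2528},
    {"key": "ELSS Thesaurus", "doc_count": 198},
    {"key": "ELSST Thesaurus", "doc_count": 67},
    {"key": "European Language Social Science Thesaurus (ELSST)", "doc_count": 60},
    {"key": "European Language Social Science Thesaurus", "doc_count": 55},
    {"key": "ELSST ", "doc_count": 12},
    {"key": "ELLST", "doc_count": 2},
    {"key": "ELSSt Thesaurus", "doc_count": 1},
]

totals = merge_keyword_counts(buckets)
print(totals["ELSST"])  # 48540 keywords across all ELSST spellings
```

An explicit alias set is used rather than fuzzy matching so that a new, unexpected spelling shows up as its own row instead of being silently merged.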

cessda-bitbucket-importer commented 2 years ago

Original comment by Kostas Papagiannopoulos (GitHub: kpapag).


Query for results:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "keywordsTotal": {
      "nested": {
        "path": "keywords"
      },
      "aggs": {
        "keywords_vocab": {
          "terms": {
            "field": "keywords.vocab",
            "size": 100000
          },
          "aggs": {
            "keywords_in_studies": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  },
  "size": 0
}
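The response to this query nests the buckets under the aggregation names used above (keywordsTotal, then keywords_vocab). A small Python sketch for turning that response into rows that can be pasted into a spreadsheet might look like this. The function name is hypothetical, and the sample response is trimmed to two buckets quoted earlier in this thread.

```python
def flatten_vocab_buckets(response):
    """Return (vocab, keyword_count, study_count) rows from the aggregation response."""
    buckets = response["aggregations"]["keywordsTotal"]["keywords_vocab"]["buckets"]
    return [
        (b["key"], b["doc_count"], b["keywords_in_studies"]["doc_count"])
        for b in buckets
    ]


# Trimmed sample response, using two buckets from the results in this thread.
sample_response = {
    "aggregations": {
        "keywordsTotal": {
            "keywords_vocab": {
                "buckets": [
                    {
                        "key": "HASSET",
                        "doc_count": 393204,
                        "keywords_in_studies": {"doc_count": 8622},
                    },
                    {
                        "key": "ELSST",
                        "doc_count": 40017,
                        "keywords_in_studies": {"doc_count": 5859},
                    },
                ]
            }
        }
    }
}

rows = flatten_vocab_buckets(sample_response)
for vocab, keywords, studies in rows:
    # Tab-separated so the output can be pasted straight into a spreadsheet.
    print(f"{vocab}\t{keywords}\t{studies}")
```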

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


Oh dear, there is still too much variation in ‘vocab’ relating to what name for ELSST has been used. I need to start making issues.

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


Most ELSST name variations seemed to be produced by two Publishers. I’ve now made issues for them.

However, could not find the studies that had

Kostas, could you find the links to these particular studies for me? Or their study number/PID?

Could not find any examples of “none” as vocab value (too many results). Kostas, could you take a look and see which Publisher uses this value? Just the name of the Publisher is enough, no need to find the specific studies.

(Just a note to Taina herself: YEAR is used by UKDS when entering the year of the study, so therefore not ELSST keywords. Not sure why they use the year as a keyword…)

cessda-bitbucket-importer commented 2 years ago

Original comment by Kostas Papagiannopoulos (GitHub: kpapag).


Hello @TainaFSD. Here are the results for you:

"key": "N/A"

https://datacatalogue.cessda.eu/detail?lang=en&q=ae75c56c48b830d35a2e545da5359389bd0a0c082261f73ad8e5eb6f113097a8

https://datacatalogue-staging.cessda.eu/detail?lang=en&q=9d8ab8373ffa3d85ec920ba2b50a7bf65510fac774937fe263ef0c35323feff6

"key": "ICD-10"

https://datacatalogue.cessda.eu/detail?lang=en&q=157137d08a7ed80733e17bdfbb27da1baf87c2d21986ea64fd099f96fc8e8143

"key": "The indicator is defined as the percentage of population with an enforced lack of at least three out of nine material deprivation items in the 'economic strain and durables' dimension."

https://datacatalogue.cessda.eu/detail?lang=en&q=ff2c4e5641edf5f40250d58fde6a9a9e641921672aa79347b619a36524dae9d4

"key": "Social protection expenditure contain: social benefits, which consist of transfers, in cash or in kind, to households and individuals to relieve them from the burden of a defined set of risks or needs"

https://datacatalogue.cessda.eu/detail?lang=en&q=07e997d7214b54be5fe59845b2d6b8196a51b77944b7953ea022c1877835df9c

"key": "none"

61 records. Publisher: SODHA. Example:

https://datacatalogue.cessda.eu/detail?lang=en&q=37bf647c2a2a69865dea17110571686a01ce4ad4271a0c38d0cc06fdf37cad9e


Since I’m also working for SoDaNet, I can inform the admins about the 2 records that need updating. If that’s OK with you, please let me know the correct value to set. Thanks!

cessda-bitbucket-importer commented 2 years ago

Original comment by Kostas Papagiannopoulos (GitHub: kpapag).


Here’s the code to run, for any future need.

{
  "query": {
    "nested": {
      "path": "keywords",
      "query": {
        "bool": {
          "must": [
            { "match": { "keywords.vocab": "ICD-10" } }
          ]
        }
      },
      "score_mode": "avg"
    }
  }
}
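To reuse the query above for other vocab values (for example "none" or "N/A"), the request body can be generated rather than edited by hand. This is a minimal Python sketch with a hypothetical helper name. It keeps the match query exactly as posted. Whether match behaves as an exact comparison depends on how keywords.vocab is mapped in the index, which this thread does not show.

```python
import json


def vocab_query(vocab):
    """Build the nested query for studies having a keyword whose vocab matches."""
    return {
        "query": {
            "nested": {
                "path": "keywords",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"keywords.vocab": vocab}}
                        ]
                    }
                },
                "score_mode": "avg",
            }
        }
    }


# Print one request body per vocab value of interest.
for value in ["ICD-10", "N/A", "none"]:
    print(json.dumps(vocab_query(value), indent=2))
```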

cessda-bitbucket-importer commented 2 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


@‌TainaFSD Did you see Kostas' comment re him feeding correct keyword back to SoDaNet?

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


@kpapag The vocab attribute content should be ‘ELSST’.

There are also other metadata records for SoDaNet that need fixing. I’ve made an issue about this in the metadata.office issue tracker (https://github.com/cessda/cessda.metadata.office/issues/112) and have sent Apostolos an email about this.

SoDaNet has about 70 datasets in English and more in Greek where the ‘vocab’ attribute has one of the following non-harmonised vocab values:

To be amended either in the metadata or in the endpoint.

The datasets can be found in CDC by 1) entering “European Language Social Science Thesaurus” or “ELSS Thesaurus”, in quotation marks, into the search box and 2) choosing SoDaNet in the Publisher filter.

cessda-bitbucket-importer commented 2 years ago

Original comment by Taina Jääskeläinen.


The next steps for this issue:

I will chase the SPs for which I made metadata issues about variations of the ELSST name. Once they have amended their metadata or endpoint, a new run can be made to see if there are still any variations.

If the vocab element contains the name of some other vocabulary, that is not a problem. The issue is specifically the variations in the ELSST name, since we are hoping to build functionality based on ELSST keywords.

I will therefore assign this issue to myself.

cessda-bitbucket-importer commented 2 years ago

Original comment by Kostas Papagiannopoulos (GitHub: kpapag).


Thank you, Taina. I will forward your comment to the SoDaNet team, asap.

cessda-bitbucket-importer commented 1 year ago

Original comment by Taina Jääskeläinen.


Related to #397.

cessda-bitbucket-importer commented 1 year ago

Original comment by Taina Jääskeläinen.


Take into account the results from the ELSST use questionnaire:

https://docs.google.com/spreadsheets/d/1mUeH2gW_eZ3taigPcjK0Z9vK8R_zKSBY2sE8uD1ZqoE/edit#gid=2128163204

Assigning to myself for future reference (i.e. for assigning to the new Service Owner) and for making issues in the metadata office issue tracker.

cessda-bitbucket-importer commented 1 year ago

Original comment by Taina Jääskeläinen.


Issues in metadata.office issue tracker:

SND and CROSSDA issue creation is on hold till DDI 2.6 profiles are released at some future point.

cessda-bitbucket-importer commented 1 year ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


APIS endpoint is unreachable, MoTech will contact them to discuss.