Open cessda-bitbucket-importer opened 2 years ago
Original comment by John Shepherdson (GitHub: john-shepherdson).
Used Elasticvue Chrome plugin to search CDC ES staging indices cmmstudy_*
Plain text queries used:
"*elsst": 5,194 results
"\\"vocab\":\"ELSST": 5,208 results
Hence examples in linked document are non-exhaustive.
Original comment by Carsten Thiel (GitHub: schildwaechter).
Can we have some reporting mechanism for Service Owner to consult with SPs to improve? Would that be useful @TainaFSD ?
Original comment by John Shepherdson (GitHub: john-shepherdson).
Usual approach is to create an issue per SP in the MDO issue tracker
Original comment by Carsten Thiel (GitHub: schildwaechter).
Yes, but where do the issues come from?
The above seem to be some manually collected examples, you say non-exhaustive, so I am wondering what form of reporting/overview Taina would need to start creating these tickets.
Original comment by Taina Jääskeläinen.
We would need statistics by SP of
It might be easiest if we just ask SPs some very specific questions about keyword use, including what they have in the vocab and vocabUri attributes in their endpoint and whether the legacy metadata has been updated in this respect. Metadata contacts are generally pretty willing to answer specific questions if put in an online Google Spreadsheet. Will be talking about this with Sharon and her team.
Original comment by John Shepherdson (GitHub: john-shepherdson).
Not on critical path for release of v3.0.0
Original comment by Kostas Papagiannopoulos (GitHub: kpapag).
As in #424, aggregating by keywords.vocab gives the following. (The first doc_count is the number of keywords, the second is the number of studies.)
{ "key": "HASSET", "doc_count": 393204, "keywords_in_studies": { "doc_count": 8622 } },
{ "key": "YEAR", "doc_count": 47049, "keywords_in_studies": { "doc_count": 6411 } },
{ "key": "ELSST", "doc_count": 40017, "keywords_in_studies": { "doc_count": 5859 } },
{ "key": "", "doc_count": 32908, "keywords_in_studies": { "doc_count": 5025 } },
{ "key": "GEO", "doc_count": 10246, "keywords_in_studies": { "doc_count": 1127 } },
{ "key": "ELSST - The European Language Social Science Thesaurus", "doc_count": 8128, "keywords_in_studies": { "doc_count": 1598 } },
{ "key": "TheSoz", "doc_count": 2528, "keywords_in_studies": { "doc_count": 528 } },
{ "key": "MeSH", "doc_count": 586, "keywords_in_studies": { "doc_count": 74 } },
{ "key": "UniData Category System", "doc_count": 574, "keywords_in_studies": { "doc_count": 124 } },
{ "key": "ELSS Thesaurus", "doc_count": 198, "keywords_in_studies": { "doc_count": 50 } },
{ "key": "none", "doc_count": 178, "keywords_in_studies": { "doc_count": 57 } },
{ "key": "ELSST Thesaurus", "doc_count": 67, "keywords_in_studies": { "doc_count": 16 } },
{ "key": "European Language Social Science Thesaurus (ELSST)", "doc_count": 60, "keywords_in_studies": { "doc_count": 5 } },
{ "key": "European Language Social Science Thesaurus", "doc_count": 55, "keywords_in_studies": { "doc_count": 6 } },
{ "key": "GEMET", "doc_count": 33, "keywords_in_studies": { "doc_count": 21 } },
{ "key": "ALLFO", "doc_count": 32, "keywords_in_studies": { "doc_count": 26 } },
{ "key": "INSPIRE Spatial Data Themes", "doc_count": 22, "keywords_in_studies": { "doc_count": 11 } },
{ "key": "ELSST ", "doc_count": 12, "keywords_in_studies": { "doc_count": 4 } },
{ "key": "EnvThes", "doc_count": 8, "keywords_in_studies": { "doc_count": 2 } },
{ "key": "GCMD", "doc_count": 6, "keywords_in_studies": { "doc_count": 5 } },
{ "key": "ELLST", "doc_count": 2, "keywords_in_studies": { "doc_count": 2 } },
{ "key": "N/A", "doc_count": 2, "keywords_in_studies": { "doc_count": 2 } },
{ "key": "YSO", "doc_count": 2, "keywords_in_studies": { "doc_count": 1 } },
{ "key": "ELSSt Thesaurus", "doc_count": 1, "keywords_in_studies": { "doc_count": 1 } },
{ "key": "ICD-10", "doc_count": 1, "keywords_in_studies": { "doc_count": 1 } },
{ "key": "Social protection expenditure contain: social benefits, which consist of transfers, in cash or in kind, to households and individuals to relieve them from the burden of a defined set of risks or needs", "doc_count": 1, "keywords_in_studies": { "doc_count": 1 } },
{ "key": "The indicator is defined as the percentage of population with an enforced lack of at least three out of nine material deprivation items in the 'economic strain and durables' dimension.", "doc_count": 1, "keywords_in_studies": { "doc_count": 1 } }
Original comment by Kostas Papagiannopoulos (GitHub: kpapag).
Query for results:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "keywordsTotal": {
      "nested": {
        "path": "keywords"
      },
      "aggs": {
        "keywords_vocab": {
          "terms": {
            "field": "keywords.vocab",
            "size": 100000
          },
          "aggs": {
            "keywords_in_studies": {
              "reverse_nested": {}
            }
          }
        }
      }
    }
  },
  "size": 0
}
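For anyone re-running this later, here is a minimal Python sketch of turning the aggregation response into one row per vocab value. It assumes the response has the usual Elasticsearch shape for the query above (buckets under `aggregations.keywordsTotal.keywords_vocab`). The function name `vocab_rows` and the trimmed `sample` response are illustrative only and are not part of any CDC code.

```python
from typing import Any


def vocab_rows(response: dict[str, Any]) -> list[tuple[str, int, int]]:
    """Flatten the response into (vocab, keyword_count, study_count) rows."""
    buckets = response["aggregations"]["keywordsTotal"]["keywords_vocab"]["buckets"]
    return [
        (b["key"], b["doc_count"], b["keywords_in_studies"]["doc_count"])
        for b in buckets
    ]


# Sample response trimmed to two buckets copied from the results in this thread.
sample = {
    "aggregations": {
        "keywordsTotal": {
            "keywords_vocab": {
                "buckets": [
                    {"key": "HASSET", "doc_count": 393204,
                     "keywords_in_studies": {"doc_count": 8622}},
                    {"key": "ELSST ", "doc_count": 12,
                     "keywords_in_studies": {"doc_count": 4}},
                ]
            }
        }
    }
}

for vocab, keywords, studies in vocab_rows(sample):
    # repr() makes stray whitespace such as the trailing space in "ELSST " visible.
    print(f"{vocab!r}: {keywords} keywords in {studies} studies")
```

Printing the key with `repr()` is a deliberate choice: values that differ only by whitespace are otherwise easy to miss.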
Original comment by Taina Jääskeläinen.
Oh dear, there is still too much variation in ‘vocab’ relating to what name for ELSST has been used. I need to start making issues.
Original comment by Taina Jääskeläinen.
Most ELSST name variations seemed to be produced by two Publishers. I’ve now made issues for them.
However, I could not find the studies that had these vocab values:
ICD-10
N/A
Kostas, could you find the links to these particular studies for me? Or their study number/PID?
Could not find any examples of “none” as vocab value (too many results). Kostas, could you take a look and see which Publisher uses this value? Just the name of the Publisher is enough, no need to find the specific studies.
(Just a note to Taina herself: YEAR is used by UKDS when entering the year of the study, so therefore not ELSST keywords. Not sure why they use the year as a keyword…)
Original comment by Kostas Papagiannopoulos (GitHub: kpapag).
Hello @TainaFSD . Here are the results for you:
"key": "N/A"
"key": "ICD-10"
"key": "The indicator is defined as the percentage of population with an enforced lack of at least three out of nine material deprivation items in the 'economic strain and durables' dimension."
"key": "Social protection expenditure contain: social benefits, which consist of transfers, in cash or in kind, to households and individuals to relieve them from the burden of a defined set of risks or needs"
"key": "none"
61 records. Publisher: SODHA. Example:
Since I’m also working for SoDaNet, I can inform the admins about the 2 records that need updating. If it’s OK with you, please let me know the correct value to set. Thanks!
Original comment by Kostas Papagiannopoulos (GitHub: kpapag).
Here’s the code to run, for any future need.
{
  "query": {
    "nested": {
      "path": "keywords",
      "query": {
        "bool": {
          "must": [
            { "match": { "keywords.vocab": "ICD-10" } }
          ]
        }
      },
      "score_mode": "avg"
    }
  }
}
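To reuse the query for other vocab values, a small Python helper along these lines could build the request body. This is only a sketch: the function name `vocab_query` is made up for illustration, and how `match` behaves for a given value depends on how `keywords.vocab` is mapped in the index, which is not stated in this thread.

```python
import json
from typing import Any


def vocab_query(vocab: str) -> dict[str, Any]:
    """Build the nested query above for an arbitrary keywords.vocab value."""
    return {
        "query": {
            "nested": {
                "path": "keywords",
                "query": {
                    "bool": {
                        "must": [{"match": {"keywords.vocab": vocab}}]
                    }
                },
                "score_mode": "avg",
            }
        }
    }


# Print request bodies for the values asked about in this thread.
for value in ("ICD-10", "N/A", "none"):
    print(json.dumps(vocab_query(value), indent=2))
```

The printed JSON can be pasted into Elasticvue or sent to the `_search` endpoint of the relevant index.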
Original comment by John Shepherdson (GitHub: john-shepherdson).
@TainaFSD Did you see Kostas' comment re him feeding correct keyword back to SoDaNet?
Original comment by Taina Jääskeläinen.
@kpapag The vocab attribute content should be ‘ELSST’.
There are also other metadata records for SoDaNet that need fixing. I’ve made an issue about this in the metadata.office issue tracker (https://github.com/cessda/cessda.metadata.office/issues/112) and have sent Apostolos an email about this.
SoDaNet has about 70 datasets in English and more in Greek where the ‘vocab’ attribute has one of the following non-harmonised vocab values:
To be amended either in the metadata or in the endpoint.
The datasets can be found in CDC by 1) entering this query in quotation marks into the search box: “European Language Social Science Thesaurus” or “ELSS Thesaurus”, and 2) choosing SoDaNet as the Publisher in the Publisher filter.
Original comment by Taina Jääskeläinen.
The next steps for this issue:
I will chase those SPs for which I made the metadata issues for variations of ELSST name. Once they have amended their metadata/end-point, a new run can be made to see if there are still any variations.
If the vocab element contains the name of some other vocabulary, that is not a problem. The issue is particularly the variations in ELSST name since we are hoping to produce functionalities based on ELSST keywords.
I will therefore assign this issue to myself.
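For the follow-up runs, a rough Python sketch of flagging vocab values that look like ELSST name variations while ignoring other vocabularies. The marker list is a heuristic derived only from the variants seen in the aggregation earlier in this thread, so new spellings could slip through. The names `ELSST_MARKERS`, `is_elsst_variant` and `remaining_variants` are illustrative.

```python
# Lower-case fragments that suggest a vocab value refers to ELSST.
# Heuristic only: based on the variants observed in this thread.
ELSST_MARKERS = (
    "elsst",
    "ellst",
    "elss thesaurus",
    "european language social science thesaurus",
)


def is_elsst_variant(vocab: str) -> bool:
    """True if vocab looks like ELSST but is not exactly the harmonised 'ELSST'."""
    if vocab == "ELSST":
        return False
    normalised = " ".join(vocab.lower().split())  # collapse and trim whitespace
    return any(marker in normalised for marker in ELSST_MARKERS)


def remaining_variants(rows):
    """Keep only (vocab, keyword_count, study_count) rows that are ELSST variants."""
    return [row for row in rows if is_elsst_variant(row[0])]


# A few rows copied from the aggregation results in this thread.
rows = [
    ("ELSST", 40017, 5859),
    ("ELSST ", 12, 4),
    ("HASSET", 393204, 8622),
    ("ELLST", 2, 2),
    ("ELSS Thesaurus", 198, 50),
]

for vocab, keywords, studies in remaining_variants(rows):
    print(f"{vocab!r}: {keywords} keywords in {studies} studies")
```

An empty result after a new run would suggest the known variations have been cleaned up, although it would not prove that no new spelling has appeared.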
Original comment by Kostas Papagiannopoulos (GitHub: kpapag).
Thank you, Taina. I will forward your comment to the SoDaNet team, asap.
Original comment by Taina Jääskeläinen.
Related to #397.
Original comment by Taina Jääskeläinen.
Take into account the results from the ELSST use questionnaire:
Assigning to myself for future reference (i.e. for assigning to new Service Owner) and for making issues in the metadata office issue tracker.
Original comment by Taina Jääskeläinen.
Issues in metadata.office issue tracker:
SND and CROSSDA issue creation is on hold till DDI 2.6 profiles are released at some future point.
Original comment by John Shepherdson (GitHub: john-shepherdson).
APIS endpoint is unreachable, MoTech will contact them to discuss.
Original report on BitBucket by John Shepherdson (GitHub: john-shepherdson).
Carsten wants to know how SPs use keywords. For each Publisher, can you get examples of the keyword declarations? For example (FSD3244):
No need to run exhaustive checks, but be aware that some Publishers may use 2 or more different formats/sources for their keywords. The most important variants to look out for are references to ELSST.
Run some queries against ElasticSearch to get the details, and put the examples here:
https://docs.google.com/spreadsheets/d/17LjEhQwQ0fv2CBUBN8trc8SuyShDRmpUMYaGQc8yXNU/edit?usp=sharing
See also issue #424