cyfronet-fid / eosc-search-service

GNU General Public License v3.0

Prepare Solr schema for Marketplace Catalogues #1123

agpul closed this issue 9 months ago

agpul commented 10 months ago

Acceptance criteria:

This issue is blocked by: https://github.com/cyfronet-fid/marketplace/issues/3068

wiktorflorian commented 9 months ago

My proposal to fit the Figma, based on the C. v4.00 EOSC Multi-Provider Cat_409f63efbd6a44ffac0406c8c590dff5-181023-1041-6200.pdf and the Ruby schema here, is the following output schema:

catalogue_output_schema = {
    "abbreviation": "string",
    "description": "string",
    "id": "string",
    "keywords": "array<string>",
    "keywords_tg": "array<string>",
    "title": "string",
    "type": "string",
    "scientific_domain": "array<string>",
    "scientific_subdomain": "array<string>",
    "structure_type": "array<string>",
    "legal_status": "array<string>",
}

This will be converted into the following Solr schema:

<field name="abbreviation" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" termPositions="true" termVectors="true" termOffsets="true" required="true" useDocValuesAsStored="true"/>
<field name="keywords" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="keywords_tg" type="text_general" indexed="true" stored="true" termPositions="true" termVectors="true" termOffsets="true" required="false"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="title_str" type="lowercase" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="scientific_domain" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="scientific_subdomain" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="structure_type" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="legal_status" type="strings" indexed="true" useDocValuesAsStored="true"/>
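
To keep the two listings in sync, the field names can be compared mechanically. The following is only a sketch: the `<fields>` wrapper element and the helper name `solr_field_names` are assumptions added for illustration, and the field definitions are copied from this comment.

```python
import xml.etree.ElementTree as ET

# Field definitions copied from the proposal; the <fields> wrapper is added
# here only so the snippet parses as a single XML document (an assumption).
SOLR_FIELDS_XML = """
<fields>
  <field name="abbreviation" type="text_general" indexed="true" stored="true" multiValued="false"/>
  <field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
  <field name="id" type="string" termPositions="true" termVectors="true" termOffsets="true" required="true" useDocValuesAsStored="true"/>
  <field name="keywords" type="strings" indexed="true" useDocValuesAsStored="true"/>
  <field name="keywords_tg" type="text_general" indexed="true" stored="true" termPositions="true" termVectors="true" termOffsets="true" required="false"/>
  <field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
  <field name="title_str" type="lowercase" indexed="true" stored="true"/>
  <field name="type" type="string" indexed="true" stored="true"/>
  <field name="scientific_domain" type="strings" indexed="true" useDocValuesAsStored="true"/>
  <field name="scientific_subdomain" type="strings" indexed="true" useDocValuesAsStored="true"/>
  <field name="structure_type" type="strings" indexed="true" useDocValuesAsStored="true"/>
  <field name="legal_status" type="strings" indexed="true" useDocValuesAsStored="true"/>
</fields>
"""

catalogue_output_schema = {
    "abbreviation": "string",
    "description": "string",
    "id": "string",
    "keywords": "array<string>",
    "keywords_tg": "array<string>",
    "title": "string",
    "type": "string",
    "scientific_domain": "array<string>",
    "scientific_subdomain": "array<string>",
    "structure_type": "array<string>",
    "legal_status": "array<string>",
}


def solr_field_names(xml_text: str) -> set[str]:
    """Collect the name attribute of every <field> element."""
    root = ET.fromstring(xml_text.strip())
    return {field.attrib["name"] for field in root.iter("field")}


solr_names = solr_field_names(SOLR_FIELDS_XML)
schema_names = set(catalogue_output_schema)

# Fields the output produces but Solr would not know about.
missing_in_solr = sorted(schema_names - solr_names)
# Fields Solr defines that the output never fills directly.
solr_only = sorted(solr_names - schema_names)

print("missing in Solr:", missing_in_solr)
print("Solr only:", solr_only)
```

With the listings as written, the only Solr-only field is `title_str`, so it would have to be filled some other way, for example by a `copyField` from `title` (an assumption; the thread does not say).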

Currently, the legal_status field is missing, but it will be added in the future. However, there is no obstacle to adding it to Solr at this time.

The Figma includes an explanation for each field (see Screenshot 2024-01-15 at 15 11 36):

NI4OS -> abbreviation
National Ini... -> title
Catalog -> type
Scientific Domain -> scientific_domain
Scientific Subdomain -> scientific_subdomain
Structure Type -> structure_type
Legal Status -> legal_status (not available right now)
National Initiatives for Open Science..... -> description

I also need a decision on which fields not included in the Figma will be used as filters and added to the schema.

The following fields, which are both in MP, are currently unused:

According to the C. v4.00 EOSC Multi-Provider Catalogue Profile, there should also be:

wiktorflorian commented 9 months ago

The Solr schema, limited to the fields shown in the Figma, is as follows:

<field name="abbreviation" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="id" type="string" termPositions="true" termVectors="true" termOffsets="true" required="true" useDocValuesAsStored="true"/>
<field name="keywords" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="keywords_tg" type="text_general" indexed="true" stored="true" termPositions="true" termVectors="true" termOffsets="true" required="false"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="scientific_domains" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="structure_type" type="strings" indexed="true" useDocValuesAsStored="true"/>
<field name="legal_status" type="strings" indexed="true" useDocValuesAsStored="true"/>
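
This version differs from the earlier proposal in a few field names. A quick set comparison makes the difference explicit; this is just an illustrative sketch, with the names copied by hand from the two Solr listings in this thread.

```python
# Field names from the first Solr proposal in this thread.
first_proposal = {
    "abbreviation", "description", "id", "keywords", "keywords_tg",
    "title", "title_str", "type", "scientific_domain",
    "scientific_subdomain", "structure_type", "legal_status",
}

# Field names from the final Solr proposal.
final_proposal = {
    "abbreviation", "description", "id", "keywords", "keywords_tg",
    "title", "type", "scientific_domains", "structure_type", "legal_status",
}

dropped = sorted(first_proposal - final_proposal)
added = sorted(final_proposal - first_proposal)

print("dropped:", dropped)
print("added:", added)
```

The comparison shows that `scientific_domain` and `scientific_subdomain` no longer appear, a plural `scientific_domains` appears instead, and `title_str` is gone.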

The output is as follows:

catalogue_output_schema = {
    "abbreviation": "string",
    "description": "string",
    "id": "string",
    "keywords": "array<string>",
    "keywords_tg": "array<string>",
    "title": "string",
    "type": "string",
    "scientific_domains": "array<string>",
    "structure_type": "array<string>",
    "legal_status": "array<string>",
}
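
A document produced by the transformation could be checked against this schema before it is sent to Solr. The sketch below is one possible way to do that and is not taken from the repository: the function name `validate_catalogue`, the choice to treat every field as required, and the sample values are all assumptions made for illustration.

```python
catalogue_output_schema = {
    "abbreviation": "string",
    "description": "string",
    "id": "string",
    "keywords": "array<string>",
    "keywords_tg": "array<string>",
    "title": "string",
    "type": "string",
    "scientific_domains": "array<string>",
    "structure_type": "array<string>",
    "legal_status": "array<string>",
}


def validate_catalogue(doc: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the document fits."""
    problems = []
    for field, expected in schema.items():
        # Assumption: every schema field is required in the output.
        if field not in doc:
            problems.append(f"missing field: {field}")
            continue
        value = doc[field]
        if expected == "string":
            ok = isinstance(value, str)
        elif expected == "array<string>":
            ok = isinstance(value, list) and all(isinstance(v, str) for v in value)
        else:
            ok = False  # unknown type label in the schema
        if not ok:
            problems.append(f"wrong type for {field}: expected {expected}")
    for field in doc:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


# Hypothetical sample document; every value is a placeholder, not real data.
sample = {
    "abbreviation": "EX",
    "description": "Placeholder description",
    "id": "example-id",
    "keywords": ["placeholder"],
    "keywords_tg": ["placeholder"],
    "title": "Placeholder title",
    "type": "catalogue",
    "scientific_domains": ["Placeholder domain"],
    "structure_type": ["Placeholder structure"],
    "legal_status": ["Placeholder status"],
}

print(validate_catalogue(sample, catalogue_output_schema))  # → []
```

If `legal_status` is not yet present in the source data, a check like this would report it as missing, so the field may need to be treated as optional until it is added.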

This is my final proposal. @agpul