geolexica / jekyll-geolexica

Geolexica using Jekyll

Use glossarist for reading yaml files #14

Closed HassanAkbar closed 7 months ago

HassanAkbar commented 1 year ago

We should use glossarist to read the concept YAML files, as mentioned here: https://github.com/geolexica/jekyll-geolexica/issues/12#issuecomment-1662015937

HassanAkbar commented 1 year ago

It is mentioned here https://github.com/geolexica/jekyll-geolexica/issues/12#issuecomment-1662015937

I think this makes sense:

  1. start supporting authoritativeSource,
  2. support current Glossarist 2 model in glossarist-ruby,
  3. switch jekyll-geolexica to use glossarist-ruby with thorough testing on existing website repositories to ensure no undesired changes happen when they are later deployed.

In that order…

@opoudjis @ronaldtse Do we want to use Glossarist model v2 in glossarist-ruby for this, or can we start working on this with v1?

ronaldtse commented 1 year ago

@HassanAkbar we want to migrate all files to Glossarist model v2. Can we do it for all our existing repositories?

HassanAkbar commented 1 year ago

@ronaldtse Currently glossarist-ruby does not support Glossarist model v2. If we are migrating every repo to v2, then we need to update glossarist-ruby as well, and I am not yet sure what v2 is.

@strogonoff I see that you have worked on updating the isotc211-glossary to V2. Can you help me understand the structure of a V2 glossary?

ronaldtse commented 1 year ago

In this case there is no v2 and we should get this working with v1!

HassanAkbar commented 1 year ago

@ronaldtse @strogonoff Currently the glossary for isotc211-glossary is in V2 and the glossary for osgeo-glossary is in V1, so if we go with V1, isotc211 will break.

HassanAkbar commented 12 months ago

We can release a new version of Jekyll-geolexica that uses Glossarist model V2, and the sites that still use Glossarist model V1 can keep using the old version of Jekyll-geolexica.

@ronaldtse The important thing is that we will drop the support of Glossarist model V1 in Jekyll-geolexica. We won’t be able to make any changes to the older version of Jekyll-geolexica, so we should prioritize updating the sites to V2 so that they can receive modifications, features, and bug fixes.

ronaldtse commented 11 months ago

The important thing is that we will drop the support of Glossarist model V1 in Jekyll-geolexica

That's fine with me because we control all those repositories right now. We should bring all of those repositories up to date as soon as possible.

HassanAkbar commented 11 months ago

@ronaldtse @strogonoff I have a few questions related to Glossarist model V2,

strogonoff commented 11 months ago

In more detail:

stefanomunarini commented 11 months ago

  • In case of that YAML, data contains many extra fields that are not in Glossarist model. That’s likely a mistake. I think the YAML you are seeing is probably output by some data conversion script that doesn’t follow the models as intended.

    • review* fields are not supposed to be there
    • dates list is not supposed to be there

I can update the script to delete excess data, as for above.

  • It probably also outputs wrong dateAccepted.

@strogonoff are we talking about the output of a concept or of a localized-concept? In the former case, we set its value to a dummy date that we retrieve from config.py. In the latter case, we take its value from the data itself if present, and otherwise use the same default value as for the former.

How could this be improved?
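
For reference, the defaulting behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual conversion script: the constant `DEFAULT_DATE_ACCEPTED`, its value, the function names, and the `date_accepted` field name are all assumptions made for the example.

```python
from datetime import date

# Hypothetical stand-in for the dummy date that the script reads from config.py.
DEFAULT_DATE_ACCEPTED = date(1970, 1, 1)


def concept_date_accepted():
    """A top-level concept always gets the configured dummy date."""
    return DEFAULT_DATE_ACCEPTED


def localized_date_accepted(source_row):
    """A localized concept uses the date found in the source data when present,
    and falls back to the same dummy date otherwise."""
    value = source_row.get("date_accepted")  # hypothetical field name
    return value if value is not None else DEFAULT_DATE_ACCEPTED
```

Written this way, the fallback is visible in one place, which makes it easier to see where a wrong dateAccepted could come from.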

ronaldtse commented 11 months ago

Just for the record, we have 2 families of models:

  1. Glossarist models: represents a concept and related things
  2. Register models: represents a register and related things (such as a register item)

In this case, it happens that the Glossarist dataset is managed by a Register. This means that every Glossarist Concept is also a Register Concept (in the new ISO 19135 under development, but in the old version currently it is a Register Item).

It happens that in ISO/TC 211, they use the old Register model which means that every Concept is a Register Item, and that each concept in the MLGT (the content on isotc211.geolexica.org) is accompanied by some status dates such as "approval date" (and this content is at both the general concept level and the localized concept level).

In an ideal world, the data for the Glossarist models (data content) is separate from the Register models (administrative content). The Register models can refer to the Glossarist models, of course, and vice versa. This way we could use different parsers/model accessors to work with the data:

HassanAkbar commented 11 months ago

@ronaldtse So, for Glossarist we should only read the data inside the data key and discard other keys in the yaml file. Also I think we should discard the Register data in Glossarist. Should we create a separate gem for that or is there an existing gem that we can use?
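
A minimal sketch of that separation, assuming each concept file has already been parsed into a mapping: everything under the `data` key is treated as Glossarist content, and every other top-level key as Register content. The function name and the sample keys other than `data` are hypothetical and only illustrate the idea; this is not the glossarist gem's API.

```python
def split_concept_record(record):
    """Return (glossarist_data, register_metadata) for one parsed concept file.

    `record` is the mapping obtained from parsing a concept YAML file.
    The input mapping is not modified.
    """
    glossarist_data = record.get("data", {})
    register_metadata = {key: value for key, value in record.items() if key != "data"}
    return glossarist_data, register_metadata


# Hypothetical record: only the `data` key is taken from the discussion above.
record = {
    "id": "example-id",
    "status": "valid",
    "dateAccepted": "2020-01-01",
    "data": {"identifier": "1"},
}

glossarist_data, register_metadata = split_concept_record(record)
```

With the two parts separated, a Glossarist reader can ignore `register_metadata` entirely, and a Register-aware tool can still pick it up.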

HassanAkbar commented 11 months ago

As mentioned by @strogonoff

In case of that YAML, data contains many extra fields that are not in Glossarist model. That’s likely a mistake. I think the YAML you are seeing is probably output by some data conversion script that doesn’t follow the models as intended.

  • review* fields are not supposed to be there
  • dates list is not supposed to be there

@ronaldtse One more question related to this: should I assume that these will be fixed in isotc211-glossary, or should I add these fields temporarily in Glossarist?

stefanomunarini commented 11 months ago

As mentioned by @strogonoff

In case of that YAML, data contains many extra fields that are not in Glossarist model. That’s likely a mistake. I think the YAML you are seeing is probably output by some data conversion script that doesn’t follow the models as intended.

  • review* fields are not supposed to be there
  • dates list is not supposed to be there

If I get green light on this one, I can update the script to fix the data structure, removing excess data fields. It's straightforward, won't take long.
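
As a rough sketch of what that cleanup could look like, assuming the excess fields are flat keys on the concept's data mapping: any key starting with `review`, plus the `dates` key, is dropped. The function name and the sample keys (`review_date`, `review_status`, `definition`) are hypothetical; the real script may structure this differently.

```python
def strip_excess_fields(data):
    """Return a copy of a concept's data mapping without `review*` keys and `dates`."""
    return {
        key: value
        for key, value in data.items()
        if key != "dates" and not key.startswith("review")
    }


# Hypothetical input; only `review*` and `dates` come from the discussion above.
raw = {
    "definition": "example definition",
    "review_date": "2020-01-01",
    "review_status": "final",
    "dates": [{"type": "accepted", "date": "2020-01-01"}],
}

cleaned = strip_excess_fields(raw)
```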

ronaldtse commented 8 months ago

I am lost in this thread. What is still pending?

The goal here is to synchronize the YAML structures for the Glossarist Ruby gem (used by jekyll-geolexica) and the Glossarist plugin.

This means we need to update all the data sets to the latest structure. That's it.

HassanAkbar commented 8 months ago

In case of that YAML, data contains many extra fields that are not in Glossarist model. That’s likely a mistake. I think the YAML you are seeing is probably output by some data conversion script that doesn’t follow the models as intended.

  • review* fields are not supposed to be there
  • dates list is not supposed to be there

If I get green light on this one, I can update the script to fix the data structure, removing excess data fields. It's straightforward, won't take long.

@ronaldtse I just want to confirm: do we need to update this in glossarist, or fix the data structure?

ronaldtse commented 8 months ago

@HassanAkbar so the tricky thing here is about the latest MLGT data which is done using this gem: https://github.com/geolexica/tc211-termbase .

The point is actually to upgrade the tc211-termbase gem to use the glossarist gem.

The input data for the gem is the XLSX file, and the output is a Glossarist YAML ConceptCollection.

HassanAkbar commented 8 months ago

@ronaldtse Let me summarize what’s going on here.

As I was not familiar with Glossarist model V2, @strogonoff suggested looking at paneron-extension-glossarist/models/concepts.ts to understand it.

While discussing the format, @strogonoff explained that the review* and dates fields do not belong in the model: https://github.com/geolexica/jekyll-geolexica/issues/14#issuecomment-1784539896.

As these fields don't belong to Glossarist model, I believe we should let @stefanomunarini update the generation script so that the data in isotc211-glossary can be corrected. @stefanomunarini Can you help with that ?

stefanomunarini commented 8 months ago

As these fields don't belong to Glossarist model, I believe we should let @stefanomunarini update the generation script so that the data in isotc211-glossary can be corrected. @stefanomunarini Can you help with that ?

Sure, I've pushed a commit. You can now re-run the script to update the data, @HassanAkbar.

HassanAkbar commented 8 months ago

@stefanomunarini I was looking at the isotc211-glossary and it seems like you updated the concepts last time. Can you let me know the steps needed to generate the isotc211-glossary?

stefanomunarini commented 8 months ago

Hi @HassanAkbar please review and merge this PR https://github.com/geolexica/isotc211-glossary/pull/44

ronaldtse commented 8 months ago

The content of isotc211-glossary is created by the tc211-termbase gem, which took the XLSX file and processed it into the old Glossarist YAML. Once I get back to the computer I’ll provide you with documentation.

ronaldtse commented 7 months ago

@HassanAkbar the tc211-termbase gem is updated at https://github.com/geolexica/tc211-termbase/pull/31 , can you now:

  1. regenerate the isotc211-glossary concept set and push the changes, and
  2. ensure that the output directly works with jekyll-geolexica?

Thanks.

HassanAkbar commented 7 months ago

@ronaldtse can you let me know where I can get the XLSX file for generating the concepts?

ronaldtse commented 7 months ago

@HassanAkbar here's the file:

https://github.com/ISO-TC211/mlgt-data/blob/main/release-6/20231214%20Multi-Lingual%20Glossary%20–%20Published%20__unlocked__%20with%20Math.xlsx

HassanAkbar commented 7 months ago

@ronaldtse I think I don't have access to https://github.com/ISO-TC211/mlgt-data repo, can you help with that?

HassanAkbar commented 7 months ago

@ronaldtse I just saw this issue: "authoritativeSources in localizedConcepts YAML are empty objects". Currently there is no support for authoritativeSources in glossarist, and since we use it to generate concepts in tc211-termbase, the output files do not have an authoritativeSources key in localized-concepts.

Should we add this in tc211-termbase, or should I run a separate script after concept generation is completed?

ronaldtse commented 7 months ago

@HassanAkbar yes we should add them in tc211-termbase. Previously there were sources in the generated output, I don't know where they have gone.

HassanAkbar commented 7 months ago

@HassanAkbar here's the file:

https://github.com/ISO-TC211/mlgt-data/blob/main/release-6/20231214%20Multi-Lingual%20Glossary%20–%20Published%20__unlocked__%20with%20Math.xlsx

@ronaldtse I've updated the glossary using the above file in this PR -> https://github.com/geolexica/isotc211-glossary/pull/47

I have a couple of questions related to the generated concept files

ronaldtse commented 7 months ago

@ronaldtse I've updated the glossary using the above file in this PR -> geolexica/isotc211-glossary#47

I have a couple of questions related to the generated concept files

  • Currently in the concept files the key for localized concepts is localized_concepts because it is generated using the glossarist and we use snake case convention in glossarist, while in the previous version the key was in camel casing i.e localizedConcepts. So what should I use now?

Use the Glossarist gem convention because jekyll-geolexica also uses the Glossarist gem. Correct?

  • Also the register information(info outside the data key) is not being added because it is not handled in glossarist so should I add that using a script or should I add this functionality in isotc211-termbase repo?

This should be added in the tc211-termbase gem so we can display them in isotc211.geolexica.org.
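
While data sets are being regenerated, a reader may encounter both key spellings. A minimal sketch of a tolerant lookup, assuming the concept file has already been parsed into a mapping: it prefers the snake case key used by the Glossarist gem and falls back to the older camel case key. The function name and the sample values are hypothetical, and this is not part of the glossarist gem.

```python
def localized_concepts_of(concept):
    """Return the localized concepts mapping under either key spelling.

    Prefers `localized_concepts` (Glossarist gem convention) and falls back
    to the older `localizedConcepts`; returns an empty dict if neither exists.
    """
    if "localized_concepts" in concept:
        return concept["localized_concepts"]
    return concept.get("localizedConcepts", {})


# Hypothetical records showing both spellings.
new_style = {"localized_concepts": {"eng": "example-uuid-1"}}
old_style = {"localizedConcepts": {"eng": "example-uuid-2"}}
```

Once every repository has been regenerated with the snake case key, the fallback branch can simply be removed.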

HassanAkbar commented 7 months ago

Use the Glossarist gem convention because jekyll-geolexica also uses the Glossarist gem. Correct?

@ronaldtse Currently it is not using the glossarist gem. Next, I will update jekyll-geolexica to read the concepts using glossarist.