Language parsing problems

cessda-bitbucket-importer commented 6 years ago

Original report on BitBucket by Cessda Techframe (GitHub: cessda).

CDC has problems interpreting different language versions of the same metadata. I assume the following two errors are instances of the same problem.

1. Content for the language-specific time dimension should be harvested from the element with the correct xml:lang attribute.

For example:

 <timeMeth xml:lang="fi">Pitkittäisaineisto: trendi/toistuva poikkileikkausaineisto<concept>Longitudinal.TrendRepeatedCrossSection</concept>
 </timeMeth>
 <timeMeth xml:lang="en">Longitudinal: Trend/Repeated cross-section<concept>Longitudinal.TrendRepeatedCrossSection</concept>
 </timeMeth>

should result in two language versions:

 fi: Pitkittäisaineisto: trendi/toistuva poikkileikkausaineisto
 en: Longitudinal: Trend/Repeated cross-section

2. Content for the language-specific analysis unit should be harvested from the element with the correct xml:lang attribute.

For example:

 <anlyUnit xml:lang="fi">Henkilö<concept>Individual</concept>
 </anlyUnit>
 <anlyUnit xml:lang="en">Individual<concept>Individual</concept>
 </anlyUnit>

should result in two language versions:

 fi: Henkilö
 en: Individual
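
The expected behaviour in both examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CDC implementation: the `<record>` wrapper, the `text_by_language` helper and the `<concept>` children of `timeMeth` (assumed by analogy with `anlyUnit`) are hypothetical, and the sample omits the DDI namespace that real records would carry.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record built from the values quoted in this report.
SAMPLE = """<record>
  <timeMeth xml:lang="fi">Pitkittäisaineisto: trendi/toistuva
    poikkileikkausaineisto<concept>Longitudinal.TrendRepeatedCrossSection</concept></timeMeth>
  <timeMeth xml:lang="en">Longitudinal: Trend/Repeated cross-section<concept>Longitudinal.TrendRepeatedCrossSection</concept></timeMeth>
  <anlyUnit xml:lang="fi">Henkilö<concept>Individual</concept></anlyUnit>
  <anlyUnit xml:lang="en">Individual<concept>Individual</concept></anlyUnit>
</record>"""


def text_by_language(root, tag):
    """Map each xml:lang value to the element's own text, ignoring child elements."""
    result = {}
    for element in root.iter(tag):
        lang = element.get(XML_LANG)
        if lang is None:
            continue  # elements without a declared language are skipped in this sketch
        # element.text is the text before the first child (<concept>), so the
        # vocabulary code is not mixed into the label; split/join collapses
        # line breaks and repeated spaces.
        result[lang] = " ".join((element.text or "").split())
    return result


root = ET.fromstring(SAMPLE)
time_methods = text_by_language(root, "timeMeth")
analysis_units = text_by_language(root, "anlyUnit")
print(time_methods)
print(analysis_units)
```

Keying the result on the `xml:lang` value, rather than on element order, is what keeps the Finnish text out of the English version and vice versa.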

The incorrect interpretation can be seen at the following URL when switching the language between English and Finnish: the English content contains the Finnish language versions, and the Finnish content contains none.

https://datacatalogue-dev.cessda.eu/detail?q="Finish-Data-Services__oai%3Afsd.uta.fi%3AFSD3187"

Best regards, Toni Sissala

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).

See also #87

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).

Left comment for #87: https://github.com/cessda/cessda.pasc.version2/issues/87

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).

I have tried several ways to reproduce this, without success. In brief:

I have added a unit test to prevent regressions in this area in future. See [PR here](link to pull request removed)
I have also run a few full re-indexes locally to confirm this works as expected.
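
A regression test for this kind of bug might look like the sketch below. It is written in Python for brevity and is not the test from the pull request; `build_language_indices` is a hypothetical stand-in for the indexing step, and the `<record>` wrapper is illustrative. The two test cases mirror the two reported symptoms: an empty Finnish version, and Finnish content appearing in the English version.

```python
import unittest
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record using the analysis unit values from the report.
RECORD = """<record>
  <anlyUnit xml:lang="fi">Henkilö<concept>Individual</concept></anlyUnit>
  <anlyUnit xml:lang="en">Individual<concept>Individual</concept></anlyUnit>
</record>"""


def build_language_indices(xml_text, tag):
    """Hypothetical stand-in for the indexer: one list of values per language."""
    indices = {}
    for element in ET.fromstring(xml_text).iter(tag):
        lang = element.get(XML_LANG)
        if lang is not None:
            indices.setdefault(lang, []).append((element.text or "").strip())
    return indices


class LanguageSeparationTest(unittest.TestCase):
    def setUp(self):
        self.indices = build_language_indices(RECORD, "anlyUnit")

    def test_finnish_index_is_populated(self):
        # Reported symptom: the Finnish version contained nothing.
        self.assertEqual(self.indices.get("fi"), ["Henkilö"])

    def test_english_index_has_only_english_content(self):
        # Reported symptom: the English version contained the Finnish text.
        self.assertEqual(self.indices.get("en"), ["Individual"])
```

Saved as a module, the tests can be run with `python -m unittest`.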

I can confirm that newer iterations have fixed this issue; the bug must have been raised against an initial version of the CDC pipeline.

Live examples of this working:

Record: https://datacatalogue-dev.cessda.eu/detail?q=%22Finish-Data-Services__oai%3Afsd.uta.fi%3AFSD3187%22

JSON comparison of the English index and the Finnish index:

Screenshot 2019-04-23 at 14.25.17.png
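
A comparison of this kind can also be scripted rather than inspected by eye. The following is a minimal sketch, assuming one JSON document per language for the same record; the field names `analysisUnit` and `timeMethod` and the helper `differing_fields` are hypothetical and are not taken from the CDC index schema, while the values are the ones quoted in the report.

```python
import json

# Hypothetical index documents for the same study, one per language.
ENGLISH_JSON = """{
  "analysisUnit": "Individual",
  "timeMethod": "Longitudinal: Trend/Repeated cross-section"
}"""
FINNISH_JSON = """{
  "analysisUnit": "Henkilö",
  "timeMethod": "Pitkittäisaineisto: trendi/toistuva poikkileikkausaineisto"
}"""


def differing_fields(left, right):
    """Return {field: (left_value, right_value)} for every field whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


english = json.loads(ENGLISH_JSON)
finnish = json.loads(FINNISH_JSON)
diff = differing_fields(english, finnish)

# For language-specific fields, identical values in both indices would hint
# that one language's text has leaked into the other.
leaked = [field for field in english if field in finnish and field not in diff]
print(diff)
print(leaked)
```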

See raw JSON here:

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).

Assigning this to you, @jws_mo, to verify and close. For faster feedback I have merged the PR myself, as it is a straightforward unit test addition.

link to pull request removed

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).

Contacted Toni Sissala to say we believe we have fixed this. Am awaiting his response.

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).

From: Toni Sissala (TAU)
Sent: 30 April 2019 13:20
To: Shepherdson, John W; Toni Sissala
Cc: Matti Heinonen
Subject: Re: CESSDA Data Catalogue - Issues in interpreting language versions

Hi John,

I had a look and can confirm that the issue seems to be resolved. Thanks!

Best regards, Toni


cessda / cessda.cdc.versions

Language parsing problems #53