cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

HTML tags in CDC and OAI-PMHs #318

Closed cessda-bitbucket-importer closed 3 years ago

cessda-bitbucket-importer commented 3 years ago

Original report on BitBucket by Taina Jääskeläinen.


Taina:

Why is the abstract in this dataset shown as one block of text and not in paragraphs?

https://datacatalogue-staging.cessda.eu/detail?q="GESIS__oai:dbk.gesis.org:DBK/ZA6713"

Is it because of the way GESIS is sending it (their OAI-PMH) or the way CDC is displaying this information? On the GESIS site the abstract is in paragraphs.

https://dbk.gesis.org/dbksearch/sdesc2.asp?no=6713&db=e

Matthew: The GESIS abstract uses newlines to separate paragraphs, and does not use HTML elements such as <p> or <br>. When HTML is rendered, newlines in text are collapsed into ordinary whitespace, so the CDC doesn’t show them as paragraph breaks.
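To illustrate the point about newlines, here is a minimal sketch of how a plain-text abstract could be turned into paragraphs before display. It is not the CDC's actual code: the function name `abstract_to_html` is hypothetical, and it assumes that every run of newlines in the harvested abstract marks a paragraph break.

```python
import html
import re


def abstract_to_html(text: str) -> str:
    """Wrap each newline-separated block of a plain-text abstract in <p>.

    Assumption (not confirmed in this issue): any run of one or more
    newlines is meant as a paragraph break.
    """
    # Normalise Windows and old Mac line endings to "\n".
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on runs of newlines and trim surrounding whitespace.
    blocks = [block.strip() for block in re.split(r"\n+", normalised)]
    # Escape each block so stray "<" or "&" cannot be read as markup,
    # and skip blocks that are empty after trimming.
    return "\n".join(f"<p>{html.escape(block)}</p>" for block in blocks if block)


print(abstract_to_html("First paragraph.\nSecond paragraph."))
```

An alternative that needs no text processing would be to style the abstract container with the CSS declaration `white-space: pre-line`, which preserves newlines when rendering. Whether that fits the CDC front end is not something this thread establishes.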

Taina: The FSD OAI-PMH seems to have the same issue: all tags are stripped. Toni said this is because the OAI-PMH is also used in places where the content is displayed in something other than DDI XML format. Therefore the abstract there also uses newlines.

What should be done about this? As Kuha will also be used for the CDC OAI-PMH, maybe you tech guys could take this up between you. From the user's point of view, reading long abstracts without paragraph breaks is very cumbersome.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


Conclusion of discussion in CDC User Group:

However, we will need to consider and discuss this from the aggregator point of view as well: will all the catalogues that may use the aggregator have this coding to display the paragraphs correctly? Are they expecting newlines, or can they handle the textual tags?

I will keep the issue here and ‘On hold’ for the time being, as it does not require anything from the CDC end at this point.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


The users of course would prefer to read abstracts with paragraphs instead of one big lump of text. However, this does not seem solvable.

The CDC has been trained to handle the most common HTML tags which some SPs have in their endpoints, so what can be done at CDC end has been done.
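As an illustration of what handling "the most common HTML tags" could look like, here is a minimal allow-list filter built on Python's standard `html.parser`. It is only a sketch under assumptions: the set in `ALLOWED_TAGS` is hypothetical (the tags the CDC actually accepts are not listed in this issue), and the names `AllowListFilter` and `filter_abstract` are made up for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list; the real set of tags the CDC accepts is not stated here.
ALLOWED_TAGS = {"p", "br", "ul", "ol", "li", "strong", "em"}
# Void elements have no closing tag in HTML.
VOID_TAGS = {"br"}


class AllowListFilter(HTMLParser):
    """Keep allow-listed tags (without attributes), drop other tags, keep all text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Attributes are deliberately dropped.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were already decoded, so re-escape the text.
        self.parts.append(escape(data, quote=False))


def filter_abstract(markup: str) -> str:
    parser = AllowListFilter()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(filter_abstract('<p class="x">A &amp; B</p><div>C<br/>D</div>'))
```

Text inside tags that are not on the list is kept and only the tags themselves are removed, so no abstract content is lost.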

Other SPs remove the tags from their OAI-PMHs, because apparently this is what should be done from the technical point of view. It therefore does not seem to be good policy to ask SPs to include the HTML tags.

DDI only allows

tags. The choice of repeating the abstract for paragraphs does not work either since repeating the abstract is often used for different language versions.

I’m resolving this unresolvable issue.