cessda / cessda.cdc.versions

Issue track and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Html tags in text fields #226

Closed cessda-bitbucket-importer closed 3 years ago

cessda-bitbucket-importer commented 4 years ago

Original report on BitBucket by Taina Jääskeläinen.


CDC does not seem to read HTML tags properly, which makes abstracts and other textual elements rather hard to read.

For instance, in this document: https://datacatalogue.cessda.eu/detail?q="UKDS__8661"

Data access has correct HTML tags, but the paragraphs are not displayed properly and CDC shows the tags instead.

This is how the metadata is displayed on the UKDS website (you need to open the full description)

https://beta.ukdataservice.ac.uk/datacatalogue/studies/study?id=8661

I consulted our expert. He assumes that the CDC parses the XML file (or JSON file?) and creates the needed HTML structure for the catalogue view. But the content of any text field is displayed as a plain string: the CDC ignores all HTML tags from the source and only honours control characters for newline (\n). This also makes any hyperlinks in the abstract invalid.
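
To illustrate the behaviour described above, here is a minimal sketch in Python. It is not the CDC's actual code; the sample abstract and the function name `render_as_text` are made up for the example. It shows why markup that is treated as a plain string ends up visible to the reader, and why links stop working.

```python
import html

# Hypothetical abstract as it might arrive from a source repository.
abstract = '<p>First paragraph.</p><p>See <a href="https://example.org">the study</a>.</p>'


def render_as_text(value: str) -> str:
    """Escape the markup so a browser shows the tags literally."""
    return html.escape(value)


escaped = render_as_text(abstract)
# The browser now receives &lt;p&gt; instead of <p>, so the tags are shown
# as text and the <a> element is no longer a clickable link.
print(escaped)
```
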

Topic and keyword may be coded differently, as they are displayed differently.

@matthew-morris-cessda @john-shepherdson

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


This is caused by the Datacatalogue filtering out HTML tags before rendering the abstract.

I'm working on a fix for &nbsp; appearing, as that shouldn't be rendered.

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


The &nbsp; issue has been fixed.

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


tags are most frequent. We can either strip them or interpret them (as an exception). Decision by @TainaFSD is to interpret them.

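
The choice between stripping tags and interpreting a few of them as an exception can be sketched as an allow-list filter. This is only an illustration in Python using the standard library, not the change that was made in the CDC; the allow-list contents and the names `ALLOWED_TAGS`, `AllowListSanitizer` and `sanitize` are assumptions made for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tags that are interpreted; all others are stripped.
ALLOWED_TAGS = {"p", "br"}
VOID_TAGS = {"br"}  # elements that have no closing tag


class AllowListSanitizer(HTMLParser):
    """Keep allow-listed tags (without attributes), drop all other tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is kept, re-escaped so it cannot introduce new markup.
        self.parts.append(escape(data, quote=False))


def sanitize(value: str) -> str:
    parser = AllowListSanitizer()
    parser.feed(value)
    parser.close()
    return "".join(parser.parts)


print(sanitize('<p>One</p><p>Two <a href="http://x">link</a><br/>end</p>'))
```

In this sketch the text inside a stripped tag is kept, so a link loses its target but its label stays readable.
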
cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


Merged as of https://github.com/cessda/cessda.cdc.searchkit/commit/25ebe90fd32eb08b6f03e6399f742c371036d71c.

Styling questions regarding the display of these tags should be discussed in a separate issue.

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Tags still visible in data access section

Screenshot 2020-11-18 at 14.55.23.png

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


Terms of data access is now interpreted as HTML

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Checked as working on staging instance

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


Found one more in Sampling procedure

https://datacatalogue-staging.cessda.eu/detail?q=%22UKDS__369%22

Two-stage
1. Systematic: 60 out of 71 parliamentary constituencies: stratified by political complexion, conurbation/urban/rural, and size of electorate
2. Quota: equal-sized quota sample of individuals aged 18-plus resident in each constituency. Quota controls (4 age and 4 social class groups within sex) were based on the most recent census and IPA National Readership Survey data available

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


More tags found

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


The Sampling Procedure field now displays correctly.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


Why is the abstract in this dataset shown as a single block and not in paragraphs?

https://datacatalogue-staging.cessda.eu/detail?q="GESIS__oai:dbk.gesis.org:DBK/ZA6713"

Is it because of the way GESIS is sending it (their OAI-PMH) or the way CDC is displaying this information? On the GESIS site the abstract is in paragraphs.

https://dbk.gesis.org/dbksearch/sdesc2.asp?no=6713&db=e

If GESIS, I need to do a metadata ticket for them.

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


The GESIS abstract uses newlines to separate paragraphs, and does not use HTML elements such as <p> or <br>.

Newlines are ignored by HTML parsers, so the CDC doesn’t render them.
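
One possible workaround, sketched here in Python purely as an illustration, is to turn newline-separated text into paragraph elements before display. It assumes that every non-empty line is one paragraph, which may not hold for all providers; the function name `newlines_to_paragraphs` is made up for the example and nothing here reflects how the CDC or Kuha handle this.

```python
from html import escape


def newlines_to_paragraphs(text: str) -> str:
    """Wrap each non-empty line in <p>...</p>, escaping its content."""
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    return "".join(f"<p>{escape(p, quote=False)}</p>" for p in paragraphs)


# Hypothetical abstract that separates paragraphs with newlines only.
sample = "First.\n\nSecond & third.\nFourth."
print(newlines_to_paragraphs(sample))
```

A purely presentational alternative is the CSS declaration `white-space: pre-line`, which makes the browser keep line breaks from the source text without changing the metadata.
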

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


The FSD OAI-PMH seems to have the same issue: all tags are stripped. Toni said this is because the OAI-PMH is used in some places where the content is not displayed in DDI XML format. Therefore the abstracts there also use newlines.

What should be done about this? As Kuha will also be used for the CDC OAI-PMH, maybe you tech guys could take this up between you. From the user's point of view, reading abstracts that do not have paragraphs etc. is really cumbersome.

I don’t think this can be solved by the next CDC release, so I am moving it to the next version. Probably another issue needs to be made for this.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


Closing this issue, as the HTML issues have been resolved. I have made another issue, #318, for the newlines problem.