Property type: RichText

Description

In order to support formatted text that could be used as advanced comments and descriptions there is a need to develop a new type RichText. The need and implementation complexity of this type lies in its dual nature – as formatted text expressed in a markup language and core text (i.e., without markup).

For example, given Markdown as the markup language, the following are examples of formatted text and core text respectively:

`cons` **does not** evaluate its arguments in a *lazy* language.

cons does not evaluate its arguments in a lazy language.

For values of RichText to be searchable, the markup needs to be ignored – only the core text needs to be searched. However, in order to preserve formatting for display purposes, there is also the need to store the text value with markup (formatted text).

Therefore, a single value of the RichText type should provide access to both core and formatted text. From the search perspective, which happens at the database engine level and should ignore formatting, there is a need to persist both the core and formatted values. EQL should interpret a RichText value as its core text component.
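
The dual nature described above can be illustrated with a minimal sketch. The class name `RichTextSketch` and the method `contains` are hypothetical and are not the platform's actual API; the sketch only shows that one value carries both components and that search consults the core text alone.

```java
import java.util.Objects;

/** Illustrative stand-in for a RichText value; not the platform's actual class. */
final class RichTextSketch {
    private final String formattedText; // text with markup, kept for display
    private final String coreText;      // text without markup, used for search

    RichTextSketch(String formattedText, String coreText) {
        this.formattedText = Objects.requireNonNull(formattedText);
        this.coreText = Objects.requireNonNull(coreText);
    }

    String formattedText() { return formattedText; }

    String coreText() { return coreText; }

    /** Search ignores markup: only the core text is consulted. */
    boolean contains(String query) {
        return coreText.contains(query);
    }

    public static void main(String[] args) {
        RichTextSketch value = new RichTextSketch(
                "`cons` **does not** evaluate its arguments in a *lazy* language.",
                "cons does not evaluate its arguments in a lazy language.");
        System.out.println(value.contains("does not evaluate")); // true
        System.out.println(value.contains("**does not**"));      // false
    }
}
```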

Initially, Markdown was used as the markup language for RichText, but then it was decided that HTML should be used instead.

[x] 1. Introduce RichText at the model level.
[x] 2. Extract core text from formatted text (depends on the markup).
[x] 3. Add Hibernate support for RichText: define a Hibernate user type and make it a composite one (see MoneyUserType).
[x] 4. Make sure PersistDomainMetadataModel supports properties of type RichText. It might be necessary to modify and generalise it to support arbitrary composite types.
[x] 5. Add support for RichText to EQL: interpret RichText as core text. For example, in a query that contains prop("richText").eq().val(str), where richText: RichText, the resulting SQL should compare the value of str to the core text component of property richText.
- [x] 5.1. Properties of type RichText.
- [x] 5.2. Values of type RichText.
[x] 6. Determine the necessity for HTML sanitization at the model level. In particular, the question of "When should sanitization be performed?".

The goal is to ensure safety of the formatted text component. In the case of Markdown, which allows embedding arbitrary HTML, this boils down to the safety of HTML. One approach is to enforce safety as an integrity constraint, through a validator which rejects RichText values that contain unsafe HTML. Safety can be determined by using an HTML sanitizer.

~- [ ] 6.1. Implement an implicit validator for RichText properties, enabled by default, that disallows unsafe content.~
- [x] 6.1. Implement validation of input text used to construct RichText values: reject inputs that contain unsafe HTML.
[x] 7. Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
[x] 8. IsProperty.length at the level of a RichText property should apply to the coreText component.
- [x] 8.1. Max. length validation
- [x] 8.2. DB schema generation
- ~~[ ] 8.3. Property metadata~~ (excluded due to unwarranted complexity: no strict need to have it implemented at the moment)
[x] 9. The type of RichText.formattedText should be mapped to a DB type for variable-length text (varchar for SQL Server, text for PostgreSQL).
[x] 10. RichText serialisation from & deserialisation into JSON objects.
- Serialised RichText should have the following shape:
```
{ "formattedText": string, "coreText": string }
```
- Deserialisation can assume validity of serialised objects that are received. This is due to the fact that only unmodified property values are subject to deserialisation.
[x] 11. Deserialisation of modified RichText values. ua.com.fielden.platform.web.utils.EntityResourceUtils#convert should be enhanced. Only formattedText need be considered. Validation must be performed.
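
The serialised shape from item 10 can be sketched as follows. This is a dependency-free illustration with a hand-written escaper; the class `RichTextJson` and its methods are hypothetical, and a real implementation would more likely delegate to whatever JSON library the platform already uses.

```java
final class RichTextJson {
    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c)); // other control characters
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Produces an object of the shape { "formattedText": string, "coreText": string }. */
    static String toJson(String formattedText, String coreText) {
        return "{ \"formattedText\": \"" + escape(formattedText)
                + "\", \"coreText\": \"" + escape(coreText) + "\" }";
    }

    public static void main(String[] args) {
        System.out.println(toJson("<b>big</b> bang", "big bang"));
        // { "formattedText": "<b>big</b> bang", "coreText": "big bang" }
    }
}
```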

HTML

1. Extract core text from formatted text. Consider using jsoup.
1. Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
 - [x] 7.1. Inline tags that modify text style should be stripped (e.g., <b>, <i>, <code>): <b>text</b> ==> text.
 - [x] 7.2. Blocks should be replaced by the core text of their contents. Formatted text:
```
<pre>
hello world
</pre>
```
 Core text:
```
hello world
```
 - [x] 7.3. Links should be transformed as follows:
```
<a href='link'>text</a> ==> text (link)
```
 - [x] 7.4. Image links should be transformed as follows:
```
<img src='link' alt='text' /> ==> text (link)
```
 - [x] 7.5. Thematic breaks should be removed.
 - [x] 7.6. Newline characters should be removed to facilitate search without a clumsy use of wildcards.
 - [x] 7.7. Lists should be flattened and then joined into a single line.
```
<ul>
<li> one
<ol> 
<li> two three
</ol>
<li> first
<ul>
<li> second third
</ul>
</ul>
```
 into
```
one two three first second third
```
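
The HTML rules above can be approximated with a small regex-based sketch. This is a simplification that assumes well-formed input: it is not a real HTML parser (the text above suggests jsoup for that), it does not decode character entities, and the class `HtmlCoreTextSketch` and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlCoreTextSketch {
    private static final Pattern IMAGE = Pattern.compile("(?is)<img\\b([^>]*)>");
    private static final Pattern LINK = Pattern.compile("(?is)<a\\b([^>]*)>(.*?)</a>");
    private static final Pattern INLINE_TAG =
            Pattern.compile("(?i)</?(b|i|em|strong|code|u|s|span)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Returns the value of a quoted attribute, or an empty string if it is absent. */
    static String attr(String attrs, String name) {
        Matcher m = Pattern.compile("(?is)\\b" + name + "\\s*=\\s*(['\"])(.*?)\\1").matcher(attrs);
        return m.find() ? m.group(2) : "";
    }

    static String coreText(String html) {
        String s = html;
        // Images: <img src='link' alt='text' /> ==> text (link)
        s = IMAGE.matcher(s).replaceAll(r -> Matcher.quoteReplacement(
                attr(r.group(1), "alt") + " (" + attr(r.group(1), "src") + ")"));
        // Links: <a href='link'>text</a> ==> text (link)
        s = LINK.matcher(s).replaceAll(r -> Matcher.quoteReplacement(
                r.group(2) + " (" + attr(r.group(1), "href") + ")"));
        // Inline style tags vanish without leaving a gap.
        s = INLINE_TAG.matcher(s).replaceAll("");
        // Every other tag (blocks, lists, thematic breaks) becomes a separator.
        s = ANY_TAG.matcher(s).replaceAll(" ");
        // Newlines and runs of whitespace collapse into single spaces.
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(coreText("<a href='link'>text</a>"));       // text (link)
        System.out.println(coreText("<img src='link' alt='text' />")); // text (link)
        System.out.println(coreText("<pre>\nhello world\n</pre>"));    // hello world
    }
}
```

Replacing block-level tags with a space rather than an empty string keeps adjacent list items from being glued together once newlines are removed.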

Markdown

1. Extract core text from formatted text. Consider using commonmark-java.
1. Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
 - [x] 7.1. Boldface should be stripped (**text** ==> text).
 - [x] 7.2. Italics should be stripped (*text* ==> text and _text_ ==> text).
 - [x] 7.3. Code backticks should be stripped
```
`text` ==> text
```
 - [x] 7.4. Quote blocks should be transformed into regular text by removing the block quote marker (>).
 - [x] 7.5. Links should be transformed as follows:
```
[text](link title) ==> text (link title)
```
 - [x] 7.6. Fenced code blocks should be transformed into regular text by removing backticks and tildes.
 - [x] 7.7. Image link - in the same way as ordinary links.
 - [x] 7.8. Heading characters should be removed (Setext, ATX).
 - [x] 7.9. Thematic breaks should be removed.
 - [x] 7.10. Newline characters should be removed to facilitate search without a clumsy use of wildcards.
 - [x] 7.11. List markers, both bullet and ordered, should be removed.
```
1. one
2. two three
```
```
 - first
 - second third
```
into
```
one two three first second third
```
 - [x] 7.12. Inline HTML should be removed. It is highly unlikely that searching by HTML elements will be needed.
```
the <b>big</b> bang ==> the big bang
```
 - [x] 7.13. HTML blocks should be removed. Markdown already provides a number of useful block structures, so HTML blocks containing information that needs to be searched are highly unlikely to occur.
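
A few of the Markdown rules above can be approximated line by line with regular expressions. This is a rough sketch only, not a CommonMark parser (the text above suggests commonmark-java for that): for instance, it also strips underscores inside words and markers inside code blocks. The class `MarkdownCoreTextSketch` is a hypothetical name.

```java
final class MarkdownCoreTextSketch {
    static String coreText(String markdown) {
        StringBuilder out = new StringBuilder();
        for (String rawLine : markdown.split("\\R")) {
            String s = rawLine.trim();
            // Thematic breaks, Setext underlines and code fence lines carry no text.
            if (s.matches("([-*_])(\\s*\\1){2,}") || s.matches("=+|-{2,}")
                    || s.matches("(`{3,}|~{3,}).*")) {
                continue;
            }
            s = s.replaceFirst("^(>\\s*)+", "");             // block quote markers
            s = s.replaceFirst("^#{1,6}\\s+", "");           // ATX heading markers
            s = s.replaceFirst("^([-*+]|\\d+[.)])\\s+", ""); // bullet and ordered list markers
            // Links and images: [text](link title) ==> text (link title)
            s = s.replaceAll("!?\\[([^\\]]*)\\]\\(([^)]*)\\)", "$1 ($2)");
            s = s.replaceAll("<[^>]+>", "");                 // inline HTML tags
            s = s.replaceAll("[*_`]", "");                   // emphasis and code span markers
            out.append(s).append(' ');
        }
        // Joining lines with spaces removes newlines; collapse any leftover gaps.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(coreText("`cons` **does not** evaluate its arguments in a *lazy* language."));
        // cons does not evaluate its arguments in a lazy language.
        System.out.println(coreText("1. one\n2. two three\n - first\n - second third"));
        // one two three first second third
    }
}
```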

HTML sanitization

CommonMark establishes very liberal rules for embedded HTML (4.6 HTML blocks). Just some examples:

1. A block can start with a closing tag.
2. An open tag need not be closed.
3. A partial tag need not even be completed.
4. The initial tag doesn’t even need to be a valid tag, as long as it starts like one.

To sanitize HTML inside a CommonMark document, there are 2 approaches:

1. Run the whole document through a sanitizer. This is likely to go awry because a sanitizer treats everything as HTML, which can result in unintended transformation of non-HTML content (e.g., backticks may be escaped).
2. Sanitize only the HTML parts. This approach guarantees that non-HTML parts of a document won't be touched, but requires the additional effort of processing a document and sanitizing only the HTML parts, which can be accomplished with the commonmark-java library.

Since we don't want to modify anything but the unsafe parts of a document, the second approach is preferred.

For HTML sanitization, the OWASP Java HTML sanitizer can be used.
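
The preferred approach, sanitizing only the HTML parts, can be sketched as below. The sketch assumes the document has already been split into HTML and non-HTML segments by a parser; `Segment`, `sanitizeHtmlParts` and the toy sanitizer are hypothetical and stand in for the real parsing and sanitizing libraries.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class SelectiveSanitizerSketch {
    /** A piece of a document: either raw HTML or any other Markdown content. */
    static final class Segment {
        final boolean html;
        final String text;

        Segment(boolean html, String text) {
            this.html = html;
            this.text = text;
        }
    }

    /** Applies the sanitizer to HTML segments only; all other segments pass through untouched. */
    static String sanitizeHtmlParts(List<Segment> segments, UnaryOperator<String> htmlSanitizer) {
        return segments.stream()
                .map(seg -> seg.html ? htmlSanitizer.apply(seg.text) : seg.text)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // Toy sanitizer for the demo only: escapes angle brackets.
        UnaryOperator<String> toy = s -> s.replace("<", "&lt;").replace(">", "&gt;");
        List<Segment> doc = List.of(
                new Segment(false, "Use `a < b` here. "),
                new Segment(true, "<script>x</script>"));
        System.out.println(sanitizeHtmlParts(doc, toy));
        // Use `a < b` here. &lt;script&gt;x&lt;/script&gt;
    }
}
```

Note how the `<` inside the code span survives unchanged, which is exactly what running the whole document through a sanitizer would put at risk.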

Validate and Sanitize

Given the above described integrity constraint that ensures safety of RichText contents, a mechanism for detecting unsafe parts is required so that the validator can do its job. However, there are certain considerations to be taken into account when using the OWASP Java HTML sanitizer:

- The OWASP Java HTML sanitizer does not provide a predicate that would determine validity of a given HTML document.
- Policy violations can be tracked via HtmlChangeListener. This is the primary means for the validator to do its job.
- If a string sanitizes with no change notifications (via HtmlChangeListener), it is not the case that the input string is necessarily safe to use. Only use the output of the sanitizer.

This is taken from the OWASP Java HTML sanitizer project page. If no policy violations are reported, it should be safe to treat the validation as successful, but the correct value (the sanitizer's output) might not be the same as the validated one (the sanitizer's input). Therefore, the actual value assigned to a RichText property must be the one produced by the sanitizer. Validators have no control over the value that ultimately gets assigned, so value substitution must happen somewhere else:
- In a definer. This will require an implicit definer to be installed for all RichText properties.
- Inside the property setter's body. Running a sanitizer twice for the same input is not efficient and might cause performance issues for large inputs.

Yet another approach is to shift from property validation to value validation. Specifically, to limit the validation to construction of RichText values by prohibiting invalid inputs. Then, validation of properties would no longer be necessary due to the invariant that guarantees validity of RichText values.
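
Value validation at construction time can be sketched as follows. The `Sanitizer` interface here is a hypothetical abstraction that reports violations through a callback, loosely modelled on the change-listener idea described above; it is not the OWASP API, and the toy sanitizer exists for the demo only. The factory rejects input that triggers any violation and otherwise stores the sanitizer's output rather than the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ValidatedRichTextSketch {
    /** Hypothetical abstraction: returns sanitized output and reports each policy violation. */
    interface Sanitizer {
        String sanitize(String html, Consumer<String> onViolation);
    }

    /** Toy stand-in for a real sanitizer: discards <script> elements only. */
    static final Sanitizer TOY = (html, onViolation) -> {
        Matcher m = Pattern.compile("(?is)<script\\b.*?</script>").matcher(html);
        if (m.find()) {
            onViolation.accept("script");
        }
        return m.replaceAll(""); // replaceAll resets the matcher before replacing
    };

    private final String formattedText;

    private ValidatedRichTextSketch(String formattedText) {
        this.formattedText = formattedText;
    }

    String formattedText() { return formattedText; }

    /** The only way to obtain a value, so every existing value is known to be valid. */
    static ValidatedRichTextSketch fromHtml(String input, Sanitizer sanitizer) {
        List<String> violations = new ArrayList<>();
        String output = sanitizer.sanitize(input, violations::add);
        if (!violations.isEmpty()) {
            throw new IllegalArgumentException("Unsafe HTML: " + violations);
        }
        // Keep the sanitizer's output, never the raw input.
        return new ValidatedRichTextSketch(output);
    }

    public static void main(String[] args) {
        System.out.println(fromHtml("<b>ok</b>", TOY).formattedText()); // <b>ok</b>
        try {
            fromHtml("hi<script>alert(1)</script>", TOY);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // Unsafe HTML: [script]
        }
    }
}
```

Because the sanitizer runs once inside the factory, no definer or setter-level substitution is needed for values created this way.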

For more details see the page on validation.

Database mapping

The types for database columns should be chosen as follows:

- formattedText: String should be mapped to a column with name propertyName__formattedText of the largest text type with UTF-16 support. For SQL Server this would be NVARCHAR(MAX), for PostgreSQL this would be TEXT. No indexes are required when generating a database schema.
- coreText: String should be mapped to a column with name propertyName__coreText of a text type with UTF-8 support, with the size specified in attribute length of @IsProperty for propertyName. For both SQL Server and PostgreSQL this would be VARCHAR(length) (SQL Server requires a collation name ending in _UTF8 to support UTF-8 in VARCHAR). Indexes are required when generating a database schema.
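
The column naming and type selection described above can be sketched as a pair of helper functions. The class, enum and method names are hypothetical, and the SQL Server collation clause is omitted because no specific collation is named here.

```java
final class RichTextColumnsSketch {
    enum Dialect { SQL_SERVER, POSTGRESQL }

    /** Column definition for the formatted text component: the largest available text type. */
    static String formattedTextColumn(String propertyName, Dialect dialect) {
        String type = dialect == Dialect.SQL_SERVER ? "NVARCHAR(MAX)" : "TEXT";
        return propertyName + "__formattedText " + type;
    }

    /** Column definition for the core text component, sized by the length attribute of @IsProperty. */
    static String coreTextColumn(String propertyName, int length) {
        return propertyName + "__coreText VARCHAR(" + length + ")";
    }

    public static void main(String[] args) {
        // "description" is a made-up property name used for illustration.
        System.out.println(formattedTextColumn("description", Dialect.SQL_SERVER)); // description__formattedText NVARCHAR(MAX)
        System.out.println(formattedTextColumn("description", Dialect.POSTGRESQL)); // description__formattedText TEXT
        System.out.println(coreTextColumn("description", 255));                     // description__coreText VARCHAR(255)
    }
}
```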

Note for future self:

Hibernate type for RichText was implemented to map formattedText using regular StringType, which has proven to work well with NVARCHAR column type in SQL Server.

Future work

- Disallow RichText as composite key member. This should be enforced by the verifier.
- Update the verifier to allow properties with type RichText.
