To support formatted text for use in advanced comments and descriptions, a new type, RichText, needs to be developed. Both the need for this type and the complexity of its implementation stem from its dual nature: a value is at once formatted text expressed in a markup language and core text (i.e., the same text without markup).
For example, given Markdown as the markup language, the following are examples of formatted text and core text respectively:
`cons` **does not** evaluate its arguments in a *lazy* language.
cons does not evaluate its arguments in a lazy language.
For values of RichText to be searchable, the markup needs to be ignored – only the core text needs to be searched. However, in order to preserve formatting for display purposes, there is also the need to store the text value with markup (formatted text).
Therefore, a single value of the RichText type should provide access to both core and formatted text. From the search perspective, which happens at the database engine level and should ignore formatting, there is a need to persist both the core and formatted values. EQL should interpret a RichText value as its core text component.
Initially, Markdown was used as the markup language for RichText, but then it was decided that HTML should be used instead.
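The dual nature described above can be illustrated with a minimal sketch. The class name RichTextSketch and its methods are hypothetical and are not the platform's actual API; the sketch only shows that one value carries both components and that search looks at the core text alone.

```java
import java.util.Objects;

/** Hypothetical sketch of a RichText value; not the platform's actual class. */
final class RichTextSketch {
    private final String formattedText; // text with markup, kept for display
    private final String coreText;      // text without markup, used for search

    RichTextSketch(String formattedText, String coreText) {
        this.formattedText = Objects.requireNonNull(formattedText);
        this.coreText = Objects.requireNonNull(coreText);
    }

    String formattedText() { return formattedText; }

    String coreText() { return coreText; }

    /** Search considers only the core text, so markup never affects matching. */
    boolean matches(String needle) {
        return coreText.contains(needle);
    }
}
```

With the Markdown example above, searching for "does not evaluate" succeeds against the core text, whereas the same search against the formatted text would fail because of the `**` markers.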
[x] 1. Introduce RichText at the model level.
[x] 2. Extract core text from formatted text (depends on the markup).
[x] 3. Add Hibernate support for RichText: define a Hibernate user type and make it a composite one (see MoneyUserType).
[x] 4. Make sure PersistDomainMetadataModel supports properties of type RichText. It might be necessary to modify and generalise it to support arbitrary composite types.
[x] 5. Add support for RichText to EQL: interpret RichText as core text.
For example, in a query that contains prop("richText").eq().val(str), where richText: RichText, the resulting SQL should compare the value of str to the core text component of property richText.
[x] 5.1. Properties of type RichText.
[x] 5.2. Values of type RichText.
[x] 6. Determine the necessity for HTML sanitization at the model level. In particular, the question of "When should sanitization be performed?".
The goal is to ensure safety of the formatted text component. In the case of Markdown, which allows embedding arbitrary HTML, this boils down to the safety of HTML.
One approach is to enforce safety as an integrity constraint, through a validator which rejects RichText values that contain unsafe HTML. Safety can be determined by using an HTML sanitizer.
~- [ ] 6.1. Implement an implicit validator for RichText properties, enabled by default, that disallows unsafe content.~
[x] 6.1. Implement validation of input text used to construct RichText values: reject inputs that contain unsafe HTML.
[x] 7. Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
[x] 8. IsProperty.length at the level of a RichText property should apply to the coreText component.
[x] 8.1. Max. length validation
[x] 8.2. DB schema generation
[ ] 8.3. Property metadata (excluded due to unwarranted complexity: no strict need to have it implemented at the moment)
[x] 9. The type of RichText.formattedText should be mapped to a DB type for variable-length text (varchar for SQL Server, text for PostgreSQL).
[x] 10. RichText serialisation into & deserialisation from JSON objects.
Serialised RichText should have the following shape:
{ "formattedText": string, "coreText": string }
Deserialisation can assume validity of serialised objects that are received. This is due to the fact that only unmodified property values are subject to deserialisation.
[x] 11. Deserialisation of modified RichText values.
ua.com.fielden.platform.web.utils.EntityResourceUtils#convert should be enhanced. Only formattedText need be considered. Validation must be performed.
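The serialised shape from item 10 can be sketched as follows. This is an illustration only: the class RichTextJsonSketch is hypothetical, and a real implementation would presumably rely on the JSON library already used by the platform rather than on hand-written escaping.

```java
/** Hypothetical sketch: writes a RichText value in the shape given in item 10. */
final class RichTextJsonSketch {

    /** Renders a string as a JSON string literal, including the surrounding quotes. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be escaped in JSON.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Produces { "formattedText": string, "coreText": string }. */
    static String toJson(String formattedText, String coreText) {
        return "{ \"formattedText\": " + quote(formattedText)
                + ", \"coreText\": " + quote(coreText) + " }";
    }
}
```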
HTML
Extract core text from formatted text.
Consider using jsoup.
Define a set of rules for markup stripping/transformation, which will be used to form the core text component.
[x] 7.1. Inline tags that modify text style should be stripped (e.g., <b>, <i>, <code>): <b>text</b> ==> text.
[x] 7.2. Blocks should be replaced by the core text of their contents
Formatted text:
<pre>
hello world
</pre>
Core text:
hello world
[x] 7.3. Links should be transformed as follows:
<a href='link'>text</a> ==> text (link)
[x] 7.4. Image links should be transformed as follows:
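Rules 7.1 to 7.3 can be sketched as below. This is a toy, regex-based stand-in that assumes simple, well-formed input; it does not decode HTML entities and is not a substitute for a real HTML parser such as jsoup, which is what the notes above suggest. The class name and the list of block tags are assumptions made for the illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical regex-based sketch of HTML rules 7.1 to 7.3; not a real HTML parser. */
final class HtmlCoreTextSketch {
    // <a href='link'>text</a>, with single or double quotes around the URL.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+[^>]*?href\\s*=\\s*(['\"])(.*?)\\1[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // A small, assumed set of block-level tags; they become whitespace so blocks stay separated.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "</?(?:p|div|pre|blockquote|ul|ol|li|h[1-6]|br)\\b[^>]*>",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_TAG = Pattern.compile("</?[A-Za-z][^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String coreText(String html) {
        String s = LINK.matcher(html).replaceAll("$3 ($2)"); // 7.3: text (link)
        s = BLOCK_TAG.matcher(s).replaceAll(" ");            // 7.2: keep only block contents
        s = ANY_TAG.matcher(s).replaceAll("");               // 7.1: strip inline style tags
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

Block tags are replaced with a space rather than with nothing so that the contents of adjacent blocks do not run together, while inline tags are removed without leaving a gap inside a word.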
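Markdown
Emphasis markers should be stripped: **text** ==> text, *text* ==> text and _text_ ==> text.
[x] 7.10. Newline characters should be removed to facilitate search without a clumsy use of wildcards.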
[x] 7.11. List markers, both bullet and ordered, should be removed. For example, this should transform
1. one
2. two three
- first
- second third
into
one two three first second third
[x] 7.12. Inline HTML should be removed. It is highly unlikely that searching by HTML elements will be needed.
the <b>big</b> bang ==> the big bang
[x] 7.13. HTML blocks should be removed. Markdown already provides a number of useful block structures, so it is highly unlikely that HTML blocks will contain information that needs to be searched.
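Markdown rules 7.10 to 7.12 can be sketched line by line as below. The sketch assumes `-`, `*` or `+` as bullet markers and joins lines with a single space, matching the expected output of the list example in 7.11. It uses regular expressions rather than a Markdown parser, so it is an illustration of the rules, not a CommonMark-compliant implementation. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of Markdown rules 7.10 to 7.12; not a CommonMark parser. */
final class MarkdownCoreTextSketch {
    // Leading bullet ("-", "*", "+") or ordered ("1.", "2)") list marker.
    private static final Pattern LIST_MARKER =
            Pattern.compile("^\\s*(?:[-*+]|\\d+[.)])\\s+");
    private static final Pattern INLINE_HTML = Pattern.compile("</?[A-Za-z][^>]*>");

    static String coreText(String markdown) {
        List<String> parts = new ArrayList<>();
        for (String line : markdown.split("\\R")) {
            String s = LIST_MARKER.matcher(line).replaceFirst(""); // 7.11: drop list markers
            s = INLINE_HTML.matcher(s).replaceAll("").trim();      // 7.12: drop inline HTML
            if (!s.isEmpty()) {
                parts.add(s);
            }
        }
        return String.join(" ", parts);                            // 7.10: no newlines remain
    }
}
```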
HTML sanitization
CommonMark establishes very liberal rules for embedded HTML (4.6 HTML blocks).
Just some examples:
1. A block can start with a closing tag
2. An open tag need not be closed
3. A partial tag need not even be completed
4. The initial tag doesn’t even need to be a valid tag, as long as it starts like one.
To sanitize HTML inside a CommonMark document, there are two approaches:
1. Run the whole document through a sanitizer. This is likely to go awry because a sanitizer treats everything as HTML, which can result in unintended transformation of non-HTML content (e.g., backticks may be escaped).
2. Sanitize only the HTML parts. This approach guarantees that non-HTML parts of a document won't be touched, but requires the additional effort of processing the document and sanitizing only its HTML parts, which can be accomplished with the commonmark-java library.
Since we don't want to modify anything but the unsafe parts of a document, the second approach is preferred.
For HTML sanitization, the OWASP Java HTML sanitizer can be used.
Validate and Sanitize
Given the integrity constraint described above, which ensures the safety of RichText contents, a mechanism for detecting unsafe parts is required so that the validator can do its job.
However, there are certain considerations to be taken into account when using the OWASP Java HTML sanitizer:
The OWASP Java HTML sanitizer does not provide a predicate that would determine validity of a given HTML document.
Policy violations can be tracked via HtmlChangeListener. This is the primary means for the validator to do its job.
If a string sanitizes with no change notifications (via HtmlChangeListener), it is not the case that the input string is necessarily safe to use. Only use the output of the sanitizer.
This is taken from The OWASP Java HTML sanitizer project page.
In case no policy violations are reported, it should be safe to treat the validation as successful, but the correct value (a sanitizer's output) might not necessarily be the same as the validated one (a sanitizer's input).
Therefore, the actual value assigned to a RichText property must be the one produced by the sanitizer.
Validators have no control over the value that ultimately gets assigned, so value substitution must happen somewhere else:
In a definer. This will require an implicit definer to be installed for all RichText properties.
Inside the property setter's body. Running a sanitizer twice for the same input is not efficient and might cause performance issues for large inputs.
Yet another approach is to shift from property validation to value validation: restrict validation to the construction of RichText values and reject invalid inputs there. Validation of properties is then no longer necessary, because every RichText value is valid by construction. For more details see the page on validation.
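Value validation at construction time can be sketched as follows. Everything here is hypothetical: the Sanitizer interface is a stand-in for a real sanitizer that reports policy violations (for the OWASP sanitizer this role is played by HtmlChangeListener), and TOY flags only script elements purely so the example can run. It must not be mistaken for an actual sanitization policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: a RichText value that can only be built from safe input. */
final class RichTextValueSketch {

    /** Stand-in for a real sanitizer: returns its output and records any violations. */
    interface Sanitizer {
        String sanitize(String html, List<String> violations);
    }

    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>.*?</script>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Toy sanitizer for the demo only: it knows about nothing but script elements. */
    static final Sanitizer TOY = (html, violations) -> {
        Matcher m = SCRIPT.matcher(html);
        if (m.find()) {
            violations.add("script");
        }
        return m.replaceAll(""); // replaceAll resets the matcher before replacing
    };

    final String formattedText;

    private RichTextValueSketch(String formattedText) {
        this.formattedText = formattedText;
    }

    /** Rejects unsafe input; otherwise keeps the sanitizer's output, not the raw input. */
    static RichTextValueSketch fromHtml(String input, Sanitizer sanitizer) {
        List<String> violations = new ArrayList<>();
        String output = sanitizer.sanitize(input, violations);
        if (!violations.isEmpty()) {
            throw new IllegalArgumentException("Unsafe HTML: " + violations);
        }
        return new RichTextValueSketch(output);
    }
}
```

Because the only way to obtain a value is through the factory method, the sanitizer runs once per input and the stored text is always the sanitizer's output, which addresses both the value-substitution and the double-sanitization concerns raised above.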
Database mapping
The types for database columns should be chosen as follows:
formattedText: String should be mapped to a column with name propertyName__formattedText of the largest text type with UTF-16 support. For SQL Server this would be NVARCHAR(MAX), for PostgreSQL this would be TEXT. No indexes are required when generating a database schema.
coreText: String should be mapped to a column with name propertyName__coreText of a text type with UTF-8 support with the size specified in attribute length of @IsProperty for propertyName. For both SQL Server and PostgreSQL this would be VARCHAR(length) (SQL Server requires a collation name ending in _UTF8 to support UTF-8 in VARCHAR). Indexes are required when generating a database schema.
Note for future self:
Hibernate type for RichText was implemented to map formattedText using regular StringType, which has proven to work well with NVARCHAR column type in SQL Server.
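The column rules above can be summarised in a short sketch. The class, the Dialect enum and the method names are hypothetical and do not reflect the platform's schema generator; collation handling for SQL Server is deliberately left out.

```java
import java.util.List;

/** Hypothetical sketch of the column rules for a RichText property. */
final class RichTextColumnsSketch {
    enum Dialect { SQL_SERVER, POSTGRESQL }

    /**
     * Column definitions for one RichText property. The length comes from
     * @IsProperty and applies to the core text column only.
     * Note: on SQL Server the VARCHAR column also needs a collation ending in _UTF8.
     */
    static List<String> columns(String propertyName, int length, Dialect dialect) {
        String formattedType = dialect == Dialect.SQL_SERVER ? "NVARCHAR(MAX)" : "TEXT";
        return List.of(
                propertyName + "__formattedText " + formattedType,
                propertyName + "__coreText VARCHAR(" + length + ")");
    }

    /** Column that a comparison on the property would target: the core text. */
    static String searchColumn(String propertyName) {
        return propertyName + "__coreText";
    }
}
```

The searchColumn helper reflects item 5 of the task list: a condition such as prop("richText").eq().val(str) is expected to compare str with the core text column rather than with the formatted text column.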
Future work
Disallow RichText as composite key member. This should be enforced by the verifier.
Update the verifier to allow properties with type RichText.