fieldenms / tg

Trident Genesis
MIT License

Property type: RichText #348

Open 01es opened 8 years ago

01es commented 8 years ago

Description

In order to support formatted text that could be used as advanced comments and descriptions there is a need to develop a new type RichText. The need and implementation complexity of this type lies in its dual nature – as formatted text expressed in a markup language and core text (i.e., without markup).

For example, given Markdown as the markup language, the following are examples of formatted text and core text respectively:

`cons` **does not** evaluate its arguments in a *lazy* language. 
cons does not evaluate its arguments in a lazy language.

For values of RichText to be searchable, the markup needs to be ignored – only the core text needs to be searched. However, in order to preserve formatting for display purposes, there is also the need to store the text value with markup (formatted text).

Therefore, a single value of the RichText type should provide access to both core and formatted text. From the search perspective, which happens at the database engine level and should ignore formatting, there is a need to persist both the core and formatted values. EQL should interpret a RichText value as its core text component.
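A minimal sketch of such a dual-natured value is shown here, purely for illustration. The class name `RichTextSketch` and its methods are hypothetical and are not the actual TG implementation; the point is only that one value carries both components and that searching consults the core text alone.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical sketch of a RichText value; not the actual TG type. */
final class RichTextSketch {
    private final String formattedText; // text with markup, kept for display
    private final String coreText;      // text without markup, used for search

    RichTextSketch(final String formattedText, final String coreText) {
        this.formattedText = Objects.requireNonNull(formattedText);
        this.coreText = Objects.requireNonNull(coreText);
    }

    String formattedText() {
        return formattedText;
    }

    String coreText() {
        return coreText;
    }

    /** Case-insensitive search that ignores markup: only the core text is consulted. */
    boolean contains(final String term) {
        return coreText.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    @Override
    public boolean equals(final Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof RichTextSketch)) {
            return false;
        }
        final RichTextSketch that = (RichTextSketch) other;
        return formattedText.equals(that.formattedText) && coreText.equals(that.coreText);
    }

    @Override
    public int hashCode() {
        return Objects.hash(formattedText, coreText);
    }

    /** Mirrors the idea that a RichText value is interpreted as its core text. */
    @Override
    public String toString() {
        return coreText;
    }
}
```

With the Markdown example above, a search for `does not evaluate` would match the core text, while a search for `**` would not, even though the formatted text contains it.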

Initially, Markdown was used as the markup language for RichText, but then it was decided that HTML should be used instead.

HTML

Markdown

HTML sanitization

CommonMark establishes very liberal rules for embedded HTML (4.6 HTML blocks). Just some examples:

  1. A block can start with a closing tag.
  2. An open tag need not be closed.
  3. A partial tag need not even be completed.
  4. The initial tag doesn’t even need to be a valid tag, as long as it starts like one.

To sanitize HTML inside a CommonMark document, there are 2 approaches:

  1. Run the whole document through a sanitizer. This is likely to go awry because a sanitizer treats everything as HTML, which can result in unintended transformation of non-HTML content (e.g., backticks may be escaped).
  2. Sanitize only the HTML parts. This approach guarantees that non-HTML parts of a document won't be touched, but requires additional effort of processing a document and sanitizing only the HTML parts, which can be accomplished with the commonmark-java library.

Since we don't want to modify anything but the unsafe parts of a document, the second approach is preferred.

For HTML sanitization, the OWASP Java HTML sanitizer can be used.
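The second approach can be sketched roughly as follows. This is a library-free illustration under stated assumptions: `Segment`, `SelectiveSanitizerSketch` and `sanitizeHtmlOnly` are hypothetical names, the segments stand in for the nodes a CommonMark parser would produce (in commonmark-java, raw HTML appears as `HtmlBlock` and `HtmlInline` nodes), and the sanitizer is passed in as a plain function that could wrap an OWASP policy.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: sanitize only the HTML parts of an already parsed document. */
final class SelectiveSanitizerSketch {

    /** A piece of a parsed document: either raw HTML or any other content. */
    static final class Segment {
        final boolean html;
        final String text;

        Segment(final boolean html, final String text) {
            this.html = html;
            this.text = text;
        }
    }

    /**
     * Applies the sanitizer to HTML segments only.
     * Non-HTML segments (code spans, emphasis, plain text) are copied verbatim.
     */
    static String sanitizeHtmlOnly(final List<Segment> segments,
                                   final UnaryOperator<String> htmlSanitizer) {
        final StringBuilder out = new StringBuilder();
        for (final Segment segment : segments) {
            out.append(segment.html ? htmlSanitizer.apply(segment.text) : segment.text);
        }
        return out.toString();
    }
}
```

Because non-HTML segments never reach the sanitizer, content such as a backtick-delimited code span containing `<` cannot be escaped or altered by it.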

Validate and Sanitize

Given the above-described integrity constraint that ensures the safety of RichText contents, a mechanism for detecting unsafe parts is required so that the validator can do its job. However, there are certain considerations to be taken into account when using the OWASP Java HTML sanitizer:

  1. The OWASP Java HTML sanitizer does not provide a predicate that would determine validity of a given HTML document.
  2. Policy violations can be tracked via HtmlChangeListener. This is the primary means for the validator to do its job.
  3. If a string sanitizes with no change notifications (via HtmlChangeListener), it is not the case that the input string is necessarily safe to use. Only use the output of the sanitizer.

    This statement is taken from the OWASP Java HTML sanitizer project page. In case no policy violations are reported, it should be safe to treat the validation as successful, but the correct value (the sanitizer's output) might not necessarily be the same as the validated one (the sanitizer's input). Therefore, the actual value assigned to a RichText property must be the one produced by the sanitizer. Validators have no control over the value that ultimately gets assigned, so value substitution must happen somewhere else:

    • In a definer. This will require an implicit definer to be installed for all RichText properties.
    • Inside the property setter's body. The sanitizer would then run twice for the same input (once in the validator and once in the setter), which is inefficient and might cause performance issues for large inputs.

    Yet another approach is to shift from property validation to value validation. Specifically, to limit the validation to construction of RichText values by prohibiting invalid inputs. Then, validation of properties would no longer be necessary due to the invariant that guarantees validity of RichText values.

For more details see the page on validation.
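The value-validation approach can be illustrated with a minimal sketch. Everything named here is an assumption made for the example: `ValidatedRichTextSketch`, its nested `Sanitizer` interface and `fromHtml` are hypothetical, and the `Sanitizer` interface only imitates the idea of a sanitizer that reports policy violations to a listener, it is not the OWASP API. Construction rejects input with reported violations and otherwise stores the sanitizer's output rather than its input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: validity is established once, when the value is constructed. */
final class ValidatedRichTextSketch {

    /** Assumed abstraction: returns sanitized output and reports each policy violation. */
    interface Sanitizer {
        String sanitize(String input, Consumer<String> onViolation);
    }

    private final String formattedText;

    private ValidatedRichTextSketch(final String formattedText) {
        this.formattedText = formattedText;
    }

    /**
     * Runs the sanitizer exactly once.
     * Input with reported violations is rejected; otherwise the sanitizer's output is kept.
     */
    static ValidatedRichTextSketch fromHtml(final String input, final Sanitizer sanitizer) {
        final List<String> violations = new ArrayList<>();
        final String output = sanitizer.sanitize(input, violations::add);
        if (!violations.isEmpty()) {
            throw new IllegalArgumentException("Unsafe input: " + violations);
        }
        // Even without violations the output may differ from the input,
        // so only the output is stored.
        return new ValidatedRichTextSketch(output);
    }

    String formattedText() {
        return formattedText;
    }
}
```

With the constructor private, every instance has passed through the sanitizer, so a property holding such a value would need no separate safety validation, and the sanitizer runs only once per input.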

Database mapping

The types for database columns should be chosen as follows:

  1. formattedText: String should be mapped to a column with name propertyName__formattedText of the largest text type with UTF-16 support. For SQL Server this would be NVARCHAR(MAX), for PostgreSQL this would be TEXT. No indexes are required when generating a database schema.
  2. coreText: String should be mapped to a column with name propertyName__coreText of a text type with UTF-8 support with the size specified in attribute length of @IsProperty for propertyName. For both SQL Server and PostgreSQL this would be VARCHAR(length) (SQL Server requires a collation name ending in _UTF8 to support UTF-8 in VARCHAR). Indexes are required when generating a database schema.
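As a rough illustration of the mapping above, the column definitions could be derived along these lines. The names `RichTextColumnsSketch`, `Dialect` and `columnDefinitions`, as well as the `utf8Collation` parameter, are hypothetical and not part of TG's schema generation; index creation is left out of the sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of deriving column definitions for a RichText property. */
final class RichTextColumnsSketch {

    enum Dialect { SQL_SERVER, POSTGRESQL }

    /**
     * Returns the definitions for the formatted-text and core-text columns, in that order.
     * The utf8Collation argument is used for SQL Server only and is supplied by the caller.
     */
    static List<String> columnDefinitions(final String propertyName,
                                          final int length,
                                          final Dialect dialect,
                                          final String utf8Collation) {
        final String formatted = propertyName + "__formattedText";
        final String core = propertyName + "__coreText";
        switch (dialect) {
            case SQL_SERVER:
                return Arrays.asList(
                        formatted + " NVARCHAR(MAX)",
                        core + " VARCHAR(" + length + ") COLLATE " + utf8Collation);
            case POSTGRESQL:
                return Arrays.asList(
                        formatted + " TEXT",
                        core + " VARCHAR(" + length + ")");
            default:
                throw new IllegalArgumentException("Unsupported dialect: " + dialect);
        }
    }
}
```

One detail worth keeping in mind when choosing length: in SQL Server the size of VARCHAR(n) is expressed in bytes, so with a UTF-8 collation a value containing multi-byte characters holds fewer than n characters.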

Note for future self:

Future work

homedirectory commented 1 year ago

Note: the property verifier will have to be updated to support the new property type RichText