blackmambahk opened this issue 6 years ago (status: Open)
Two issues:
1) single tags, e.g. <BUILDING NUMBER>
2) double tags, e.g. <<ME>>, used for emphasis.
Whitelist sanitizers like Caja strip unknown tags, so the single-tag case will be removed. Blacklist sanitizers will leave it in, but when the string is inserted into the DOM the browser will treat the pseudo tags as real tags and they will 'vanish'. So whichever sanitizer we use, the content is effectively removed if we apply the sanitization on output.
It is easy to sanitize keyboard input (tags are escaped); the hard part is pasted input.
How do we sanitize it? Is it HTML or is it text?
If we knew it was plain text we could escape everything very quickly, but more often than not it will be HTML, and you can't simply escape HTML.
Carriage returns are also an issue, particularly for us: even though carriage returns don't exist in HTML, we treat them as if they do and have to try to preserve returns even in HTML content.
The issue is further complicated because we don't distinguish between stored HTML and stored text data. Some fields only accept input via a textarea and so can only contain text; others allow RTF. The difference matters: you MUST NOT escape text that comes from or is destined for a textarea, but you MUST escape it if it is RTF.
An example is the short and long description in Specifications. The input uses a TextArea control, so it can only be pure text, but on display it runs through the flex formatter and is output as HTML with CRLF converted to BR tags. So if the user enters a pseudo tag in the textarea it is valid and cannot be escaped (or it would not be editable again), but on display it must be escaped because it is being output as HTML, and we can't possibly post-fix/escape pseudo tags once they are embedded in HTML. This is further compounded by the question of what format was used in Classic: if it was text, we can solve the problem by changing the formatting in Platinum for this specific case from html to text; but if it was HTML-formatted in Classic, we can't simply switch the formatting. The use of flex (as opposed to html) supposedly indicates it was HTML-formatted in Classic.
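For the text path, a minimal sketch of the order of operations: entity-escape first, then convert CRLF to BR, so pseudo tags survive as visible text. (textToHtml is an illustrative name, not the actual flex formatter.)

```javascript
// Hypothetical text-to-HTML formatter: escape entities first (& before < and >
// to avoid double-escaping), then convert line breaks to <br>, so a pseudo tag
// like <BUILDING NUMBER> displays literally instead of vanishing.
function textToHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/\r\n|\r|\n/g, "<br>");
}
```

Note the escape order: ampersands must be handled before the angle brackets, otherwise the entities themselves get double-escaped.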
Similarly with pasted content to generic input controls, this is pure text and should never be sanitized.
Another issue is invalid HTML input, generally caused by cutting and pasting partial tags: the user selects some text in another application, and when copied it includes the opening tag but not the closing tag. If not fixed on input, this will break the application on output display.
The do-it-on-input versus do-it-on-output choice is also very important from a performance perspective: sanitization is a very CPU- and time-expensive process.
On input we can live with, say, a frame's delay while we perform an async sanitization; it affects one user only, one time only.
On output there are generally dozens to hundreds of separate pieces of text to be sanitized at once (say, in a list), and the process is repeated every time the DataSource is refreshed. The sheer number produces longer and longer delays and requires a spinner every time.
On input we know whether it's text or HTML; on output we don't.
It really doesn't make sense to do this on output.
So the solution is multipart. The RTE editor part should be relocated to an iframe so that whilst the user is editing no part of what they enter can impact the rest of the application. When they save or leave the control the content should be sanitized before storing.
Any pasted content should also be sanitised before inserting into the application. As far as I am aware the only place RTF content can be pasted into the app is via the RTE (pasting into textarea or input controls automatically strips all markup).
If this was done no sanitization would be required at runtime which would be a performance boost.
But how to handle historic content that has not been sanitised?
Really this needs handling now. It will be a real pain, but the problem will decline with time. Key to this is being able to distinguish between sanitized and unsanitized content.
The only way I can see to do this is to add a hidden flag into sanitised content. This would be some HTML tag that is invisible to the user. It would either be a tag they would never use or have a unique attribute.
kbd, bdo, bdi, var, and tt are tags which would never normally appear in user content; alternatively, a span tag with the hidden attribute.
This would give a very simple check to see if a string is clean:
string.indexOf("<span hidden></span>")
and it would be completely invisible to the user.
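A sketch of that flag, assuming we prepend the marker on save (the helper names are mine, not existing framework functions):

```javascript
// Hypothetical helpers for the sanitized-content flag described above.
// The marker is an empty hidden span: invisible when rendered, trivially
// detectable with indexOf, and a tag users would never type themselves.
var CLEAN_FLAG = '<span hidden></span>';

// Tag content as sanitized; idempotent, so re-saving doesn't stack markers.
function markSanitized(html) {
  return isSanitized(html) ? html : CLEAN_FLAG + html;
}

// Cheap check used on output to decide whether to skip sanitization.
function isSanitized(html) {
  return html.indexOf(CLEAN_FLAG) !== -1;
}
```

The output path would then call isSanitized() and do nothing for flagged content, falling back to the limited runtime pass only for historic strings.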
So that was the how; now, what to use for sanitization? gg wants to use Caja. It is a standalone Java-based server application, intended mainly for running JS games in isolation in a pseudo-VM on the front end. It is not compatible with IIS and would have to run on its own as a standalone server; I don't think that part is really relevant, but it has a sanitization plugin that can be stripped out of Caja and run on its own, client side.
The plugin looks like it was abandoned 3 years ago and some important updates have not been made (it is not HTML5-aware); it also doesn't, for example, 'fix' single quotes. Documentation seems almost non-existent or has been lost over the years and generally points at blank pages.
But it mostly does the job, I guess.
It is a whitelist-based sanitizer and requires installing Ant, the Java-based task runner, to compile a whitelist. It only supports HTML4, not HTML5; maybe we can work around that with a custom whitelist.
It seems most people push it into an iframe or a web worker because of overhead and conflicts. All our target platforms support web workers.
But really we need to both whitelist certain known tags and blacklist certain known tags, so that we can then allow all other pseudo tags through by escaping them.
A simple whitelist approach won't be sufficient; we either need a sanitizer that supports both black and white lists, or one that whitelists but has a callback to check every node on the fly so we can control the process.
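As a sketch, this is the three-way decision such a callback would have to make per node (the tag sets here are placeholders, not our real lists):

```javascript
// Hypothetical per-tag decision for a combined white/black list sanitizer:
// whitelisted tags are kept, blacklisted tags are dropped with their content,
// and anything else (including pseudo tags) is escaped into visible text
// rather than being silently stripped.
var WHITELIST = { B: true, I: true, P: true, BR: true };
var BLACKLIST = { SCRIPT: true, OBJECT: true, EMBED: true };

function classifyTag(tagName) {
  var t = tagName.toUpperCase();
  if (WHITELIST[t]) return 'keep';
  if (BLACKLIST[t]) return 'drop';
  return 'escape'; // unknown and pseudo tags survive as escaped text
}
```

The 'escape' branch is what a plain whitelist sanitizer cannot express, which is why we need either dual lists or a per-node callback.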
So to summarise:
<max limit> and <<min headroom>> must be preserved and escaped.
Any sanitizer MUST be able to achieve all of the above in order to successfully fix all the issues we currently have.
All sanitizers that work directly with the DOM can only be used for input sanitization; the overhead involved in inserting content into an iframe, then parsing and instantiating it before manipulation can start, rules out this approach for output.
Possible modules:
https://github.com/bevacqua/insane - whitelist, but has filter and transformText callbacks which allow us to manipulate the process on the fly; it would still probably need some rewriting for our purposes.
https://github.com/ecto/bleach - really just the he and fs modules wrapped, but looks very easy to modify for our purposes.
https://github.com/punkave/sanitize-html - supports filter and transformText.
https://github.com/JamesRyanATX/janitor-js - DOM-based; use only as base code.
https://github.com/google/caja/blob/master/src/com/google/caja/plugin/html-sanitizer.js - Caja, whitelist only.
https://github.com/dortzur/simple-sanitizer - requires a DOM; would require rewriting to do what we want, but looks straightforward to do so.
https://github.com/gbirke/Sanitize.js - whitelist, but sophisticated options including transformers and filter.
Regex simple sanitizer:

```javascript
String.prototype.sanitizeHTML = function (white, black) {
  if (!white) white = "b|i|p|br";             // allowed tags
  if (!black) black = "script|object|embed";  // tags removed together with their content
  var e = new RegExp("(<(" + black + ")[^>]*>.*</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
  return this.replace(e, "");
};
```

- black list -> tag and its content are completely removed
- white list -> tags are retained
- other tags are removed but their content is retained
- all attributes of whitelisted tags (the remaining ones) are removed
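A quick usage check of the regex sanitizer, restated here so it runs standalone (behaviour on edge cases like nested or multi-line blacklisted elements is not guaranteed by this pattern):

```javascript
// The regex simple sanitizer from above, restated for a usage example.
String.prototype.sanitizeHTML = function (white, black) {
  if (!white) white = "b|i|p|br";             // allowed tags
  if (!black) black = "script|object|embed";  // tags removed with their content
  var e = new RegExp("(<(" + black + ")[^>]*>.*</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
  return this.replace(e, "");
};

var dirty = '<script>alert(1)</script><b class="x">hi</b><u>underline</u>';
var clean = dirty.sanitizeHTML();
// The script element is removed entirely, the <u> tags are stripped but their
// content is kept, and the class attribute on the whitelisted <b> is removed:
// clean === '<b>hi</b>underline'
```

Note that because `.` does not match newlines, a blacklisted element whose content spans lines will not be fully removed; that is one reason a parser-based approach is still on the table.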
Most non-DOM sanitizers are built on top of an HTML parser. Caja is based on the old Java SAX parser; most of the rest are derived from John Resig's original htmlParser. Once you have it parsed, walking the tree and deciding what to keep and what not is fairly straightforward, even if time-consuming.
DOM walking/sanitizing code:

```javascript
var tagWhitelist_ = {
  'A': true, 'B': true, 'BODY': true, 'BR': true, 'DIV': true, 'EM': true,
  'HR': true, 'I': true, 'IMG': true, 'P': true, 'SPAN': true, 'STRONG': true
};
var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;

  function makeSanitizedCopy(node) {
    var newNode;
    if (node.nodeType == Node.TEXT_NODE) {
      newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        newNode.appendChild(makeSanitizedCopy(node.childNodes[i]));
      }
    } else {
      // non-whitelisted element: replaced with an empty fragment,
      // so the element and its children are dropped
      newNode = document.createDocumentFragment();
    }
    return newNode;
  }

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
}
```
Conclusion: no current software can satisfy all our requirements, though some could form the basis of code that would come close.
We need to split sanitization into two phases: input and output.
INPUT: In the input phase we should be able to achieve complete sanitization. The only place in the app that HTML content can be entered is via the RTE, either through keyboard input or pasting from another app.
The pasted content will be injected into a hidden iframe with scripting disabled. This solves the problem of parsing and validating the content: with a valid structure we can then walk the DOM, removing blacklisted nodes, leaving whitelisted nodes, and converting unknown nodes into text nodes by escaping the tags.
We can't use an iframe for the editor itself, for two reasons:
1) You can't tab into or out of an iframe (possibly fixable by code).
2) Our editor expands as you type instead of being a fixed size; an iframe doesn't support this, and we would need code constantly monitoring the editor size inside the iframe and posting it back to the parent.
So we use a contenteditable, but strip the string before insertion, as this seems to work OK at the moment. The overhead will be acceptable for a single instance and a one-time process. The cleaned-up input will be tagged as such.
OUTPUT: In the output phase we recognise that the problem should have been completely solved by the input phase, and we will do nothing for content that is flagged as having been processed on input.
For historic content that has not been processed, we have to acknowledge that we cannot sanitize it completely at runtime: the overhead is too great given the number of simultaneous instances that need processing, and it would need to be repeated on every refresh. We also recognise that six months down the line the input-phase fix will have effectively eliminated the problem.
The aim here, then, is to sanitize in a performant but limited way, i.e. no worse than we currently do. Probably the best we can achieve is removal of everything that is blacklisted and a rebalancing of tags. We will scan for strings that are pure text, encode entities, and preserve CRLF.
This could probably be done by adding a couple of functions to our existing code rather than needing a new module.
Sample tree balancing code:

```javascript
// balance:
// - takes an excerpted or truncated XHTML string
// - returns a well-balanced XHTML string
function balance(string) {
  // Check for broken tags, e.g. <stro
  // A < after the last > indicates a broken tag
  if (string.lastIndexOf("<") > string.lastIndexOf(">")) {
    // Truncate the broken tag
    string = string.substring(0, string.lastIndexOf("<"));
  }
  // Check for broken elements, e.g. <strong>Hello, w
  // Get an array of all tags (start, end, and self-closing)
  var tags = string.match(/<[^>]+>/g) || [];
  var stack = [];
  for (var i = 0; i < tags.length; i++) {
    if (tags[i].search("/") <= 0) {
      // start tag -- push onto the stack
      stack.push(tags[i]);
    } else if (tags[i].search("/") == 1) {
      // end tag -- pop off of the stack
      stack.pop();
    }
    // otherwise: self-closing tag -- do nothing
  }
  // The stack should now contain only the start tags of the broken elements,
  // with the most deeply-nested start tag at the top
  while (stack.length > 0) {
    // pop the unmatched tag off the stack
    var endTag = stack.pop();
    // get just the tag name
    endTag = endTag.substring(1, endTag.search(/[ >]/));
    // append the end tag
    string += "</" + endTag + ">";
  }
  // Return the well-balanced XHTML string
  return string;
}
```
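Usage of the balancer, restated compactly so it runs standalone (note it makes no attempt to handle void tags like an unslashed <br>, a limitation shared with the original):

```javascript
// Compact restatement of the balance() sketch above, for a usage check.
function balance(str) {
  // A trailing < after the last > means a broken tag fragment; truncate it.
  if (str.lastIndexOf('<') > str.lastIndexOf('>')) {
    str = str.substring(0, str.lastIndexOf('<'));
  }
  var tags = str.match(/<[^>]+>/g) || [];
  var stack = [];
  for (var i = 0; i < tags.length; i++) {
    var slash = tags[i].indexOf('/');
    if (slash < 0) stack.push(tags[i]);   // start tag
    else if (slash === 1) stack.pop();    // end tag
    // slash elsewhere: self-closing, ignore
  }
  // Close any still-open elements, innermost first.
  while (stack.length > 0) {
    var t = stack.pop();
    str += '</' + t.substring(1, t.search(/[ >]/)) + '>';
  }
  return str;
}

balance('<b><i>Hello, w');    // -> '<b><i>Hello, w</i></b>'
balance('<p>partial <stro');  // -> '<p>partial </p>'
```

This is the kind of cheap, parser-free repair the output phase can afford: it fixes the pasted-partial-tag breakage without a full sanitization pass.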
Sample non-HTML text with pseudo tags:
Description: Project Name: RMD Omni Sponsor Mailing 2018
100696 - 100696
Requested by: Amber Barnes *
Description: There are two different versions letter.
Please see the Vendor Mailing File - RMD Omni Plans column B for which letter to use for the mailing.
Each sponsor should receive the
Appropriate letter version (Force Out Auto Pay OMNI Sponsor Letter or QJSA OMNI Sponsor Letter) AND two generic attachments:
1. Instructions to Submit Participant Data
2. Date and Status Template
Please use return address PO Box 1583 Hartford, CT 06144 for the envelope.
Please provide merged mail proofs for me to review and approve before mailing.
VERSION DETERMINED BY <Letter Type> FIELD
****************
1a) Letter File Name for version 1: Force Out Auto Pay OMNI Sponsor Letter.doc
Qty: 492
1b) Letter File Name for version 2: QJSA OMNI Sponsor Letter.doc
Qty: 4
Flat Size: 8.5 x 11
Finished Size: 8.5 x 11
Stock: 60# Opaque Text
Printing: Black 2 sides
Pages: 3
Finishing: Letterfold
Special Instructions: Andrews to re-position the address block to accommodate #10 Window envelopes.
Data merge the following fields:
<<MailDate>>
<<PlanAdmin>>
<<PlanName>>
<<PlanAdd>>
<<PlanAdd2>>
<<PlanCity>>, <<PlanState>> <<PlanZip>>
<<PlanNumber>> Plan Name: <<PlanName>>
*******
2) Sample letter
File Name: Instructions to Submit Participant Data.pdf
1 page (1 sheet 1 side)
8.5 x 11
60# offset
Black simplex
(static)
*******
3) Sample letter
File Name: Date and Status Template.pdf
1 page (1 sheet 1 side)
8.5 x 11
60# offset
Black simplex
(static)
Note: any call to the format function escapeHTML MUST be for user content. It is called from these formats:
an 'a' prefix means don't wrap links; an 'n' prefix means no text wrapping.
flex means the source could potentially be from Classic in HTML format; html means it is HTML content and does not exist in Classic; text means it is pure text displayed as HTML.
Critically, the return of this function is always HTML and is marked as user content only.
There are 345 TextArea and 47 TextEditor controls.
A problem noted above is text that is input via a TextArea and is then output as HTML using the flex or html formats.
Text can contain things that need encoding for HTML display; it could also be preformatted, containing multiple sequential spaces or tabs.
The entities need escaping, and the spaces and formatting need preserving by wrapping in a pre tag.
The issue is how to identify these situations.
We could encode on save from a textarea and decode when putting the value back into the TextArea. This is the most efficient approach, as it happens only once, and we know for sure the content is text. But how do we know every TextArea output is really meant to be HTML?
Given there are 345 text controls but only 47 RTF controls, it seems the vast majority are text input meant for text output, so encode/decode at input is probably not feasible. But who knows?
We could encode at run time on output, but how do we know that one string is pure text (and should be encoded and is preformatted) while another string is just HTML and shouldn't be encoded?
Well, if the format is set to 'text', that is an indicator that it is pure text and can be safely encoded. In reality, though, there are formats set to 'html' that are really pure text. We need to support these, as otherwise they are regarded as framework 'bugs', of which this issue is an example. So we scan every string to see if it contains HTML, and since that is problematic we approximate it by looking for closing tags and void elements like img.
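That closing-tag/void-element approximation might look like this (the regexes and function name are illustrative, not our production check):

```javascript
// Hypothetical heuristic: treat a string as HTML only if it contains a
// closing tag or a common void element. Pseudo tags like <Letter Type> or
// <<PlanName>> match neither, so they stay classified as plain text and
// get entity-encoded on output instead of vanishing.
function looksLikeHtml(s) {
  return /<\/[a-z][a-z0-9]*\s*>/i.test(s) ||
         /<(br|hr|img|input)\b[^>]*>/i.test(s);
}
```

Like any heuristic it can be fooled (e.g. hand-typed text that happens to contain </b>), but it errs in the safe direction for the sample content above.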
If you wrap text in pre to preserve formatting, it won't word-wrap unless the user put in returns; that's the point of preformatted content: it is preformatted by the user.
You can add styling to make it wrap, but then you might break someone's preformatting.
So what to do? Every choice risks being wrong in some case.
And should we try to convert 'links' in text to be clickable?
word-wrap: break-word; /* alias for overflow-wrap: IE, Chrome, Safari */
overflow-wrap: break-word; /* Edge, Firefox, Chrome, Safari */
word-wrap is the MS proprietary name, implemented by most browsers; overflow-wrap is the standard, implemented by all modern browsers. But notice the subtle issue: you need both to get full coverage!
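Putting the pieces together, a sketch of the text output path using both wrap properties (assumptions: inline styles are acceptable here, and wrapPreformatted is my name, not an existing function):

```javascript
// Hypothetical wrapper for displaying stored plain text as HTML:
// escape entities, then wrap in <pre> so spaces, tabs, and returns are
// preserved, with white-space:pre-wrap plus BOTH wrap properties so long
// lines still break in every browser.
function wrapPreformatted(text) {
  var escaped = text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
  return '<pre style="white-space:pre-wrap;' +
         'word-wrap:break-word;overflow-wrap:break-word;">' +
         escaped + '</pre>';
}
```

The pre-wrap value is a compromise: it keeps the user's whitespace but still allows soft wrapping, which is the risk discussed above of breaking someone's deliberate preformatting.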
Yet another significant issue: removeFormat. It was decided that the behaviour of the browser's removeFormat command was not what we wanted, as it only removes formatting (styles); instead we wanted it to remove all markup, including the parent tag. Problems:
1) It destroys the undo stack. At best; at worst it gets out of order and some parts become un-undoable: our changes become the top of the stack, so undo pops off the bottom and undoes things from before our change, and in the end our change itself cannot be undone.
2) What is 'all markup' anyway, and how do you remove it? If you select text-image-text, is the image markup? Yes, so remove it? That doesn't make sense. What if you select part of one tag, the whole of the next tag, and part of another? What do you remove? The parent tags are needed by the partially unselected text, and there might be images nested in the tags, maybe many layers deep. What if the text is nested several layers deep rather than being at the top? What if this was HTML pasted in from outside?
So what can we do? Find all nodes in the selection (hard to do, but we recently created similar code in deliverables, so we can adapt that), then walk all the nodes and check each one against a 'blacklist'; if it is on the list we remove and replace it. Since we are removing nodes, we need to process children first or all the references will be wrong. The undo stack is still screwed up, but that may be the best we can do.
https://padmin.workamajig.com/platinum/?aid=wjadmin.brain.edit&k=417149
Marlene - Found this issue today between Classic and Platinum. This is in Job ID # 1806295, related to the print specification sections. See the below (3) screenshots. Apparently, Platinum does not like the following characters: < >. The specification preview screen does not show the words between those characters, but they do show in the editing specification screen. I think this is a glitch in the system and we need it fixed. Please share and let me know. Thanks,