Sanitise on input, escape on output. (Trac #561)

elgg-gitbot commented 11 years ago

Original ticket http://trac.elgg.org/ticket/561 on 38885458-12-20 by trac user judgej, assigned to unknown.

Elgg version: 1.1

If I use characters in a display name that would normally be used to define HTML tags, they get converted to HTML entities. I don't believe this is the correct action to take.

For example, if I enter the name "Admin>", then it saves the name "Admin>". If this is an input mapping, then it could have repercussions throughout the system. For example, if I used '>' or '<' in a password, then can I be sure those characters are actually going to be stored in the password? If I have a five character limit on an input field, then the what happens when I enter '&&&&&' and find the system expands it to '&&&&&' before it attempts to store it? What if I am entering data that has nothing whatsoever to do with HTML or XML? What relevance is that kind of 'sanitisation' of input?

Mapping of special XML characters to entities is a function of the output (including when displayed as pre-filled form items). It is only relevent when that kind of mapping is important to the output, such as when creating XML or HTML. It has no place being done in the path between the user and the database.

I believe this is a very important issue, because I have seen other projects flounder when they do not follow the simple rule of keeping XML mapping out of the path from the user to the database. It is just asking for double-encoding to happen in various output points, and then you are in serious trouble because your user-entered data has effectively been corrupted.

elgg-gitbot commented 11 years ago

trac user judgej wrote on 38885464-02-07

This is a similar issue to systems that attempt to sanitise input data by preparing it for database storage (i.e. escaping quotes and backslashes) too early. They run into trouble causing escape sequences to start appearing at the output, and then they need to put extra reverse-sanitisers in to fix that, and then that causes further problems and the whole thing gets ten times more complicated than it should be.

elgg-gitbot commented 11 years ago

trac user judgej wrote on 38885485-06-06

Another example is the 'location' field in the profile. If I enter "Tyne & Wear" then I expect it to be stored exactly like that. Instead, two things happen:

The string I enter is converted to "Tyne & Wear" which then gets stored like that.
On my profile page, the location is then displayed without any further conversion. This is dangerous because tags may have been added to that profile through a different API and may not contain HTML-safe characters.

In addition, this stored value becomes a new tag "Tyne & Wear" - but why? The tag should be "Tyne & Wear", because that is what is stored. This is where the output XML conversion is done too early, and then starts to impact other stages (ie. the tag generation) that actually need the raw data to operate on, and not the data that happens to have been converted for a specific output media (a web browser).

Supposing that location is now sent to a vCard, which is a non-XML text file? My address would appear as "Tyne & Wear", which is patently wrong.

Trust me - this must be sorted out and some very clear rules put into place at this early stage, otherwise you will have big data problems down the line, and will end up with so many layers of conversion that it could result in security problems down the line.

elgg-gitbot commented 11 years ago

Title changed from Strange mapping of display name characters to Sanitise on output not input [Was: Strange mapping of display name characters] by trac user marcus on 38996756-02-04

elgg-gitbot commented 11 years ago

trac user marcus wrote on 39115060-12-05

(In [svn:2709]) Refs #561: Split filtering into separate function

elgg-gitbot commented 11 years ago

trac user marcus wrote on 39117387-06-03

I am beginning to think that maybe we shouldn't be using kses for sanitisation in any case - it would appear that the project is dead (hasn't been updated since 2005), and I don't think we should be adopting it as part of elgg.

There is a case for stripping certain tags from input by default, but I agree that &amp encoding should not be performed.

elgg-gitbot commented 11 years ago

cash wrote on 39118718-07-05

As a point of reference, Wordpress and Drupal both use kses (though both projects may be updating the kses code since it isn't being maintained). Joomla uses custom filtering code.

Joomla has filtering on both input and output:[[BR]]

http://api.joomla.org/Joomla-Framework/Filter/JFilterInput.html [[BR]]

http://api.joomla.org/Joomla-Framework/Filter/JFilterOutput.html [[BR]]

Drupal only filters on output for more flexibility. They have a very useful extensible filtering system.[[BR]]

http://www.lullabot.com/articles/drupal_input_formats_and_filters [[BR]]

http://api.drupal.org/api/file/modules/filter/filter.module/7 [[BR]]

elgg-gitbot commented 11 years ago

trac user marcus wrote on 39120071-03-04

(In [svn:2724]) Refs #561: Removed &amp encoding from kses but retaining script input and entities.

elgg-gitbot commented 11 years ago

trac user marcus wrote on 39166783-12-15

(In [svn:3009]) Closes #828: Quite correct - rather tired - arrays are individually trimmed - non-arrays are not. Closes #714: Input filtering now triggers on a plugin hook, this allows plugins to provide other filtering methods than kses (Refs #561).

elgg-gitbot commented 11 years ago

trac user marcus wrote on 39443423-07-15

Note to self: At some point evaluate turning filtering off by default. For this to happen, output views must either filter tags or in some way sanitise output (should already occur).

elgg-gitbot commented 11 years ago

trac user darkwingduck wrote on 39554848-08-25

I would only filter on output (or on input, keeping the original stream) and keep the filtered stream in the db in a cache field, keyed with a hash (crc, preferably) of filter names + versions used to filter the cached data. And optionally cache the current filter chain hash in php.

So it would be flexible, open to modifications on existing data, and would require nearly the same cpu time with input-only filtering. Drupal does this right, though they have some small issues with their interface.

We may have input and output filters (or an input/output option on each filter), but then I wouldn't use an input filter on my server anyway, so I always have the original data.

elgg-gitbot commented 11 years ago

trac user darkwingduck wrote on 39554854-11-22

And optionally cache the current filter chain hash in php.

I mean, hold it in a php variable =)

elgg-gitbot commented 11 years ago

davetosh wrote on 39678375-08-30

Since v1.6, Elgg has switched to htmlawed http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php - how are people finding that?

elgg-gitbot commented 11 years ago

cash wrote on 39678589-04-22

I think the large issue is not what is being used to do the filtering, but when the filtering/escaping is happening. The general practice is to filter input for bad stuff and escape the output to get proper html (some also filter output using a different set of filters). Elgg has a tendency to escape the input causing problem elsewhere.

Here are some relevant articles:

http://devzone.zend.com/article/1767

http://shiflett.org/blog/2005/feb/my-top-two-php-security-practices

http://terrychay.com/blog/2007/12/10/php-advent-security-filter-input-escape-output/

elgg-gitbot commented 11 years ago

cash wrote on 39752539-03-12

To be more explicit - htmLawed as configured escapes certain characters on input rather than just restricting itself to filtering.

elgg-gitbot commented 11 years ago

Milestone changed to Elgg 1.8 by brettp on 40084594-06-04

elgg-gitbot commented 11 years ago

Title changed from Sanitise on output not input [Was: Strange mapping of display name characters] to Sanitise on input, escape on output. by brettp on 40084594-06-04

elgg-gitbot commented 11 years ago

trac user necronet wrote on 40471951-08-22

wouldt be a good idea to add a parameter on:

function serialise_object_to_xml($data, $name = "", $n = 0, $specialChar=true)

the boolean will allowed to know wether the specialChar should be called or not a good example of false would be for RESTful Methods or specify by the object....

elgg-gitbot commented 11 years ago

ewinslow wrote on 40693177-06-10

It seems like the main problem is that the current filtering/sanitization system assumes that every input is html (htmLawed is run on all inputs with reckless abandon). But I don't always mean to input html.

It seems like it might be ok to have htmlawed make it standards compliant if it's meant to be html input. For example, <p>&</p> becomes <p>&</p>.

However, when you input plain text (or anything else), you don't want that to happen. For example Jack & Jill should not become Jack & Jill. I also wouldn't like it if one of my blog titles was "Introducing Githubissues.

Githubissues is a development platform for aggregating issues.

elgg-gitbot / test

Sanitise on input, escape on output. (Trac #561) #58