Open elgg-gitbot opened 11 years ago
trac user judgej wrote on 38885464-02-07
This is a similar issue to systems that attempt to sanitise input data by preparing it for database storage (i.e. escaping quotes and backslashes) too early. They run into trouble causing escape sequences to start appearing at the output, and then they need to put extra reverse-sanitisers in to fix that, and then that causes further problems and the whole thing gets ten times more complicated than it should be.
trac user judgej wrote on 38885485-06-06
Another example is the 'location' field in the profile. If I enter "Tyne & Wear" then I expect it to be stored exactly like that. Instead, two things happen:
In addition, this stored value becomes a new tag "Tyne & Wear" - but why? The tag should be "Tyne & Wear", because that is what is stored. This is where the output XML conversion is done too early, and then starts to impact other stages (ie. the tag generation) that actually need the raw data to operate on, and not the data that happens to have been converted for a specific output media (a web browser).
Supposing that location is now sent to a vCard, which is a non-XML text file? My address would appear as "Tyne & Wear", which is patently wrong.
Trust me - this must be sorted out and some very clear rules put into place at this early stage, otherwise you will have big data problems down the line, and will end up with so many layers of conversion that it could result in security problems down the line.
Title changed from Strange mapping of display name characters
to Sanitise on output not input [Was: Strange mapping of display name characters]
by trac user marcus on 38996756-02-04
trac user marcus wrote on 39115060-12-05
(In [svn:2709]) Refs #561: Split filtering into separate function
trac user marcus wrote on 39117387-06-03
I am beginning to think that maybe we shouldn't be using kses for sanitisation in any case - it would appear that the project is dead (hasn't been updated since 2005), and I don't think we should be adopting it as part of elgg.
There is a case for stripping certain tags from input by default, but I agree that & encoding should not be performed.
cash wrote on 39118718-07-05
As a point of reference, Wordpress and Drupal both use kses (though both projects may be updating the kses code since it isn't being maintained). Joomla uses custom filtering code.
Joomla has filtering on both input and output:[[BR]]
http://api.joomla.org/Joomla-Framework/Filter/JFilterInput.html [[BR]]
http://api.joomla.org/Joomla-Framework/Filter/JFilterOutput.html [[BR]]
Drupal only filters on output for more flexibility. They have a very useful extensible filtering system.[[BR]]
http://www.lullabot.com/articles/drupal_input_formats_and_filters [[BR]]
http://api.drupal.org/api/file/modules/filter/filter.module/7 [[BR]]
trac user marcus wrote on 39120071-03-04
(In [svn:2724]) Refs #561: Removed & encoding from kses but retaining script input and entities.
trac user marcus wrote on 39166783-12-15
(In [svn:3009]) Closes #828: Quite correct - rather tired - arrays are individually trimmed - non-arrays are not. Closes #714: Input filtering now triggers on a plugin hook, this allows plugins to provide other filtering methods than kses (Refs #561).
trac user marcus wrote on 39443423-07-15
Note to self: At some point evaluate turning filtering off by default. For this to happen, output views must either filter tags or in some way sanitise output (should already occur).
trac user darkwingduck wrote on 39554848-08-25
I would only filter on output (or on input, keeping the original stream) and keep the filtered stream in the db in a cache field, keyed with a hash (crc, preferably) of filter names + versions used to filter the cached data. And optionally cache the current filter chain hash in php.
So it would be flexible, open to modifications on existing data, and would require nearly the same cpu time with input-only filtering. Drupal does this right, though they have some small issues with their interface.
We may have input and output filters (or an input/output option on each filter), but then I wouldn't use an input filter on my server anyway, so I always have the original data.
trac user darkwingduck wrote on 39554854-11-22
And optionally cache the current filter chain hash in php.
I mean, hold it in a php variable =)
davetosh wrote on 39678375-08-30
Since v1.6, Elgg has switched to htmlawed http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/index.php - how are people finding that?
cash wrote on 39678589-04-22
I think the large issue is not what is being used to do the filtering, but when the filtering/escaping is happening. The general practice is to filter input for bad stuff and escape the output to get proper html (some also filter output using a different set of filters). Elgg has a tendency to escape the input causing problem elsewhere.
Here are some relevant articles:
http://devzone.zend.com/article/1767
http://shiflett.org/blog/2005/feb/my-top-two-php-security-practices
http://terrychay.com/blog/2007/12/10/php-advent-security-filter-input-escape-output/
cash wrote on 39752539-03-12
To be more explicit - htmLawed as configured escapes certain characters on input rather than just restricting itself to filtering.
Milestone changed to Elgg 1.8
by brettp on 40084594-06-04
Title changed from Sanitise on output not input [Was: Strange mapping of display name characters]
to Sanitise on input, escape on output.
by brettp on 40084594-06-04
trac user necronet wrote on 40471951-08-22
wouldt be a good idea to add a parameter on:
function serialise_object_to_xml($data, $name = "", $n = 0, $specialChar=true)
the boolean will allowed to know wether the specialChar should be called or not a good example of false would be for RESTful Methods or specify by the object....
ewinslow wrote on 40693177-06-10
It seems like the main problem is that the current filtering/sanitization system assumes that every input is html (htmLawed is run on all inputs with reckless abandon). But I don't always mean to input html.
It seems like it might be ok to have htmlawed make it standards compliant if it's meant to be html input. For example, <p>&</p>
becomes <p>&</p>
.
However, when you input plain text (or anything else), you don't want that to happen. For example Jack & Jill
should not become Jack & Jill
. I also wouldn't like it if one of my blog titles was "Introducing Githubissues.
Original ticket http://trac.elgg.org/ticket/561 on 38885458-12-20 by trac user judgej, assigned to unknown.
Elgg version: 1.1
If I use characters in a display name that would normally be used to define HTML tags, they get converted to HTML entities. I don't believe this is the correct action to take.
For example, if I enter the name "Admin>", then it saves the name "Admin>". If this is an input mapping, then it could have repercussions throughout the system. For example, if I used '>' or '<' in a password, then can I be sure those characters are actually going to be stored in the password? If I have a five character limit on an input field, then the what happens when I enter '&&&&&' and find the system expands it to '&&&&&' before it attempts to store it? What if I am entering data that has nothing whatsoever to do with HTML or XML? What relevance is that kind of 'sanitisation' of input?
Mapping of special XML characters to entities is a function of the output (including when displayed as pre-filled form items). It is only relevent when that kind of mapping is important to the output, such as when creating XML or HTML. It has no place being done in the path between the user and the database.
I believe this is a very important issue, because I have seen other projects flounder when they do not follow the simple rule of keeping XML mapping out of the path from the user to the database. It is just asking for double-encoding to happen in various output points, and then you are in serious trouble because your user-entered data has effectively been corrupted.