general feed_data parsing issues

evcordeiro commented 13 years ago

I noticed an issue specifically on Tumblr posts: somewhere in the parse we are converting non-alphabetic characters (apostrophes and quotes come out as numbers like 8217). For example:

avid is the kid8217s English name He laughs every time I try to pronounce his real name but he can8217t say mine either And besides he8217s the one killing me off on a regular basis At first it was teacher die After weeks of hard work though he8217s grasped that teacher dies The 8216s8217 David remember the s Recen

ghost commented 13 years ago

The previous bug that caused the OAuth error with Tumblr was that the post content contained some HTML tags and characters that Tumblr did not recognize. So in index.php I added: $information['content'] = preg_replace("/[^a-zA-Z0-9\s]/", "", strip_tags(stripHTML($sitemap->entry[$count]->content))); to strip out what Tumblr didn't accept. It should just need to be tweaked to fix this bug.
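The digits in the garbled sample look like what is left of numeric HTML entities such as &#8217; (a right single quote) once the pattern above deletes the `&`, `#` and `;` around them. A minimal sketch, assuming the feed content carries such entities: the sample string and the helper name cleanContent() are made up for illustration and are not part of index.php. It strips tags first and then decodes entities, instead of deleting every non-alphanumeric character.

```php
<?php
// The pattern from index.php, applied to a sample string with numeric entities.
$raw = 'David is the kid&#8217;s English name. He can&#8217;t say mine.';
$stripped = preg_replace("/[^a-zA-Z0-9\s]/", "", $raw);
// The '&', '#' and ';' are removed, so only the digits 8217 survive.

// Hypothetical alternative: remove tags, then turn entities into real
// UTF-8 characters rather than deleting their punctuation.
function cleanContent(string $html): string
{
    $text = strip_tags($html);
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$clean = cleanContent('<p>' . $raw . '</p>');

echo $stripped, "\n";
echo $clean, "\n";
```

Whether Tumblr accepts the decoded characters is a separate question; this only shows where the stray digits come from.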

evcordeiro commented 13 years ago

I noticed that. What exactly was the error Tumblr was giving us? I looked at the Tumblr API and it seems that 'body' should be able to accept most characters, since it can take HTML. What about playing with the 'format' variable?

Regardless, I think it would be better to move the regex and strip_tags stuff into the plugin code. That way each plugin can do what it needs to with the tags, for instance parsing tags to include pictures in Facebook and Tumblr posts.
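One way to picture moving the cleanup into the plugins is sketched here. Everything in it is hypothetical: the FeedPlugin interface, the two classes and the formatContent() method are illustrative names, not the project's actual plugin code, and the assumption that Tumblr can take the HTML unchanged comes from the API reading mentioned above.

```php
<?php
// Hypothetical sketch: each plugin decides what to do with the raw content.
interface FeedPlugin
{
    // Receives the untouched entry content, returns what the API should get.
    public function formatContent(string $rawContent): string;
}

class TumblrPlugin implements FeedPlugin
{
    public function formatContent(string $rawContent): string
    {
        // Assumption: 'body' accepts HTML, so the markup is passed through.
        return $rawContent;
    }
}

class PlainTextPlugin implements FeedPlugin
{
    public function formatContent(string $rawContent): string
    {
        // A text-only service gets tags removed and entities decoded.
        $text = strip_tags($rawContent);
        return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
}

$raw = '<p>It&#8217;s here<img src="http://example.com/a.png"></p>';
$tumblrBody = (new TumblrPlugin())->formatContent($raw);
$plainBody  = (new PlainTextPlugin())->formatContent($raw);
```

With this split, index.php would no longer need to know which characters any one service rejects.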

evcordeiro commented 13 years ago

To bring this thread up to date: function parseFeed() in index.php is where this issue begins.

The parse begins here:

$xmlstr = file_get_contents($query['urlid']);
$sitemap = simplexml_load_string($xmlstr);

It then makes its way to the postToAPI() function in each plugin (/plugins/sno_*.php).

There is a lot of experimental mucking about in the Tumblr plugin, but here's a bit of it:

//echo "unmod content:<br>";
//echo "<pre>" . $information['content'] . "</pre>";
//echo "<br><br>strip_tags html_entity_decode<br><br>";

//$cont = htmlentities($information['content'], ENT_QUOTES | ENT_IGNORE );
//echo $cont;
//echo (strip_tags(html_entity_decode($cont, ENT_QUOTES)));
/*
echo (strip_tags(html_entity_decode($information['content'], ENT_NOQUOTES, 'ISO-8859-1')));
echo (strip_tags(html_entity_decode($information['content'], ENT_QUOTES, 'ISO-8859-15')));
echo (strip_tags(html_entity_decode($information['content'], ENT_COMPAT, 'UTF-8')));
*/

My thoughts:

Starting from the top is best: each plugin should be passed an -unmodified- feed. By unmodified I mean converted to an arbitrary standard format, but with no data (such as tags) removed. Non-standard stuff (esp. quotes) needs to be handled at this level.
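A rough sketch of what "unmodified" could look like at the parseFeed() level, assuming an Atom-style feed. The inline sample feed and the $entries array layout are assumptions made for illustration; the real code reads the feed with file_get_contents() as shown earlier.

```php
<?php
// Inline sample so the sketch runs without a network request.
$xmlstr = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;The kid&amp;#8217;s name&lt;/p&gt;</content>
  </entry>
</feed>
XML;

$sitemap = simplexml_load_string($xmlstr);

$entries = [];
foreach ($sitemap->entry as $entry) {
    // Casting to string resolves only the XML-level escaping; the HTML tags
    // and HTML entities inside the content are left for the plugins.
    $entries[] = [
        'title'   => (string) $entry->title,
        'content' => (string) $entry->content,
    ];
}

echo $entries[0]['content'], "\n";
```

Each plugin would then receive the same content string and decide for itself whether to keep, strip or decode what is in it.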

There is a lot of stuff we can use. I looked briefly at PHP's xml_parser_create() and the xml_set_* functions, which might be a good place to start. Ideally preg_replace should not be used at all at this level to delete non-standard characters, but if we do need it, bug report it.
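For reference, a small sketch of the xml_parser_create() route, using PHP's event-based XML parser with xml_set_element_handler() and xml_set_character_data_handler(). The sample XML and the variable names are assumptions for illustration; it only collects the text of each content element.

```php
<?php
// Made-up sample input; a real feed would come from file_get_contents().
$xmlstr = '<feed><entry><content>Teacher&#8217;s name</content></entry></feed>';

$inContent = false;   // true while the parser is inside a <content> element
$contents  = [];      // collected text of each <content> element

$parser = xml_parser_create('UTF-8');
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$inContent, &$contents) {
        if ($name === 'content') {
            $inContent  = true;
            $contents[] = '';
        }
    },
    function ($parser, $name) use (&$inContent) {
        if ($name === 'content') {
            $inContent = false;
        }
    }
);

xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inContent, &$contents) {
        // Character data may arrive in several chunks, so append.
        if ($inContent) {
            $contents[count($contents) - 1] .= $data;
        }
    }
);

$ok = xml_parse($parser, $xmlstr, true);
```

Here the numeric character reference is resolved by the XML parser itself, so the quote arrives as a real UTF-8 character and nothing needs to be deleted with preg_replace.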

evcordeiro / SNOctopus

general feed_data parsing issues #24