sanitation? - Githubissues

thezoggy commented 11 years ago

notice that there is a \t and leading spaces for one of the aliases:

            {
                "aliasnames": [
                    "\t  Shipwrecked: Battle of the Islands", 
                    "Shipwrecked: The Island"
                ], 
                "firstaired": "2007-01-20", 
                "language": "en", 
                "lid": 7, 
                "seriesname": "Shipwrecked", 
                "network": "Channel 4", 
                "overview": "This is a British reality contest show with a twist. Instead of trying to eliminate others, the contestants try to get people to choose to hang out with them on \"their island.\" Ten people are split between two islands. One group is called Tigers, the others Sharks. Each week, a new contestant arrives and spends 3 days with the Tigers, and then 3 days with the Sharks. On the seventh day, the groups all get together and have a \"beach party\" during which the new arrival must decide who they want to stay with for the remainder of the show. Whichever island has more people at the end of 5 months wins \u00a370,000.", 
                "seriesid": "80347", 
                "id": 80347
            },

the xml doesn't contain the aliases: http://thetvdb.com/api/9DAF49C96CBF8DAC/series/80347/all/en.xml

-- added a post to the tvdb forums about this, maybe they can do their part to prevent this from happening in the future: http://forums.thetvdb.com/viewtopic.php?f=17&t=15279

leading space in one of the aliases,

            {
                "aliasnames": [
                    "Fight Ippatsu! Juuden-chan!!", 
                    "Fight Ippatsu! J\u016bden-chan!!", 
                    " Fight, One Shot! Charger Girls!!"
                ], 
                "firstaired": "2009-06-25", 
                "language": "en", 
                "lid": 7, 
                "seriesname": "Charger Girl Ju-den Chan", 
                "overview": "From a planet called \"Life Core\", which exists parallel to the normal human world, females known as \"J\u016bden-chan\" (charger girls) are patrolling the human world in search for individuals who feel depressed and unlucky. These people are ranked from A to F, F being normal and A being near suicidal. When the J\u016bden-chan find targets ranked C or higher, they charge these people up with the help of electricity in order to improve their mental states. Whilst normally unseen by human eyes, one of these J\u016bden-chan, Plug Cryostat, accidentally meets a young man who is able to see her, because she was targeting his father (his sister in the anime). This series revolves around the various antics between the main characters and the quest for this J\u016bden-chan to improve herself.", 
                "seriesid": "103291", 
                "id": 103291
            },

this one has random \n in the overview,

            {
                "language": "en", 
                "lid": 7, 
                "seriesname": "Conspiracies", 
                "overview": "Sky One aims to sort reality from rumour with four more intriguing cases in the new series of Conspiracies. Danny Wallace is back seeking answers to some disturbing questions. Did the CIA play geopolitics with lives at Lockerbie? Is the government conducting a top-secret alien programme? Did the Nazis invade England\u2019s green and pleasant land? And was the FBI responsible for mass murder at Waco? \nTravelling the globe on a definitive search for the truth, Danny talks to those who claim to be in the know. Each one hour episode dissects a different conspiracy: The Alien Evidence, MI5 Nazi Invasion, Carnage at Waco, Lockerbie and the CIA. \n\nConspiracies challenges the culture of control and secrecy of our governments as Danny Wallace goes in search of the answers to these alarming questions. Cover up or cock-up? This illuminating series aims to find out. \n \n \n", 
                "seriesid": "81568", 
                "id": 81568
            },

xml that shows the \n are still there: http://thetvdb.com/api/9DAF49C96CBF8DAC/series/81568/all/en.xml

internally I've done something like:

            cshow["overview"] = cshow["overview"].encode("UTF-8", "ignore").replace('  ', ' ').strip(' \t\n\r').replace('\n', '')

which cleans things up a bit.. I've seen show overviews in the past contain html tags as well - but that's a whole other battle :(

I know there are plenty who probably don't want anything to be modified from what is on the tvdb.. for others I'm sure it would be nice to see .strip(' \t\n\r') done on each element (at least the aliases/overview, which seem to be the main source of problems) - the problem with strip is that it only handles leading/trailing characters, so a \n in the middle of the string wouldn't be touched :(
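
To illustrate that limitation, here is a small sketch using a made-up overview string (not real TheTVDB data): the padding at both ends goes away, but the newlines between the two sentences survive.

```python
# Hypothetical overview with padding at both ends and newlines in the middle.
overview = " \tFirst paragraph.\n\nSecond paragraph. \n \n"

# strip() only removes the listed characters from the two ends of the string.
stripped = overview.strip(" \t\n\r")
print(repr(stripped))
```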

thezoggy commented 11 years ago

so on this one, if you just stripped the \n in the middle you'd end up with ...t them.Aired who... so maybe replace \n with a space first, then replace double spaces with a single one. This would normalize it to '...t them. Aired who...'.

            {
                "firstaired": "2002-09-01", 
                "imdb_id": "tt0320969", 
                "language": "en", 
                "lid": 7, 
                "seriesname": "Stargate: Infinity", 
                "network": "FOX", 
                "overview": "The animated action/adventure Stargate Infinity continues the saga of the men and women of Stargate Command as they travel the universe using the extraordinary powers of the mysterious Stargate portals. Stargate Infinity is the story of veteran Stargate explorer Major Gus Bonner and a group of young Air Force Academy cadets. Wrongly accused of treason, they must flee across the universe, pursued by a ferocious new alien enemy, the Tlak'khan - mercenaries working for the Nax'kan Council. The team must find a way to clear Gus' name and to protect the mysterious Draga - a strange alien being who may be the key to unlocking the ultimate secrets of the Stargate and of the Ancients who built them.\n\nAired as education TV, most episodes come complete with a nifty kids' message!", 
                "seriesid": "70852", 
                "id": 70852
            }
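
One way to sketch the "replace \n with a space, then collapse double spaces" idea is a single regex pass. The helper name `normalize_whitespace` is made up for this example and is not part of tvdb_api; the sample string is a shortened stand-in for the overview above.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space
    and drop whatever whitespace is left at the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Shortened stand-in for an overview with newlines in the middle and at the end.
sample = "...t them.\n\nAired as education TV... \n \n"
cleaned = normalize_whitespace(sample)
print(cleaned)

# The same helper also handles the tab/leading-space alias case.
alias = normalize_whitespace("\t  Shipwrecked: Battle of the Islands")
print(alias)
```

Note that this flattens intentional paragraph breaks too, which may or may not be wanted for overviews.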

Also, how would you recommend going about filtering out the 'duplicate' items when searching? (usually they are * duplicate ### * but the pattern has changed several times over the years, as has who is adding them) for example, search for: Leonard, one of the entries is:

            {
                "language": "en", 
                "lid": 7, 
                "seriesname": "*** duplicate 132411 *** Leonard", 
                "seriesid": "250628", 
                "id": 250628
            }

I would guess we would have to inspect each seriesname and if there is a match for duplicate \d+ then to delete it? If you edit the dict as you're looping through it, it'll give you an error: RuntimeError: dictionary changed size during iteration

maybe an option to hide/remove duplicates from the output?
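
A rough sketch of that filtering, assuming the search results are a list of dicts shaped like the entries above. The regex is only based on the "*** duplicate 132411 ***" example, so other marker formats would slip through; `drop_duplicates` and the second entry are made up for illustration. Building a new list sidesteps the "changed size during iteration" error, since nothing is deleted while looping.

```python
import re

# Assumed pattern, derived only from the "*** duplicate 132411 ***" example.
DUPLICATE_RE = re.compile(r"\bduplicate\s+\d+", re.IGNORECASE)

def drop_duplicates(results):
    """Return a new list without entries whose seriesname is marked as a duplicate."""
    return [r for r in results
            if not DUPLICATE_RE.search(r.get("seriesname", ""))]

results = [
    {"seriesname": "*** duplicate 132411 *** Leonard", "id": 250628},
    {"seriesname": "Some Other Show", "id": 1},  # hypothetical entry
]
filtered = drop_duplicates(results)
print(filtered)
```

For a dict keyed by id, the same idea works with a dict comprehension over `d.items()` instead of deleting keys inside the loop.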

thezoggy commented 10 years ago

also as a side note, found this quirk: http://forums.thetvdb.com/viewtopic.php?f=17&t=15329

thezoggy commented 10 years ago

example of duplicate shows: http://www.thetvdb.com/?string=duplicate&searchseriesid=&tab=listseries&function=Search

looks like there are different formats.. so prob more trouble than its worth to try and hide these?

dbr commented 10 years ago

> example of duplicate shows: http://www.thetvdb.com/?string=duplicate&searchseriesid=&tab=listseries&function=Search

Those seem like they should just be deleted by the TVDB moderators..

> notice that there is a \t and leading spaces for one of the aliases

This I'm slightly torn about..

On the one hand, there's no sane reason to have tabs/newlines/etc in the data, so stripping them out would be harmless

..but on the other hand, it seems wrong to do such cleanup in the API client. Surely TheTVDB should be more stringent in what it accepts? Would be better for all users to clean the data at the source, rather than fixing things in one client

I'm siding towards the latter option, particularly because earlier today I fixed one series which had several \x19 bytes in an episode summary, something which theTVDB should reject (and definitely not output, since it causes an invalid XML file..)
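
For reference, if a client did want to guard against bytes like \x19 despite the reservations above, a minimal sketch could drop the C0 control characters that XML 1.0 does not allow. The helper name is hypothetical, and this is not something the thread says tvdb_api does.

```python
import re

# C0 control characters that XML 1.0 forbids. Tab (\x09), newline (\x0a)
# and carriage return (\x0d) are legal in XML and are left alone.
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_xml_chars(text):
    """Remove control characters that would make an XML 1.0 document invalid."""
    return INVALID_XML_CHARS.sub("", text)

print(repr(strip_invalid_xml_chars("summary\x19 with a stray byte\n")))
```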

thezoggy commented 10 years ago

yeah, I've asked the tvdb gods about disallowing html in the overview section (since it causes malformed xml). I've also asked them about sanitizing the showname input, since the unicode apostrophe isn't supposed to be used ( http://forums.thetvdb.com/viewtopic.php?f=18&t=15310&p=57742 ): it just messes up searches and causes duplicate shows, since people end up adding both.

As much as I hate touching the data to mask tvdb flaws.. it would be nice to be able to do some of it if you wanted. The overview issue.. is easy enough to go fix. The series aliases, however, require a mod to fix, since who can modify them is restricted.

I reached out to them about the duplicates here: http://forums.thetvdb.com/viewtopic.php?f=7&t=15398&p=57693 The issues with the aliases I had them fix: http://forums.thetvdb.com/viewtopic.php?f=18&t=15303 I wouldn't have known anything was wrong with them until I used the api... since they are somewhat hidden and really end up only being used internally during the search routine.

Also just fyi, http://forums.thetvdb.com/viewtopic.php?f=17&t=15329

When searching for 'Betrayal' the show 'Betrayal!' is showing up first, instead of the exact show name.

Per tvdb: Because punctuation is stripped in order to get matches. So as far as the search is concerned the two are exactly the same. It might be possible to adjust this in Sphinx but as we're hoping to change the search entirely it's unlikely anyone is going to try adjusting the current method.
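
Since the server treats the two names as identical, one client-side workaround would be to re-order the results so that an exact name match comes first. This is only a sketch: `exact_matches_first` is a made-up helper, and the result dicts are trimmed to the one key it needs.

```python
def exact_matches_first(results, query):
    """Stable sort that moves entries whose seriesname equals the query
    (ignoring case) to the front and keeps the original order otherwise."""
    wanted = query.casefold()
    # False sorts before True, so exact matches end up first.
    return sorted(results,
                  key=lambda r: r.get("seriesname", "").casefold() != wanted)

results = [{"seriesname": "Betrayal!"}, {"seriesname": "Betrayal"}]
ordered = exact_matches_first(results, "Betrayal")
print([r["seriesname"] for r in ordered])
```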

so just to recap,

show names that are duplicates can just be handled by the user, since the api has no flag/easy way to filter them. It has to be done via the name, which could pose issues..
aliases that are messed up - escalate to tvdb mods.. could trim the input, but 99.9% of the time it's not needed.. so it's just wasting cpu cycles
overview - would be nice to have an option that either tries to clean this up or leaves it untouched, since honestly this is the most likely area to have issues (which partly isn't really the user's fault, since people copy and paste / wanna make it look nice for the webui / no sanitation is done on the field)

thezoggy commented 10 years ago

dbr, i'd like to talk to you sometime about how sickbeard uses the api to see if we should be doing some things differently.

Also, you recently added the aliasnames, which show up fine during the .search() routine, but if you just show the t[####].data they aren't there.

dbr / tvdb_api

sanitation? #35