Closed thezoggy closed 9 years ago
so on this one, if you just stripped \n in the middle you'd end up with '...t them.Aired who...'
so maybe replace \n with a space first, then replace double spaces with a single space. This would normalize it to '...t them. Aired who...'.
{
"firstaired": "2002-09-01",
"imdb_id": "tt0320969",
"language": "en",
"lid": 7,
"seriesname": "Stargate: Infinity",
"network": "FOX",
"overview": "The animated action/adventure Stargate Infinity continues the saga of the men and women of Stargate Command as they travel the universe using the extraordinary powers of the mysterious Stargate portals. Stargate Infinity is the story of veteran Stargate explorer Major Gus Bonner and a group of young Air Force Academy cadets. Wrongly accused of treason, they must flee across the universe, pursued by a ferocious new alien enemy, the Tlak'khan - mercenaries working for the Nax'kan Council. The team must find a way to clear Gus' name and to protect the mysterious Draga - a strange alien being who may be the key to unlocking the ultimate secrets of the Stargate and of the Ancients who built them.\n\nAired as education TV, most episodes come complete with a nifty kids' message!",
"seriesid": "70852",
"id": 70852
}
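A minimal sketch of the two-step cleanup described above (newline to space, then collapse repeated spaces). The function name `normalize_overview` is made up for illustration; it is not part of tvdb_api.

```python
import re


def normalize_overview(text):
    """Hypothetical helper: flatten newlines without gluing sentences together."""
    # Step 1: turn every newline / carriage return into a space, so
    # "them.\n\nAired" does not become "them.Aired".
    text = text.replace("\r", " ").replace("\n", " ")
    # Step 2: collapse any run of two or more spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    # Finally drop leading/trailing whitespace.
    return text.strip()


sample = "who built them.\n\nAired as education TV"
print(normalize_overview(sample))
```

Doing the replace before the collapse is what keeps the space between the two sentences.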
Also, how would you recommend filtering out the 'duplicate' items when searching?
(Usually they look like *** duplicate ### ***, but the pattern, and who applies it, has changed several times over the years.) For example, search for Leonard; one of the entries is:
{
"language": "en",
"lid": 7,
"seriesname": "*** duplicate 132411 *** Leonard",
"seriesid": "250628",
"id": 250628
}
I would guess we would have to inspect each seriesname and, if it matches duplicate \d+, delete that entry?
If you edit the dict as you're looping through it, it'll give you an error: RuntimeError: dictionary changed size during iteration
maybe an option to hide/remove duplicates from the output?
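One way to avoid that RuntimeError is to build a new collection instead of deleting while looping. A rough sketch, assuming the search results are plain dicts with a `seriesname` key like the entry above; the regex only covers the `*** duplicate 132411 ***` style, the helper names are hypothetical, and the second result's id in the demo is made up:

```python
import re

# Assumed pattern: optional leading punctuation (the asterisks), the word
# "duplicate", then a numeric series id. Other formats on TheTVDB will not match.
DUPLICATE_RE = re.compile(r"^\W*duplicate\s+\d+", re.IGNORECASE)


def is_duplicate(show):
    """Return True if the show's name looks like a moderator 'duplicate' marker."""
    return bool(DUPLICATE_RE.search(show.get("seriesname", "")))


def drop_duplicates(results):
    """Filter a list of result dicts without mutating the input."""
    return [show for show in results if not is_duplicate(show)]


def drop_duplicates_from_dict(shows):
    """Same idea for a {seriesid: show} mapping: build a new dict, never delete in place."""
    return {sid: show for sid, show in shows.items() if not is_duplicate(show)}


results = [
    {"seriesname": "*** duplicate 132411 *** Leonard", "id": 250628},
    {"seriesname": "Leonard", "id": 1},  # made-up id, for illustration only
]
print([show["seriesname"] for show in drop_duplicates(results)])
```

If in-place deletion is really needed, looping over `list(shows.keys())` instead of the dict itself also sidesteps the error.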
also as a side note, found this quirk: http://forums.thetvdb.com/viewtopic.php?f=17&t=15329
example of duplicate shows: http://www.thetvdb.com/?string=duplicate&searchseriesid=&tab=listseries&function=Search
looks like there are different formats.. so prob more trouble than it's worth to try and hide these?
example of duplicate shows: http://www.thetvdb.com/?string=duplicate&searchseriesid=&tab=listseries&function=Search
Those seem like they should just be deleted by the TVDB moderators..
notice that there is a \t and leading spaces for one of the aliases
This I'm slightly torn about..
On the one hand, there's no sane reason to have tabs/newlines/etc in the data, so stripping it out would be harmless
..but on the other hand, it seems wrong to do such cleanup in the API client. Surely TheTVDB should be more stringent in what it accepts? Would be better for all users to clean the data at the source, rather than fixing things in one client
I'm siding towards the latter option, particularly because earlier today I fixed one series which had several \x19 bytes in an episode summary, something which theTVDB should reject (and definitely not output, since it causes an invalid XML file..)
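For anyone who does want to work around this locally in the meantime, here is a sketch that drops characters the XML 1.0 spec does not allow (which includes \x19). The function name is hypothetical and this is not something tvdb_api does:

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Hypothetical helper: remove characters that make an XML 1.0 document invalid."""
    return _INVALID_XML_CHARS.sub("", text)


# \x19 is removed; tab and newline are legal XML characters and are kept.
print(repr(strip_invalid_xml_chars("bad\x19byte\tok\n")))
```

This only helps on text that has already been decoded; an XML parser would still choke on the raw response, which is part of why fixing it at the source is the better option.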
yeah, I've asked the tvdb gods about disallowing html in the overview section (since it causes malformed xml). I've also asked them about sanitizing the showname input, since the unicode apostrophe isn't supposed to be used ( http://forums.thetvdb.com/viewtopic.php?f=18&t=15310&p=57742 ). In the end it just messes up searches and causes duplicate shows, since people end up adding both.
As much as I hate touching the data to mask tvdb flaws.. it would be nice to be able to do some of it if you wanted. The overview issue.. is easy enough to go fix. The series aliases, however, require a mod to fix, since editing them is restricted.
I reached out to them about the duplicates here: http://forums.thetvdb.com/viewtopic.php?f=7&t=15398&p=57693 As for the aliases, I had them fixed: http://forums.thetvdb.com/viewtopic.php?f=18&t=15303 I wouldn't have known anything was wrong with them until I used the api... since they are somewhat hidden and really end up only being used internally during the search routine.
Also just fyi, http://forums.thetvdb.com/viewtopic.php?f=17&t=15329
When searching for 'Betrayal' the show 'Betrayal!' is showing up first, instead of the exact show name.
Per tvdb:
Because punctuation is stripped in order to get matches. So as far as the search is concerned the two are exactly the same. It might be possible to adjust this in Sphinx but as we're hoping to change the search entirely it's unlikely anyone is going to try adjusting the current method.
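Since the server treats 'Betrayal' and 'Betrayal!' as the same thing, a client could re-rank the results itself. A sketch, assuming results are dicts with a `seriesname` key; `prefer_exact_match` is a hypothetical helper, not an existing tvdb_api function:

```python
def prefer_exact_match(query, results):
    """Move results whose name equals the query (case-insensitively) to the front."""
    wanted = query.casefold()
    # The key is False for exact matches and True otherwise, and False sorts first.
    # sorted() is stable, so the server's order is kept within each group.
    return sorted(
        results,
        key=lambda show: show.get("seriesname", "").casefold() != wanted,
    )


results = [{"seriesname": "Betrayal!"}, {"seriesname": "Betrayal"}]
print([show["seriesname"] for show in prefer_exact_match("Betrayal", results)])
```

This does not change which shows come back, only which one is listed first.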
so just to recap,
dbr, i'd like to talk to you sometime about how sickbeard uses the api to see if we should be doing some things differently.
Also, recently you added the aliasnames, which show up fine during the .search() routine, but if you just show the t[####].data they aren't there.
notice that there is a \t and leading spaces for one of the aliases:
the xml doesn't contain the aliases.. http://thetvdb.com/api/9DAF49C96CBF8DAC/series/80347/all/en.xml
--added post to tvdb forums about this, maybe they can do their part to prevent this from happening in the future: http://forums.thetvdb.com/viewtopic.php?f=17&t=15279
leading space in one of the aliases,
this one has random \n in the overview; the xml shows the \n are still there: http://thetvdb.com/api/9DAF49C96CBF8DAC/series/81568/all/en.xml
internally I've done something like:
which cleans up things a bit.. ive seen show overview in the past contain html tags as well - but thats a whole other battle :(
I know there are plenty that probably don't want anything to be modified from what is on the tvdb.. for others I'm sure it would be nice to see .strip(' \t\n\r') done on each element (at least the aliases/overview, which seem to be the main source of problems) - the problem with strip is that it only handles leading/trailing characters, while the \n in the middle of the string wouldn't be touched :(
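To illustrate the strip limitation and a possible opt-in cleanup limited to the problem fields, a sketch. It assumes a record is a plain dict and that aliases arrive either as a string or a list of strings under `aliasnames`; both helper names are hypothetical:

```python
def collapse_whitespace(value):
    """Collapse every run of whitespace (tabs, newlines, spaces) to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so join() gives a fully normalized string.
    return " ".join(value.split())


def clean_fields(record, fields=("overview", "aliasnames")):
    """Return a copy of the record with only the named fields cleaned up."""
    cleaned = dict(record)  # shallow copy, the caller's record is left untouched
    for name in fields:
        value = cleaned.get(name)
        if isinstance(value, str):
            cleaned[name] = collapse_whitespace(value)
        elif isinstance(value, list):
            cleaned[name] = [
                collapse_whitespace(v) if isinstance(v, str) else v for v in value
            ]
    return cleaned


# strip() alone leaves the embedded newlines where they are:
print(repr("built them.\n\nAired".strip(" \t\n\r")))
# The field-level cleanup handles both the middle and the edges:
print(clean_fields({"overview": " built them.\n\nAired ", "aliasnames": ["\t  SG Infinity"]}))
```

Keeping it opt-in and per-field means anyone who wants the data exactly as TheTVDB serves it is unaffected.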