Andre-Joosten / mp-onlinevideos2

Automatically exported from code.google.com/p/mp-onlinevideos2

Searching on non-UTF (here windows-1251/cyrillic) sites returns no results #105

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
> What steps will reproduce the problem?
1. Use a Russian keyboard layout, or pick the characters from charmap, to type a Cyrillic word, for example: клон (all four letters are on the same line).
2. Add "Kino-Dom.tv (TV Series)" to the OV site list (or the other example attached below) and try to search in Russian.
3. No results are returned. Try a search in English (for example "poirot") and it works.

> What is the expected output? What do you see instead?
Try:
http://filmix.net/index.php?story=%EF%F3%E0%F0%EE&do=search&subaction=search
or
http://kino-dom.tv/index.php?story=%EA%EB%EE%ED&do=search&subaction=search
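
For readers who want to see what these two URLs actually search for, here is a minimal Python sketch (not part of the original report) that decodes the `story` values, assuming the escaped bytes are windows-1251, as the report states further down. The variable names are illustrative only.

```python
from urllib.parse import unquote

# Query values copied verbatim from the two example URLs above.
filmix_story = "%EF%F3%E0%F0%EE"
kinodom_story = "%EA%EB%EE%ED"

# Interpret the percent-escaped bytes as windows-1251 (Python codec "cp1251").
filmix_term = unquote(filmix_story, encoding="cp1251")
kinodom_term = unquote(kinodom_story, encoding="cp1251")

print(filmix_term)   # пуаро
print(kinodom_term)  # клон
```

Each Cyrillic letter occupies exactly one escaped byte here, which is the form these sites expect.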

> What version of MediaPortal are you using? On what operating system?
latest, 1.1.3 / OnlineVideos 0.30 / Windows XP Pro SP3 (English or Russian  - 
same result).
> What skin?
Blue3

> Please provide any additional information below.
First of all, I'm not sure whether this is an OnlineVideos issue, but I cannot check whether it also affects other plugins, because of the circumstances mentioned above.
The problem is the encoding used by most Russian websites, windows-1251:
http://filmix.net/engine/opensearch.php
http://kino-dom.tv/engine/opensearch.php
In any browser the search works, thanks to character escaping. But it seems MP transforms Cyrillic characters in some different manner?
Could you please check what the problem could be?
Thank you.

==========
    <Site name="Filmix" util="GenericSite" agecheck="false" enabled="true" lang="ru">
      <Configuration>
        <item key="dynamicCategoriesRegEx"><![CDATA[<li><a\shref="(?<url>\/[a-z0-9]+)">(?<title>[^<]*)<\/a><\/li>]]></item>
        <item key="dynamicCategoryUrlFormatString"><![CDATA[{0}]]></item>
        <item key="videoListRegEx"><![CDATA[<a\shref=['"](?<VideoUrl>[^"']+\/(?<id>\d+)-[^"']+)['"][^>]*>(?:<h\d>)?\s*(?<Title>[^<]+)(?:<\/h\d>\s*)?<\/a>(?:.+?id=['"]news-id-\k<id>['"])[^>]*>.+?<a\shref=['"](?<ImageUrl>[^"']+\.(?:jpg|gif|png))['"][^>]*>(?:\s*<img[^>]+>\s*</a><\!\-\-TEnd\-\->\s*(?:<div[^>]*>)?(?<Description>.+?)\s*<a)?]]></item>
        <item key="videoListRegExFormatString"><![CDATA[{0}]]></item>
        <item key="nextPageRegEx"><![CDATA[class=['"]pages["']>\s*.+?<a\s+href=["'](?<url>[^"]+/page/\d+)[^\d<]+</a>]]></item>
        <item key="nextPageRegExUrlFormatString"><![CDATA[{0}]]></item>
        <item key="prevPageRegEx"><![CDATA[class=['"]pages["']>\s*<a\s+href=["'](?<url>[^"]+/page/\d+)]]></item>
        <item key="prevPageRegExUrlFormatString"><![CDATA[{0}]]></item>
        <item key="playlistUrlRegEx"><![CDATA[(?:file=(?<url>[^\&"]*?\.xml)\&(?:.+)|sharer?\.php\?ur?l?=(?<url>[^'"]+)['"](?:.+))]]></item>
        <item key="fileUrlRegEx"><![CDATA[(?:file=(?<m0>[^\&"]*?\.flv)\&|<title>(?<n0>[^<]+)<\/title>\s*(?:<creator>[^\(<]+(?:(?=\()(?:(?<n1>[^\)]+)\)<)|(?<n1><))\/creator>\s*)?<location>(?<m0>[^<]+)<\/location>\s*<\/track>)]]></item>
        <item key="searchUrl"><![CDATA[http://filmix.net/index.php?do=search&subaction=search&result_num=50&story={0}]]></item>
        <item key="searchPostString" />
        <item key="baseUrl"><![CDATA[http://filmix.net/]]></item>
      </Configuration>
      <Categories />
    </Site>

Original issue reported on code.google.com by maxbal...@gmail.com on 28 Jul 2011 at 1:58

GoogleCodeExporter commented 8 years ago
Are you actually able to enter Cyrillic characters in MediaPortal? I set my Windows text input to a Russian keyboard, and that works on the website you mentioned. But when I try that in the MediaPortal search box, I cannot for the life of me enter any Cyrillic character; all kinds of other Latin letters and numbers appear :(. Is that possible on your end?

Original comment by bborgsd...@gmail.com on 31 Jul 2011 at 12:10

GoogleCodeExporter commented 8 years ago
Hm. On my main MP PC I have replaced "C:\Program Files\Team MediaPortal\MediaPortal\plugins\Windows\Dialogs.dll" with the attached one, so that I can enter Cyrillic characters using a remote. But to be sure the issue is not caused by that 3rd-party file, I checked with the stock Dialogs.dll as well, and yes, it is possible to use a regular keyboard to enter Cyrillic characters in the search box.

Original comment by maxbal...@gmail.com on 31 Jul 2011 at 1:24

Attachments:

GoogleCodeExporter commented 8 years ago
I have no way to enter Cyrillic text in MediaPortal :( How are you doing it?
Attached is a screenshot of what happens when I try (with my text input set to Russian, which works fine in Notepad).

Original comment by bborgsd...@gmail.com on 1 Aug 2011 at 12:22

Attachments:

GoogleCodeExporter commented 8 years ago
Hello. Wow, you have even changed the whole MP interface language! I appreciate your efforts! Have you tried replacing Dialogs.dll with the one I attached to my previous comment?
After doing so, navigate to the "ACCENTS" button and you should be able to switch to a virtual Cyrillic keyboard instead of the extra Latin symbols. Does it work?

Original comment by maxbal...@gmail.com on 1 Aug 2011 at 12:40

GoogleCodeExporter commented 8 years ago
What version of MP does that Dialogs.dll run with? I have 1.2 beta and SVN on 
my dev PC, where it won't work :(
And using the original MP1.2 beta, selecting ACCENTS, I only get french style 
letters ;)

Original comment by bborgsd...@gmail.com on 1 Aug 2011 at 1:01

GoogleCodeExporter commented 8 years ago
I'm currently using it with 1.1.3 on Windows XP. I took that file from team-mediaportal.RU more than a year ago, so it is most likely unofficial.
I also found the original thread:
http://www.forum.team-mediaportal.ru/index.php/topic,445.0.html
Incompatibility with 1.2 is also stated there. Please try the attached one for 1.2.0alpha. They also recommend replacing C:\Program Files\Team MediaPortal\MediaPortal\Core.dll when on 1.2.0alpha:
http://www.forum.team-mediaportal.ru/index.php/topic,445.msg15358.html#msg15358
That file is attached here as well.

Original comment by maxbal...@gmail.com on 1 Aug 2011 at 1:31

Attachments:

GoogleCodeExporter commented 8 years ago
Can you test the attached dll? It should now URL-escape those Cyrillic letters before trying to retrieve the data.

Original comment by bborgsd...@gmail.com on 8 Aug 2011 at 3:17

Attachments:

GoogleCodeExporter commented 8 years ago
Sorry, I do not see any difference: both dlls (this one and the one installed with ver. 0.31) send URLs in the same fashion, like:
[x]ttp://filmix.net/index.php?do=search&subaction=search&result_num=10&story=%D0%BC%D0%B0%D0%BC%D0%B0
(UTF-8 -> URL-encoded)
whereas the server accepts Cyrillic input as (CP1251 -> URL-encoded), like:
[x]ttp://filmix.net/index.php?do=search&subaction=search&result_num=10&story=%EC%E0%EC%E0
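
The difference between the two forms can be reproduced outside MediaPortal. The following Python sketch is only an illustration, not code from the plugin: it percent-encodes the same search term once from its UTF-8 bytes and once from its windows-1251 bytes.

```python
from urllib.parse import quote

term = "мама"

# Percent-encode the UTF-8 bytes: two bytes per Cyrillic letter.
utf8_escaped = quote(term.encode("utf-8"))

# Percent-encode the windows-1251 bytes: one byte per Cyrillic letter.
cp1251_escaped = quote(term.encode("cp1251"))

print(utf8_escaped)    # %D0%BC%D0%B0%D0%BC%D0%B0
print(cp1251_escaped)  # %EC%E0%EC%E0
```

The first string matches what the dll sends, the second what the server accepts.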

I put the dll into C:\Program Files\Team MediaPortal\MediaPortal\plugins\Windows\OnlineVideos - is that correct?

Original comment by maxbal...@gmail.com on 8 Aug 2011 at 3:58

GoogleCodeExporter commented 8 years ago
I see what you mean, but I cannot find a method in the .net framework to URL 
encode my cyrillic string in c# to those escape sequences. I always get the 
other ones that are twice as long. It must have something to do with unicode 
encoding. I'll try to research further but let me know if you find a method 
that can convert a c# string with cyrillic unicode letters to those codes.

Original comment by bborgsd...@gmail.com on 8 Aug 2011 at 9:30

GoogleCodeExporter commented 8 years ago
I found a way:
searchstring = System.Web.HttpUtility.UrlEncode(System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(searchstring))

For this to work you will need to write your own siteutil (simply inherit from 
GenericSiteUtil) and override the search function to convert the search string 
as written above. Can you do that?
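
As a language-neutral illustration of the conversion described above, here is a rough Python sketch. It is not the OnlineVideos API: `build_search_url` and its `encoding` parameter are hypothetical names, and the template is the `searchUrl` value from the site config posted in the report. The sketch uses the codec name "windows-1251" explicitly, because what a bare name like "Cyrillic" resolves to can differ between platforms (in Python, for instance, it is an alias for ISO-8859-5).

```python
from urllib.parse import quote_plus

# searchUrl template from the Filmix site config in the original report.
SEARCH_URL = ("http://filmix.net/index.php?do=search&subaction=search"
              "&result_num=50&story={0}")


def build_search_url(template: str, query: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: convert the query to bytes in the site's
    encoding first, then percent-encode those bytes into the template."""
    escaped = quote_plus(query.encode(encoding))
    return template.format(escaped)


default_url = build_search_url(SEARCH_URL, "мама")
cyrillic_url = build_search_url(SEARCH_URL, "мама", encoding="windows-1251")

print(default_url)
print(cyrillic_url)
```

The key point is the order of operations: the string is converted to bytes in the target encoding before escaping, rather than escaping the string directly.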

I'll check whether I can add another parameter, "encoding", to the GenericSite, which is taken into account for all web requests and for search. That would fix your other issue as well. All you'd need to do then would be to set the encoding to "Cyrillic".

Original comment by bborgsd...@gmail.com on 8 Aug 2011 at 9:43

GoogleCodeExporter commented 8 years ago
Can you try this dll? In addition now you need to set overrideEncoding to 
Cyrillic in the site's advanced configuration.

Original comment by bborgsd...@gmail.com on 8 Aug 2011 at 10:50

Attachments:

GoogleCodeExporter commented 8 years ago
Thanks for this great work. Search now works perfectly with "overrideEncoding" set to "Windows-1251".
But a small issue appears with this dll, even for sites without the overrideEncoding setting; please have a look at the attached screenshot. The problem is NOT consistent, and so far I cannot say what pattern makes it appear.

Original comment by maxbal...@gmail.com on 9 Aug 2011 at 10:15

Attachments:

GoogleCodeExporter commented 8 years ago
What exactly IS the problem in that screenshot? If you mean that there are 11 
pages of choices - they come from a regex that matched that often? It did not 
before? I have made other changes lately about hosterurl resolving but I doubt 
that would be the problem?

Original comment by bborgsd...@gmail.com on 9 Aug 2011 at 11:31

GoogleCodeExporter commented 8 years ago
Ahh, yes, forgot to clarify :-) The problem is encoding again. On the screenshot above it looks like the wrong UTF/CP1251 encoding we faced a couple of comments earlier.
It should look like the attached picture. The number of choices is OK, as this is a TV series.
It is still not clear to me when exactly the problem appears: sometimes only for sites with the "overrideEncoding" setting, but sometimes for any of them. Sometimes right from MP launch, sometimes later.
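
For illustration only, this Python sketch shows how text gets garbled when bytes are decoded with the wrong one of these two encodings. Whether this is the actual cause of the screenshot above is not established in this thread; the sketch only demonstrates the general effect.

```python
title = "клон"

# A windows-1251 page delivers these bytes for the title.
raw_cp1251 = title.encode("cp1251")

# Decoding them with the matching encoding restores the text.
decoded_ok = raw_cp1251.decode("cp1251")

# Decoding them as UTF-8 fails; with errors="replace" the invalid
# bytes turn into U+FFFD replacement characters.
decoded_as_utf8 = raw_cp1251.decode("utf-8", errors="replace")

# The opposite mistake: UTF-8 bytes read as windows-1251 yield two
# unrelated characters for every original Cyrillic letter.
utf8_read_as_cp1251 = title.encode("utf-8").decode("cp1251")

print(decoded_ok)
print(decoded_as_utf8)
print(utf8_read_as_cp1251)
```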

Original comment by maxbal...@gmail.com on 9 Aug 2011 at 12:13

Attachments:

GoogleCodeExporter commented 8 years ago
Just to make it clear... The subject of this issue seems to be resolved:
- Cyrillic search on the mentioned win-1251 websites now DOES return results,
- those results seem to be relevant to the keyword,
- the encoding of search results (matches of videoListRegEx, right?) is correct, no issues.
The "new" issue IS NOT related to searching; it sometimes appears in the pop-up window with the fileUrlRegEx results, both after searching and when just browsing categories/video lists.

Original comment by maxbal...@gmail.com on 9 Aug 2011 at 12:20

GoogleCodeExporter commented 8 years ago
This issue was closed by revision r1487.

Original comment by bborgsd...@gmail.com on 9 Aug 2011 at 2:25