TeamNewPipe / NewPipe

A libre lightweight streaming front-end for Android.
https://newpipe.net
GNU General Public License v3.0
30.27k stars 2.98k forks source link

[Feature request] Add locale selection in settings #68

Closed gregarkhipov closed 8 years ago

gregarkhipov commented 8 years ago

Since YouTube deliberately assings language based on user's location, I always get russian videos, even while search query is in pure English. It really sucks, because YouTube decides for me what I should watch.

But there is a way to fix it in desktop, by changing site version to 'Worldwide' in the site footer.

It would be great to feature this in the app. As far as I know, a language is set as an url parameter, so you don't even need cookies for this to work.

theScrabi commented 8 years ago

All right, i'll keep that in mind :)

medavox commented 8 years ago

I'll try implementing this tomorrow, using the URL parameter chschtsch pointed out. theScrabi, this will probably mean adding a field to YoutubeSearchEngine (probably in the Uri.Builder at the start of search(String, int):Result). Not sure yet if the StreamingService interface will need an update.

medavox commented 8 years ago

OK, I've implemented the country selection setting as suggested; n the Settings menu, there is a new option: "Content Country". When this is any option other than null, the following is added to the search query:

&gl=<code>&persist_gl=1

where <code> is a standard 2-letter country code, currently stored in values/settings_keys.xml

The only problem is, I can't find a search string which produces noticeably different results for different country codes.

Any ideas?

theScrabi commented 8 years ago

Hm, shekhar12020 is doing changes on that part of the code. You need to ask him if it's OK to modify it. Anyway please add a language parameter to that function. Like this: Result search(String query, String lang, int serviceId);

You also have to add this parameter in the SearchEngine interface.

theScrabi commented 8 years ago

@medavox I noticed that it doesn't matter witch country code you add to the url. It will always show the same results.

medavox commented 8 years ago

Which class does that search() method signature exist to? In SearchEngine I have the version before my edits Result search(String query, int page), and this after: Result search(String query, int page, String contentCountry)

I don't have Result search(String query, String lang, int serviceId) in my copy of the project, and I'm up-to-date with the repo!

and OK, I will ask him.

medavox commented 8 years ago

@theScrabi Yes I've noticed this too.

theScrabi commented 8 years ago

Oh I'm sorry, I just said you should add such a parameter. It does not matter where you put it at. :)

medavox commented 8 years ago

OK I will add int serviceId; but what about int page?

medavox commented 8 years ago

You know, if this commit has no value, we can always revert it. If this method doesn't work, then there's no point keeping it. Using the gl=<code> method on YouTube in a desktop browser DOES change the frontpage videos, but nothing else that I can see.

gregarkhipov commented 8 years ago

Welp, I just played with languages on YouTube and it seems that changing only country code will affect only main YouTube page, but searching videos will work the same with any country selected. (Just as @medavox pointed it out)

I just realized, that my browser is in English, and therefore it requests English content in http headers every time. So in desktop I actually get Russian page of Youtube, but with English user interface.

But since the app might use simplified methods to make http requests, when using the app, I get a page like if both values was set to RU. (Although, my Android is in English as well) So, to make my idea work, we need to either implement requested language selection alongside with country code, or use system language. Or also, we could do the latter, while keeping in mind to add manual language selection of the app (other than system language) later on. (I guess, this decision affects structure of code, and it's always good to make things extendable). Hope you got what i mean?

BTW, guys, good work anyway!

gregarkhipov commented 8 years ago

So now I'm trying to find out, how to set language rather than country. It is a bit more complex and hl= won't work (YouTube uses ajax and cookies there).

Now I see that country code is not needed for my goal, but if I'm not mistaken, I saw out there some enhancement request to fill the app's start page with one from YouTube. Then, if we'll need this feature (country selection) anyway, let it just be?

gregarkhipov commented 8 years ago

Check this out! When I request a search page using wget or curl, by default I get page in Russian with Russian results (because of my ip address), but if I set --header 'Accept-Language: en', then I get page in English, also with proper English results. It works! The only thing we need is this http header in the request.

medavox commented 8 years ago

It sounds like you've found a method that works! Now we just need to translate this into Java.

So far, it looks like we'll need to add the header to the HTTP request, which will be set on the HttpURLConnection in Downloader.download(), using java.net.URLConnection.setRequestProperty(String, String).

I'll take another look tomorrow, although I'll have to be more careful not to upset everyone else's working sets. I don't want to collide with anyone else's changes.

Unless you'd like to try this yourself @chschtsch ?

gregarkhipov commented 8 years ago

Yes, why not. I'll give it a shot.

theScrabi commented 8 years ago

@medavox I'm sorry (again). I wasn't aware that the integer in the search() function was actually the page parameter. I thought it was the serviceId. search() does not need to know its own id, please don't add that parameter. Result search(String query, int page, String contentCountry) is actually the way I wanted it to be, and you already did it that way xD.

medavox commented 8 years ago

@theScrabi Oh good :) I'm glad. @chschtsch, let me know how you get on. One reason I didn't implement last night (midnight, local time) was because yesterday getting the countryCode (or now language code, I suppose) from the preferences system to the SearchEngine class was not as easy as I'd thought, due to SearchEngine not having access to a Context (to my knowledge) which it could pass to PreferenceManager.getDefaultSharedPreferences(Context) to get the SharedPreferences object.

Obviously I ended up just passing the string itself from higher up the call stack (in VideoItemListFragment.run(), where I did have access to a Context,

theScrabi commented 8 years ago

Yes, please try to keep as much android related stuff as possible away from the extractors, and try to abstract such things. Because it will ensure that we could easily port extractors over to other platforms, if it was necessary one day. So the way medavox implemented it (through a parameter in the search() function) is the right one.

gregarkhipov commented 8 years ago

I tried manually adding this only line con.setRequestProperty("Accept-Language", "en"); to Downloader.download and it worked.

But I probably don't yet understand architecture of the app enough to create a new setting for language code (not touching the existing countryCode setting, as I suggested earlier) and to bind it up with the rest of code.

I don't know about country code, but I think that language code parameter should be assigned right in Downloader class, because this is a low-level stuff and preferred language should be sent in http headers every time the app makes a request. Just as it works in a web browser.

But talking about country code, it could be set in Downloader class as well, since every page have set of url parameters. So it could be like siteUrl + "gl=" and we'll end up always getting a valid address. But it's a bad way to do that and there should be another solution.

What do you think?

theScrabi commented 8 years ago

It seems like shekhar12020 has finished his work on SearchEngine and YoutubeSearchEngine: https://github.com/theScrabi/NewPipe/pull/73

@chschtsch you could add a function Donwload.donwload, with the String contentCountry parameter in addition. So we can call this function from YoutubeSearchEngine.search().

gregarkhipov commented 8 years ago

@theScrabi, sorry, I understand what you mean, but I still don't quite understand how it all works. To do that, I need some time to learn.

If anybody wants to add language selection before me, this may be useful gist:b0fd4e8b45ad663d1c8a (languages I scraped from YouTube and formatted to XML)

theScrabi commented 8 years ago

Also i'm being silly again. Do not add a language parameter to Download.downlaod(), since this function simply loads a page requested by the url given as its parameter. If something should handle language than it is SearchEngine.search().

medavox commented 8 years ago

@theScrabi you're right, a language parameter is outside the scope of Downloader.download(), because it only downloads arbitrary data from a given URL. However, in order to set the HTTP header field Accept-Language, access to the HttpURLConnection object which is instantiated and used in Downloader.download() is needed.

So we need to do something differently:

  • Create a new version of download() with a different method signature? (my preferred choice)
    • eg download(String siteUrl, String languageCode)
  • Refactor Downloader so that the HTTPURLConnection object is externally available?
  • something else...?

You're right about keeping Android and YouTube stuff self-contained. Even if porting to another platform doesn't happen, Android APIs seem to change every year.

medavox commented 8 years ago

OK, I've tweaked the existing code, and (as far as I can tell) you can now search YouTube in different languages!

@chschtsch, I used your technique for adding the accepted languages field to the HTTP header. @theScrabi, I created another version of Download with a different method signature: Downloader.download(String, String) : String. Common code between the normal and the language-parameter versions of download() has been moved to a private method, dl().

gregarkhipov commented 8 years ago

Nicely done, mate! It works as it should.

But is it possible to set language globally? And what do you think about this idea? I just upgraded regexes for extracting date and views count, but it's working only with English, since every language has it's own formatting.

English could be used by default for pages where date and views count are extracted. Then it could be converted to date and int objects respectively and then formated again based on user device's language.

theScrabi commented 8 years ago

@chschtsch I recognized that regex for date didn't work that proper. So i changed it here. May still not be working.

gregarkhipov commented 8 years ago

Yeah, that's the point, you made it work with German format :)

theScrabi commented 8 years ago

Hm. At first i thought about putting the regex pattern into a language specific string file. But i don't think that's not such a good solution. What do you think?

gregarkhipov commented 8 years ago

As I mentioned, English could be used by default for extracting. This way, you always parse page in English language with English-specific pattern, and then you transform it to date object (I think there is some function in Java for that). And then you transform this date object to string based on device language with another Java function. Same with views (which also vary in fomat over different languages).

medavox commented 8 years ago

Although it would require downloading a video page up to 2 times (once for the english page, and for the preferred locale), I personally think that solution would be better than language-specific regexes, which would be horrible to maintain (if/when YouTube changes the page designs somehow).

That's just my opinion though. @chschtsch and thanks :) but what do you mean by set language globally? Default it to the device default setting? Because I had considered that.

gregarkhipov commented 8 years ago

Here is what I have in mind.

English is always default language for downloading pages. Thus, it's global for requests (but not for presenting the content). When page is downloaded, it's parsed with English-specific regexes. This way, we always get the same result, no matter in which country a user lives, or which language he preferred.

Then, all language/country-specific items like date are converted to objects. This is done to be able to convert them to user's country/language system he set as preferred.

Then these items are actually converted to language of user's system. So this way, you don't even need to download page two times, but only one time, because all preferred locale content is generated in the app.


Here is some example to make it clear: First of all, the app downloads page. Then it scrapes content:

Title: Justin Bieber - Baby ft. Ludacris Channel name: JustinBieberVEVO Likes: 3,498,222 Dislikes: 5,259,228 Views: 1,234,539,113 views Date: Uploaded on Feb 19, 2010 Description: Music video by Justin Bieber performing Baby feat. Ludacris.

VEVOCertified on April 25, 2010. http://www.youtube.com/vevocertified

Then the app transforms it to another locale. Let's say, German. In most cases, title, channel name and description stay unchanged. But date and all numbers have different format. So at the first stage the app transforms it to objects (int for numbers and date for date) and we have these unformatted:

Likes: 3498222 Dislikes: 5259228 Views: 1234539113 Date: Date(2010 02 19)

And here comes another Java function for formatting. For date I think this would be something like DateFormat df = DateFormat.getDateInstance(DateFormat.LONG, Locale.GERMANY);:

Likes: 3.498.222 Dislikes: 5.259.228 Views: 1.234.539.113 Date: 19.02.2010

And then 'views' and 'published on' in German are added:

Views: 1.234.539.113 Aufrufe Date: Veröffentlicht am 19.02.2010

And here we go, with single page downloading we have completely formatted content for any locale. This method works for any YouTube page, and later on, when the app will have other services like Vimeo, it won't need changes, since English is de-facto default (if not the only) language for all major online services. That's why I stick so much to using English by default.

theScrabi commented 8 years ago

@chschtsch I,m not sure but I think I've seen, that the date is also delivered as json value inside the player arguments. And i think its an integer value rather than a string. If so it would not be a problem to use this instead.

gregarkhipov commented 8 years ago

Oh, I see. It would be great, if you could extract these from player meta tags rather than parsing other, preformatted div tags in box below the video. Although, functions for formatting these values into user's locale would be needed then anyway.

gregarkhipov commented 8 years ago

Now, I could implement extracting and formatting, if you would show me how to access locale setting from YoutubeExtractor class.

medavox commented 8 years ago

Currently working on extracting video data from meta tags and json player arguments. The thing slowing me down is converting this info to locale-specific strings.

So far, I've got the meta field interactionCount (which can also be extracted from the player arg JSON field view_count), and datePublished. Both of these need converting to more human-friendly, locale-specific values (adding commas/dots/spaces etc).

Should I commit these changes so you can all see them?

medavox commented 8 years ago

Also API-wise, I think it would be a good idea to split up Extractor.getVideoInfo() into a set of smaller interface methods, such as getVideoTitle() and getDescription(), so that implementing support for a new streaming service is more guided. Implementers would just implement the individual extraction methods for each VideoInfo field.

That way, it's harder to forget to initialise a VideoInfo field when implementing new streaming services. It also allows each method to use fall-back scraping techniques, and abstract this away into a called method. For instance, as said above, the view count on a Youtube video can be extracted from either the <meta> tag, or the view_count field in the JSON player args. If one fails, the other technique could be silently used instead.

Any thoughts?

gregarkhipov commented 8 years ago

Concerning converting these values, check out the folowing StackOverflow answers: http://stackoverflow.com/a/1951875 http://stackoverflow.com/a/4375470

medavox commented 8 years ago

Thanks for these links; I was in the process of writing a long clarifying question to you, when I realised my question was answered in the Locale doc :)

I'll implement this locale-specific formatting for View Count and Publication Date (the two fields I've changed already), then push these changes to origin.

gregarkhipov commented 8 years ago

Alright! Could you also do this for likes/dislikes?

medavox commented 8 years ago

Certainly; Right now, I'm just deciding where to put the functionality for converting Youtube language codes to java.util.Locales.

medavox commented 8 years ago

OK, I've implemented locale-specific formatting of views, likes and dislikes, and the video publish date;

but it looks kind of weird to have the search language locale number formatting next to the device language strings, eg "123.456 Views".

Should we display the interface strings in the chosen language, or format the numbers and dates as the device language?

"123,456 Views" or "123.456 Aufrufe"? or even keep "123.456 Views"?

@theScrabi ?

gregarkhipov commented 8 years ago

I built the latest version and everything except likes works perfectly! So we can close the issue now. About likes, it just needs another way to parse them. I'm already trying to solve this. Log: 11-12 11:12:24.659 2755-10791/org.schabi.newpipe E/class org.schabi.newpipe.youtube.YoutubeExtractor: failed to parse likesString "3 502 129" and dislikesString "" as integers The problem here, as we can see, is that likes are not in player meta tags, so they are preformatted for my country once again.

gregarkhipov commented 8 years ago

Problem solved. It just should be replaceAll("[^\d]", "").

theScrabi commented 8 years ago

@medavox about supporting other services. My intention was that VideoItem detail is filled with invalid values at the beginning. Than each individual service could fill in the information it can provide, and leave out the rest. The UI should later on recognize that, and only display the given information.

I don't know if tha's a good way. Your suggestion however would guarantee that programmers implementing a service know what they should provide, and what they can't.

medavox commented 8 years ago

@chschtsch thanks for the fix! The way I implemented this, I thought that the English page would always be downloaded, but apparently not. I couldn't test that, though. That is a more robust replace regex :)

@theScrabi I think that methods in an interface would make the requirements for implementing support for a new site more explicit. It could also help to split up issues with scraping into smaller sub-problems.

Now that this issue is fixed (hooray!), shall we move this refactoring discussion to a new issue?

theScrabi commented 8 years ago

Finally :) good work.