TeamNewPipe / NewPipeExtractor

NewPipe's core library for extracting data from streaming sites
GNU General Public License v3.0

Change Regex and Selector Strings to Data/Language (Live) Parsing #77

Open FlorianSteenbuck opened 6 years ago

FlorianSteenbuck commented 6 years ago

The Problem

In this project, most data extraction is done the quick and dirty way (as always on the web), using regexes and selector strings.

Theoretical

The problem is that regexes and selector strings were not made for extracting data from source code. Selector strings come from CSS and jQuery and were mostly designed for styling static websites (CSS) and for interacting with the XML structure of the HTML (jQuery/bouncer). Regexes go further in the direction of data extraction. But neither has high-level functions built in (join, trim, split), and implementing those at a low level is hard and usually costs resources. Regexes are also not made for language parsing, which makes them hard to use on the web.

Practical

This is used for parsing the client_id of the SoundCloud app from its JavaScript code: ,client_id:"(.*?)" This regex is problematic because it depends on a name, on a writing norm for names (the naming convention), and on a language norm for defining variables (the syntax).
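To make the brittleness concrete, here is a minimal Java sketch that runs the quoted regex against a few ways the same assignment could be written. The class name, the sample snippets, and the value abc123 are made up for illustration; they are not taken from SoundCloud or from this repo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrittleClientIdDemo {
    // The pattern quoted above: leading comma, exact key name,
    // no whitespace allowed, double quotes only.
    static final Pattern STRICT = Pattern.compile(",client_id:\"(.*?)\"");

    /** Returns the captured value, or null when the pattern does not match. */
    static String extract(String js) {
        Matcher m = STRICT.matcher(js);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "{a:1,client_id:\"abc123\"}",   // shorthand object literal: matches
            "{a:1, client_id: \"abc123\"}", // one space after the comma: no match
            "{a:1,clientId:\"abc123\"}",    // camelCase key: no match
            "var client_id=\"abc123\";",    // plain assignment: no match
            "{a:1,client_id:'abc123'}"      // single quotes: no match
        };
        for (String s : samples) {
            System.out.println(s + " -> " + extract(s));
        }
    }
}
```

Only the first sample yields a value; the other four return null even though a human reader would see the same assignment in each.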

The name is less problematic than the other factors, because at the moment most web projects are not completely minified. It is still a problem, though, because complete minification is possible and can change the names of key, ID, and endpoint variables.

The writing norm is much the same game as the name, but on top of that it is not fully standardized on the web, because developers come from different backgrounds: some from Python, writing in Python's snake_case style (PEP 8); some from Java, with camelCase; plus pure JavaScript/ECMAScript developers, and others who do not really follow any norm.

The language norm is the most problematic one. JavaScript/ECMAScript is one of the most complex languages in existence, because so many versions have been published without breaking with older norms. In this case the problem lies with the set/init operator (:) and the split/separator operator (,). Different characters can be used for both, because they can appear in different contexts. This regex depends strongly on the shorthand object definition ({key:value,key:{...}, ...}) and does not even allow for whitespace around the tokens, so a single space is enough to defeat it. Data can also be assigned using = together with the , ; and . characters and a leading let, var, or this:

[var |let |this.|x.]key=value,key=value,...
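A looser pattern can tolerate the variants listed above. The following Java sketch is one possible way to do that, not what the extractor currently does: it accepts : or = as the assignment operator, optional whitespace, either quote style, and an optionally quoted key. The class and method names are hypothetical, and the sample value is a placeholder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TolerantKeyRegex {
    /**
     * Finds the first string literal assigned to `key`, or null if none.
     * Still a regex: escaped quotes inside the value, comments, and
     * minified (renamed) keys are not handled.
     */
    static String find(String js, String key) {
        Pattern p = Pattern.compile(
            "(?<![A-Za-z0-9_$])"        // key must not be the tail of a longer identifier
            + "[\"']?" + Pattern.quote(key) + "[\"']?" // optionally quoted key
            + "\\s*[:=]\\s*"            // ':' or '=' with optional whitespace
            + "([\"'])(.*?)\\1");       // value in matching single or double quotes
        Matcher m = p.matcher(js);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(find("{a:1, client_id : 'abc123'}", "client_id"));
        System.out.println(find("this.client_id=\"abc123\";", "client_id"));
    }
}
```

This only widens what the regex accepts. It still depends on the key name and cannot tell code from comments, which is the argument for real parsing made in the next section.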

Solution

Language Norm Problem

Parse the languages and data formats, either live from the stream or from the completely downloaded data, using whichever of the following solutions fits the individual case. :smile:

The best solution would be a whole new project or a web service that provides the IDs, keys, endpoints, and other data needed to use the private and public APIs of the services.
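As a rough idea of what parsing "live from the stream" could look like, here is a minimal Java sketch of a character-level scanner that reads from a Reader and stops at the first string literal assigned to a given identifier. It is a toy tokenizer under stated assumptions, not a JavaScript parser: comments, template literals, regex literals, and quoted keys are not handled, and all names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class StreamingKeyScanner {
    /** Reads one char at a time; returns the first string assigned to `key`, or null. */
    static String findStringValue(Reader in, String key) throws IOException {
        StringBuilder ident = new StringBuilder(); // identifier currently being read
        String lastIdent = null;                   // most recent complete identifier
        boolean sawAssign = false;                 // ':' or '=' seen right after lastIdent
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (Character.isJavaIdentifierPart(ch)) { ident.append(ch); continue; }
            if (ident.length() > 0) {              // an identifier just ended
                lastIdent = ident.toString();
                ident.setLength(0);
                sawAssign = false;
            }
            if (Character.isWhitespace(ch)) continue;
            if (ch == ':' || ch == '=') {
                if (sawAssign) { lastIdent = null; sawAssign = false; } // "==" is a comparison
                else sawAssign = lastIdent != null;
            } else if (ch == '"' || ch == '\'') {
                String literal = readString(in, ch); // strings are consumed whole
                if (sawAssign && key.equals(lastIdent)) return literal;
                lastIdent = null; sawAssign = false;
            } else {
                lastIdent = null; sawAssign = false; // any other token breaks the sequence
            }
        }
        return null;
    }

    /** Consumes up to the closing quote, keeping escaped characters as-is. */
    private static String readString(Reader in, char quote) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != quote) {
            if (c == '\\') { c = in.read(); if (c == -1) break; }
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Convenience wrapper for already downloaded source. */
    static String findInString(String js, String key) {
        try { return findStringValue(new StringReader(js), key); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

Because string literals are consumed as a unit, text such as client_id:'nope' inside another string is skipped, and the scanner can stop reading as soon as the value is found instead of downloading the whole script.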

Name Problem

Here you can apply regexes or programmatic rules to the extracted value to get a correctly formatted value, and then test it by trial and error. You may need to filter the values, because someone could attack this by simply spamming valid-looking values into the source code.
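The idea above can be sketched in Java roughly like this: collect every string literal regardless of its variable name, keep those that match an expected shape, and try a bounded number of them. The shape used in the usage note (eight hex characters) and the probe are assumptions for illustration only; a real probe would be something like a test request against the service.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CandidateProbe {
    // Any quoted literal without embedded quotes or backslashes.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("([\"'])([^\"'\\\\]*)\\1");

    /** Distinct string literals matching `shape`, in order of appearance. */
    static List<String> candidates(String js, Pattern shape) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = STRING_LITERAL.matcher(js);
        while (m.find()) {
            String value = m.group(2);
            if (shape.matcher(value).matches()) seen.add(value);
        }
        return new ArrayList<>(seen);
    }

    /**
     * Trial and error: returns the first candidate accepted by `probe`.
     * `maxTries` caps the work if the source is flooded with decoys.
     */
    static Optional<String> firstWorking(List<String> candidates,
                                         Predicate<String> probe,
                                         int maxTries) {
        return candidates.stream().limit(maxTries).filter(probe).findFirst();
    }
}
```

For example, candidates("a=\"deadbeef\",b='0badf00d'", Pattern.compile("[0-9a-f]{8}")) yields both values, and firstWorking then returns whichever one the probe accepts first. The cap is only a partial defense against spammed values; it limits the cost of an attack but does not prevent it.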

Writing Norm Problem

Parse the different norms over and over again. Of course this can be attacked easily, but with word parsing an attack becomes nearly impossible. Maybe create a service where app users can submit new words to the word database.
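One way to read "word parsing" is to split every identifier into its words, so that different writing norms compare as equal. The Java sketch below assumes that reading; the class and method names are hypothetical, and the word database itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NameNormalizer {
    /** Splits an identifier into lower-case words, whatever norm it follows. */
    static List<String> words(String identifier) {
        String spaced = identifier
            // lower-case or digit followed by upper-case: "clientId" -> "client Id"
            .replaceAll("([a-z0-9])([A-Z])", "$1 $2")
            // acronym followed by a capitalized word: "XMLHttp" -> "XML Http"
            .replaceAll("([A-Z]+)([A-Z][a-z])", "$1 $2");
        List<String> out = new ArrayList<>();
        for (String w : spaced.split("[\\s_\\-]+")) { // also split on '_' and '-'
            if (!w.isEmpty()) out.add(w.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** True when both identifiers consist of the same words in the same order. */
    static boolean sameName(String a, String b) {
        return words(a).equals(words(b));
    }
}
```

With this, clientId, client_id, CLIENT_ID, and client-id all normalize to the words client and id, so a lookup against a word list no longer depends on the norm the site's developers happened to use. It does nothing against minified names, which is why the name problem needs its own handling.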

theScrabi commented 6 years ago

Thanks for your submission. Seems you did quite some work here.

I am aware of the problems we have with regex, therefore I've tried to avoid them as much as possible. However you are right, there are some places where this could be prevented. So as always help is highly welcome ;D

Although for the YouTube part I can say that the project works fine for what it is intended to do.

> The best solution would be a whole new project or a web service that provides the IDs, keys, endpoints, and other data needed to use the private and public APIs of the services.

Actually, I thought about making it possible to run the parser on a server somewhere with a JSON API. However, we are not quite there yet. But here too: help is always welcome.

ghost commented 6 years ago

@theScrabi: Wouldn't that cost us more and wouldn't that be illegal (because of copyright and stuff)?

FlorianSteenbuck commented 6 years ago

@wb9688 Yes, it costs us more if we want to create an API server (but it would be kind of awesome).

Extracting is possible under any license; it only becomes problematic if they include a do-not-copy notice or license in the script/data. Even that can still be legal, but some legal systems may take a different view and say that we agreed to something when we visited the website, treat the website as a whole instead of as pieces, or say that we are leaking something that needs to stay private. Anyway, we already provide this data in a way, through the NewPipe app in the device's memory, and we also provide routines to get it. So it is more or less the same thing as in this repo. I only suggest separating it from this project. I don't want to leak Amazon buckets :smiley:

I am not a lawyer

Maybe you have more thoughts on this, or know of legal action that has been taken against developers of APIs or public credential routines. Let me know.

theScrabi commented 6 years ago

> @theScrabi: Wouldn't that cost us more and wouldn't that be illegal

Depends. It does not mean we have to host it; we could make something like that self-hosted to begin with. However, I'm not interested in offering an illegal service!