EchterAlsFake / PHUB

A lightweight API for Pornhub
https://phub.rtfd.io
GNU General Public License v3.0
67 stars, 22 forks

Query data structure flaw #43

Closed Egsagon closed 7 months ago

Egsagon commented 8 months ago

Hey

So with 4.3.2 these properties are available on all video objects: watched, is_free_premium and preview (and some more but they are not related to the issue).

Normally, any property comes from one of these sources:

However, the properties I mentioned above are not available, or are too complicated to get, from any of these sources. Instead, they only come from a query page. For instance, the watched property comes from the "watched" tag you can see on a video thumbnail before clicking on it. Obviously, we can't get this information anywhere other than a query (unless we decide to iterate over the entire account history).

If the video comes from a query

It is fine, the properties found on the query page are already stored and can be parsed on demand.

If the video does not come from a query (e.g. from Client.get)

PHUB will "simulate" a query. This is what the video._as_query property does (unless overridden by a query). It creates a temporary playlist on the client account, adds the video to it and creates a new Query object from the playlist page.

Apart from the fact that this method is really not optimized (it costs 4 requests just to find out whether one video has been watched), it also does not work: it uses the video page token, which means it needs to fetch the video page, which in turn makes the video count as already watched, so the whole playlist procedure is useless.
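To make the flaw concrete, here is a toy model of the simulated query. It is only a sketch: FakeSite and all of its methods are hypothetical stand-ins, not PHUB or site APIs, and the split into exactly these four requests is an assumption made to match the count given above.

```python
class FakeSite:
    """Toy stand-in for the remote site (hypothetical, not PHUB code)."""

    def __init__(self):
        self.requests = 0
        self.history = set()  # keys of videos the account has watched

    def fetch_video_page(self, key):
        # Assumption: loading a video page adds it to the watch history.
        self.requests += 1
        self.history.add(key)
        return f"token-for-{key}"

    def create_playlist(self, token):
        self.requests += 1
        return []

    def add_to_playlist(self, playlist, key, token):
        self.requests += 1
        playlist.append(key)

    def fetch_playlist_page(self, playlist):
        # Query-style page: each entry carries the "watched" marker.
        self.requests += 1
        return [{"key": k, "watched": k in self.history} for k in playlist]


def watched_via_simulated_query(site, key):
    """Mimic the simulated query: token, playlist, add, read back."""
    token = site.fetch_video_page(key)  # needed for the token, but marks the video
    playlist = site.create_playlist(token)
    site.add_to_playlist(playlist, key, token)
    page = site.fetch_playlist_page(playlist)
    return page[0]["watched"]


site = FakeSite()
result = watched_via_simulated_query(site, "abc123")
print(result, site.requests)
```

In this model the video was never watched before the call, yet the answer is always True, because getting the token is what marks it as watched.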

This bug also affects other query-based properties, even though for some of them we can find a workaround (e.g. for the preview property, there might be a way to reconstruct the source URL from the video key/id and other information).

It might be possible to use a different token (like the query token) but this would require each video to store their queries.

What to do

You decide where these query properties should be implemented. Some of them (like watched and is_free_premium) are most likely to be used while iterating a query.

For example, as of now you can do:

for video in query.sample(filter = lambda vid: not vid.watched and vid.is_free_premium): ...

But another possible implementation could be:

for video in query.sample(watched = False, free_premium = True, ...): ...

This second implementation might look less user-friendly because it is less OOP in style, but it would make sure that these properties are safe to use. It would also make it possible to avoid wrapping each video in a Video object, as an optimization, since the query data we want is directly accessed from a regex (consts.re.eval_video).
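A rough sketch of what the keyword-based filtering could look like. Everything here is illustrative: the page markup, the EVAL_VIDEO pattern (a stand-in for consts.re.eval_video, whose real shape is not shown in this issue), the Video class and the sample signature are all assumptions, not the actual PHUB implementation.

```python
import re

# Hypothetical pattern standing in for consts.re.eval_video.
EVAL_VIDEO = re.compile(r'<li data-key="(?P<key>\w+)"(?P<flags>[^>]*)>')

# Hypothetical query page: three thumbnails with optional markers.
PAGE = (
    '<li data-key="a1" watched premium>'
    '<li data-key="b2" premium>'
    '<li data-key="c3">'
)


class Video:
    """Minimal stand-in for a video object built from query data."""

    def __init__(self, key, watched, is_free_premium):
        self.key = key
        self.watched = watched
        self.is_free_premium = is_free_premium


def sample(page, watched=None, free_premium=None):
    """Yield videos from a query page; None means "do not filter on this"."""
    for match in EVAL_VIDEO.finditer(page):
        flags = match.group("flags").split()
        is_watched = "watched" in flags
        is_premium = "premium" in flags
        # Filter on the raw query data before building any Video object.
        if watched is not None and is_watched != watched:
            continue
        if free_premium is not None and is_premium != free_premium:
            continue
        yield Video(match.group("key"), is_watched, is_premium)


keys = [video.key for video in sample(PAGE, watched=False, free_premium=True)]
print(keys)
```

Since the filter runs on the regex match, rejected entries never get wrapped in an object, and the values never depend on a simulated query.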

Have fun

Egsagon commented 7 months ago

Solved with 0cac5de516c9502405728278c980522f609e7bd1. I decided to implement the second solution in the sample method while still keeping the properties on the video object.

EchterAlsFake commented 7 months ago

Thanks a lot :)