PAhelper / PhoenixAdult.bundle

Plex Meta-Data Agent for scene videos from multiple adult sites

Metadata API #236

Open mrwoofa opened 5 years ago

mrwoofa commented 5 years ago

I found the metadata returned from this plugin to be quite good, but it's lacking in a lot of places. Plus it can take a very long time to find a single release.

So I'm working on a metadata database, somewhat like TVDB/TMDB that Plex uses already.

If anyone would like to lend a hand, reach out to me :) Hopefully once I'm done filling up the database, we could supplement this plugin with the API.

The scraping is done very similarly to this plugin, using XPath on the search pages, but all data is then stored in a database which allows fulltext searching.

It's working really well so far, just typing in [Sitename] [date] or [sitename] [title] brings up a result in 100ms with 99% accuracy.
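
To give a feel for the approach, here's a minimal Python sketch of the scrape-then-index idea. The markup, XPath expressions and schema are invented for illustration; it's just to show how XPath scraping feeds a fulltext index:

```python
import sqlite3

import requests
from lxml import html

db = sqlite3.connect("scenes.db")
# An FTS5 virtual table gives us the fast "[sitename] [date|title]" lookups
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS scenes USING fts5(site, title, date, url)")

def scrape_search_page(site, url):
    """Scrape one listing page with XPath and index every scene found."""
    tree = html.fromstring(requests.get(url, timeout=30).content)
    for node in tree.xpath("//div[@class='scene']"):  # hypothetical markup
        title = node.xpath("string(.//a[@class='title'])").strip()
        date = node.xpath("string(.//span[@class='date'])").strip()
        link = node.xpath("string(.//a[@class='title']/@href)")
        db.execute("INSERT INTO scenes VALUES (?, ?, ?, ?)", (site, title, date, link))
    db.commit()

def search(query):
    """Fulltext lookup, e.g. search("brazzers some scene title")."""
    return db.execute(
        "SELECT site, title, date, url FROM scenes WHERE scenes MATCH ? LIMIT 10",
        (query,),
    ).fetchall()
```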

Thanks

dxm2891 commented 5 years ago

It's an excellent idea. A single database should be created for all sites: the full scrape could run once a day, and then the Plex plugin would need very little time to search. Something similar already existed, but it was done badly: data18.com

SgtBatten commented 5 years ago

Data18 was glorious.

I can only see a database being useful if it also stores the links to images, or even the files themselves, as a lot of effort (and Plex search time) goes into getting posters.

Edit: I see you mean to use it in conjunction with something like this plugin. That could be very useful.

ghost commented 5 years ago

One could host them on unlimited Google Drive for reliable hosting.


claygoldfinch commented 5 years ago

While I agree this is a nice idea, and possibly one of the best ways for the agent to scrape data, I think the main issue we might run into here is just the legal implications of hosting this data (images in particular).

ordinarygulp commented 5 years ago

Data18 was very well done and, for quite some time, had the legal rights to all they did. I was always impressed with the number of genres they would have on a single scene.

@r0x0r316 that would not be wise. Not only is it likely against their ToS, but there is also a strong chance the host would hit the API limit rapidly, and the DB would then be unavailable.

Overall this is a good idea that I'm sure this plugin could integrate with fairly naturally. I am still a little concerned about its longevity and scale. I have metadata for something like 22k scenes, and even then I've barely scratched the collection data18 had before it stopped updating. This would need a proper host. If done without regard to the legal usage of the data, the host would need significant protection, as these studios have a LOT of money to put to legal use.

claygoldfinch commented 5 years ago

I think the best option, in terms of both legality and feasibility, is to scrape sitemaps when available. This has already been proposed by @d2dyno in Issue #177.

dxm2891 commented 5 years ago

But a sitemap only gets you the links to all the scenes. The best option is to get every scene via sitemaps and collect all the information on a web host that allows adult content. But a web host has a price.

claygoldfinch commented 5 years ago

Depends on the site. Check out the babes sitemap: https://www.babes.com/scenes_sitemap1.xml

dxm2891 commented 5 years ago

The problem with these sitemaps is that unfortunately they are fragmented. We need to find the index sitemap from which the various branches lead. It's certainly better than the ones I've seen, but it would still be advisable to first collect the links, download the content into a host or database, and then process the whole thing. At most, if we don't want to store images or anything else, we can build an API that only serves links to the images already hosted on the sites.
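
To illustrate, here's a minimal Python sketch of walking a sitemap index down to the individual scene URLs, assuming the standard sitemaps.org XML format. The index URL below is a made-up placeholder; the babes sitemap linked above is one concrete child sitemap:

```python
import requests
from xml.etree import ElementTree

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def scene_urls(url):
    """Yield every page URL reachable from a sitemap or sitemap index."""
    root = ElementTree.fromstring(requests.get(url, timeout=30).content)
    if root.tag.endswith("sitemapindex"):
        # An index: recurse into each child sitemap it points at
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            yield from scene_urls(loc.text)
    else:
        # A plain urlset: these are the actual scene pages
        for loc in root.findall("sm:url/sm:loc", NS):
            yield loc.text

# e.g. for u in scene_urls("https://www.example.com/sitemap_index.xml"): print(u)
```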

dxm2891 commented 5 years ago

Creating a database is something I already wanted to do 4 years ago, and every time I gave up, which I regret now. Over the coming months I will run some tests. Anyone who wants to join as a programmer can tell me and we'll build something together. I think I'll write everything in PHP.

dxm2891 commented 5 years ago

For now I have created the Brazzers crawler, and it fetches 50 scenes in only 1 minute. For those who want a preview: http://theadultdb.com/ (login for preview access only: user: addb, pass: $Yd984nmjd!)

mrwoofa commented 5 years ago

Hey @dxm2891

Did you want to collab? My database has actually come a long way: I've got 200 sites and 40,000 scenes in it.

I've also got it scraping the images and resizing them for Plex, but I'm not sure about the legality of this, so I might end up just serving the image URLs instead of actually saving them.
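
The resizing step itself is simple; a minimal Pillow sketch of the idea (the 1000x1500 target is just an example 2:3 poster size, not something fixed, and the real code differs):

```python
from io import BytesIO

import requests
from PIL import Image

def fetch_poster(url, out_path, max_size=(1000, 1500)):
    """Download a poster and shrink it to a Plex-friendly size in one pass."""
    img = Image.open(BytesIO(requests.get(url, timeout=30).content))
    img.thumbnail(max_size)  # preserves aspect ratio, never upscales
    img.convert("RGB").save(out_path, "JPEG", quality=90)
```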

I don't really have a nice frontend like you do yet; I'm just focusing on the API and database currently.

dxm2891 commented 5 years ago

@mrwoofa

With pleasure, as soon as I have something more concrete. First I want to try crawling 5 sites and retrieving the stars and genres to see if everything works well.

My goal is to create the API; the design you see is one I used for another site and simply adapted to check that everything works.

Once the APIs are created, we will agree with @PAhelper , @claygoldfinch and the other collaborators on adding the sites that have no search, or whose search takes a long time. The only difficult thing will be handling many simultaneous requests.

If I'm allowed to post my email here (I don't know if it's against the rules), we can exchange details on our two projects and see if we can work together.

As for images and trailers, I save the link directly in the database for speed, but I also save the images as a backup and keep them hidden behind an HTTP 403 response.
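
The 403 gate is roughly this idea, sketched here in Python with Flask (the key check and storage path are placeholders for whatever the PHP backend actually does):

```python
from flask import Flask, abort, request, send_from_directory

app = Flask(__name__)
VALID_KEYS = {"example-key"}  # placeholder; real keys would live in the database

@app.route("/images/<path:filename>")
def serve_image(filename):
    # Without a valid key, the stored copy stays hidden behind a 403
    if request.args.get("key") not in VALID_KEYS:
        abort(403)
    return send_from_directory("image_store", filename)
```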

mrwoofa commented 5 years ago

@dxm2891

No problem. I've created quite a few scrapers in my system by now, so we could potentially share our scrapers, since both projects are in PHP.

I'm currently working on a way to scrape sites that sit behind Cloudflare, which blocks scraping. Not sure if I should continue down this route or allow certain videos to be crowdsourced.
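
For reference, one common workaround is the third-party cloudscraper package, which solves Cloudflare's JavaScript challenge and then behaves like a normal requests session. Whether it keeps working is very much site-dependent, and the URL below is a placeholder:

```python
import cloudscraper

scraper = cloudscraper.create_scraper()  # drop-in replacement for requests.Session
page = scraper.get("https://www.example-protected-site.com/scenes")
print(page.status_code, len(page.text))
```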

ghost commented 5 years ago

For simplicity and scale, make the API without website access, built just for this and future plugins; anything else would bring unwanted attention. Design the methods for retrieving content and searching the API with flexibility in mind: whatever works in Plex should also work in Emby, Kodi and so on. And of course use API limits and keys to avoid abuse of the service; maybe an encrypted frontend in adulthelper would even be the best option to keep prying eyes away, with requests for new frontends for other media players handled on a personal, small-scale basis. Ideally, no end user would ever notice the API other than through the huge speed increase.


mrwoofa commented 5 years ago

@r0x0r316 I think that's a great idea. That's how I was going to handle it.

I've already got the API going and am using it personally, so you could already implement it into Plex.

But I don't have the wide range of sites this plugin has YET.

I've already laid the groundwork for API keys and rate limiting, so we'll go from there.
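
For the curious, the shape of it is something like this minimal Flask sketch (the header name, limits and in-memory store are placeholders, not the real backend):

```python
import time
from collections import defaultdict

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
API_KEYS = {"example-key"}  # placeholder; keys would be issued per account
WINDOW, LIMIT = 60, 120     # example: 120 requests per 60-second sliding window
hits = defaultdict(list)    # api key -> recent request timestamps

@app.route("/api/scenes")
def scenes():
    key = request.headers.get("X-Api-Key")
    if key not in API_KEYS:
        abort(401)  # unknown or missing key
    now = time.time()
    hits[key] = [t for t in hits[key] if now - t < WINDOW]  # drop stale hits
    if len(hits[key]) >= LIMIT:
        abort(429)  # over the rate limit
    hits[key].append(now)
    return jsonify(results=[])  # search logic would go here
```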

CuddleBear92 commented 5 years ago

It might be a good idea to scrape data18 while it's still up, for the scene<>movie links they have. Otherwise it might be tricky to get the links, as other sites don't list the scenes for their movies at all. That is, if you plan to include DVDs at all. Having both makes sense for sure and gives all the users options.

SgtBatten commented 5 years ago

Are you guys collecting image sets from third-party sites if the original site does not have them?

I've had reasonable success with the PAextras script, but I have it fine-tuned for the content I want.

tritnaha commented 5 years ago

Do we have any progress on this, @mrwoofa & @dxm2891 ?

mrwoofa commented 5 years ago

I've actually got a good API running, using it personally. I also wrote a Plex agent that matches things. It needs tweaking, but it works really well.
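
The matching step boils down to something like this sketch: normalise the filename into a "[sitename] [date] [title]" query and hand it to the API's fulltext search. The endpoint path and response shape here are placeholders, not the real API:

```python
import re

import requests

API = "https://metadataapi.net/api/scenes"  # placeholder endpoint path

def match(filename):
    # "Sitename - 2019.06.26 - Some Title.mp4" -> "Sitename 2019 06 26 Some Title"
    query = re.sub(r"[._\-\[\]]+", " ", filename.rsplit(".", 1)[0]).strip()
    resp = requests.get(API, params={"q": query}, timeout=30)
    resp.raise_for_status()
    results = resp.json().get("data", [])
    return results[0] if results else None  # best fulltext hit, if any
```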

I have 360 sites, 15 scrapers, 150,000 releases and 20,000 actors in my database as of now, with more being added hourly.

I'm currently running it on a server in my garage, so I won't open it up yet. Maybe one day I can get the funds together to put it on a proper server and open it up.

I'd also like to put up a repo for my scrapers, so people can contribute!

tritnaha commented 5 years ago

@mrwoofa Very nice! It'd be awesome to get this up and running; shoot me an email at abuse@tritnaha.com and we can take it from there.

ferengi82 commented 5 years ago

A repo would be great; I'd love to look at your code.

By the way: are you the mrwoofa from beisammen.de?

mrwoofa commented 5 years ago

Would people be open to a paid API? It's going to use quite a few resources, and I'm not currently in a position to pay for them.

I was thinking of making the API around $2 per month, but allowing account sharing.

tritnaha commented 5 years ago

> Would people be open to a paid API? It's going to use quite a few resources, and I'm not currently in a position to pay for them.
>
> I was thinking of making the API around $2 per month, but allowing account sharing.

I'm OK with paying for it, and I'm also able to offer resources in terms of hosting/hardware if that's an issue.

ferengi82 commented 5 years ago

If the API works well, I might pay. But I would like it more if you made your code available on GitHub; I think most people here have a home server where they can run the API for free.

mrwoofa commented 4 years ago

Hey Guys,

Sorry it's been a while!

I've put my site live, although it's on a cheap server right now, so it'll be slow.

You can view it here: https://metadataapi.net/

You can search scenes, see which sites I support, and view performer information.

Thoughts?

mrwoofa commented 4 years ago

I'm also adding a proposal to the Stash project to use scraping definitions so that we can all share them - https://github.com/stashapp/stash/issues/244

ghost commented 4 years ago

Simply awesome, and it runs smoothly here; now all that's missing is Plex integration.


mrwoofa commented 4 years ago

Yeah hopefully we can get something going!

grabby7 commented 4 years ago

@mrwoofa, are you still looking for help with this? I see your site is up and running. I have PHP experience and would love to help add sites.

mrwoofa commented 4 years ago

I've got full Plex integration going, btw; it works quite well.

Anyone interested in the project who wants to help out, donate or just chat, join our Discord!

https://discord.gg/XpSGpaB