DoctorD1501 / JAVMovieScraper

Scrape XBMC and Kodi movie metadata and automatically rename files for Japanese Adult Videos (JAV), American Adult DVDs, and American Adult Webcontent
GNU General Public License v2.0
751 stars, 161 forks

Javlibrary not working again #279

Closed Borisbee1 closed 5 years ago

Borisbee1 commented 5 years ago

Describe the bug: Looks like javlibrary isn't being scraped. Probably a Cloudflare issue again.

eckozen84 commented 5 years ago

I would like to add that I am experiencing the same issue.

mut3k1 commented 5 years ago

Yes, same issue here for a while now :(

Wizell commented 5 years ago

Cloudflare has changed the challenge a few times already. I tried many things, but the challenge changes too often and requires support for more and more browser functionality.

I'm considering some options. The first one is to add a parameter that takes the required cookies and other information. This would force the user to start a browser, export the data, and inject it into JAVMS.

Another option is for JAVMS to start Firefox with a specific profile, let it handle the challenge, and extract the information from the stored data.

Please feel free to share your opinion on the matter.

Borisbee1 commented 5 years ago

I've been playing around with it some myself. There is a working Python scraper that can get around Cloudflare: https://github.com/Anorov/cloudflare-scrape. I was toying around with possibly adding a call to run that to get the relevant info and then feed it back to JAVMS. I don't have a ton of experience with programming, so it's been slow going for me, but I was able to make a quick Python script that parses a page from javlibrary and reads out the actor names. The harder part is feeding it back into JAVMS.

Any of your options would work as well. I hadn't thought of trying to feed JAVMS the cookie manually from the browser.

Wizell commented 5 years ago

I quickly read through the code. It also solves the challenge by rebuilding the challenge's JS code, using the same technique as the one currently used in JAVMS (except that it is more up to date and uses Requests and Node.js where JAVMS uses jsoup and Nashorn).

The main shortcoming of this method is resilience: the code can't withstand many changes. In JAVMS I tried to limit this impact by starting to mimic a browser (for example, mimicking the DOM web API). Moreover, for each change the challenge code must be analysed and the solver modified.

If possible, I would also avoid adding more dependencies, as on some OSes they would make JAVMS more complex to run (on Windows, for example, you would need two more installers before being able to run JAVMS).

For those reasons, I'm considering using a real browser to solve the challenge easily.

Feeding in the cookies (and at least the user agent as well) works, but you need to copy and paste them manually or use a plugin to export them.

ideas so far:

Borisbee1 commented 5 years ago

Copying manually is probably the easiest solution that won't break down the line.

Wizell commented 5 years ago

Do you know how long the cookies from Cloudflare stay valid?

Borisbee1 commented 5 years ago

They seem valid for a good while, though I haven't timed it. Probably at least 30 minutes.

Wizell commented 5 years ago

If so, it might be a rather good solution. It would only require adding support for setting those values, plus documentation on how to extract the data from the browser.
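As a rough illustration of what "support to set those values" could look like, here is a minimal sketch that parses a cookie string pasted by the user (in the `name=value; name2=value2` form of a `Cookie` request header) into a map. The class name `PastedCookies` and the sample values are hypothetical and not taken from JAVMS; the two cookie names are the ones mentioned elsewhere in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JAVMS: parses a pasted Cookie header value.
final class PastedCookies {

    /** Parses "a=1; b=2" into an insertion-ordered map of cookie name to value. */
    static Map<String, String> parse(String pasted) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : pasted.split(";")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty fragments and fragments without a name
            }
            String name = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1).trim();
            if (!name.isEmpty()) {
                cookies.put(name, value);
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not real cookies.
        System.out.println(parse("__cfduid=abc123; cf_clearance=def456"));
    }
}
```

Accepting the whole header string keeps the user-facing step to a single copy and paste, at the cost of some tolerance for stray whitespace and trailing semicolons in the parser.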

Borisbee1 commented 5 years ago

It certainly lasts long enough to scan a set of movies so I think it's a good solution.

Wizell commented 5 years ago

I just tested and it does work as expected. I will redo the test with the same cookie set in 1 hour and tomorrow in order to have a rough estimate of how painful it will be.

The next question is how to keep the export process from being too complex. May I have your opinion on this one? Would it be better to just explain how to get the values from the browser's web developer tools, to make a WebExtension that does this, or something else?

Borisbee1 commented 5 years ago

I don't know how to get that information myself, but I do know how to look at cookies. If it's not that hard, I'd say a simple explanation should suffice. From what I've read, the browser's user agent and the cookies need to match for verification to succeed.

A WebExtension would certainly make life easier, but I don't know how much time it would take to make or how hard it would be, and it's also another thing to maintain.

I will say that manually inputting cookies and the user agent can't be any harder than what I have to do now when manually copying over actors from javlibrary.

Wizell commented 5 years ago

Would you mind trying it? You are looking for the "__cfduid" and "cf_clearance" cookies and your browser's user agent.

I must admit I do not know how to look at those without the browser's web tools.
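To make the idea concrete, here is a minimal sketch of a request that carries the two cookies and the matching user agent. It uses the JDK's `java.net.http` API (Java 11+) rather than the jsoup-based code in JAVMS, and the class name `ClearanceRequest`, the URL, and all values are placeholders chosen for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch, not JAVMS code: attaches browser-exported values to a request.
final class ClearanceRequest {

    /** Joins the two cookies named in this thread into one Cookie header value. */
    static String cookieHeader(String cfduid, String cfClearance) {
        return "__cfduid=" + cfduid + "; cf_clearance=" + cfClearance;
    }

    /** Builds a GET request that sends the cookies and the same user agent as the browser. */
    static HttpRequest build(String url, String cfduid, String cfClearance, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookieHeader(cfduid, cfClearance))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values; real ones would be copied from the browser.
        HttpRequest request = build("https://example.com/", "abc123", "def456", "ExampleAgent/1.0");
        System.out.println(request.headers().map());
    }
}
```

The request could then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Whether the site accepts it depends on the cookies still being valid and on the user agent matching the browser that obtained them.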

Borisbee1 commented 5 years ago

I plan on trying it a little later, yeah. It's going to take me some time though: I'll have to find my user agent, edit the code accordingly, and make sure everything works correctly.

Wizell commented 5 years ago

Thank you. Please report back after you have tried.

You may also try it manually with a command like curl or wget.

Wizell commented 5 years ago

My cookies have already expired, so they last less than 30 minutes.

Borisbee1 commented 5 years ago

Good news, it works for me. The cookie has also lasted about 30 minutes so far, although I did refresh the page midway through, so perhaps the cookie is refreshed if you're actively browsing. At least I can now scrape javlibrary manually when I need to by making hard-coded changes.

I will note that the browser.addCookie() function in the browserConfigure() function inside JavLibraryParsingProfile.java doesn't seem to work correctly. I had to add the cookies directly into the connection inside DitzyHeadlessBrowser.
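Since the reports in this thread disagree on how long the cookies last (less than 30 minutes in one test, about 30 minutes with a page refresh in another), a tool could at least warn the user when the pasted values are probably stale. The sketch below assumes a configurable lifetime; the class name `CookieAge` and the 30-minute figure are illustrative assumptions, not measured Cloudflare behaviour.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: tracks when cookies were captured and guesses whether they are stale.
final class CookieAge {
    private final Instant capturedAt;
    private final Duration assumedLifetime;

    CookieAge(Instant capturedAt, Duration assumedLifetime) {
        this.capturedAt = capturedAt;
        this.assumedLifetime = assumedLifetime;
    }

    /** Returns true once the assumed lifetime has fully elapsed at the given instant. */
    boolean probablyExpired(Instant now) {
        return !now.isBefore(capturedAt.plus(assumedLifetime));
    }

    public static void main(String[] args) {
        // 30 minutes is only the rough figure discussed in this thread.
        CookieAge age = new CookieAge(Instant.now(), Duration.ofMinutes(30));
        System.out.println("Probably expired: " + age.probablyExpired(Instant.now()));
    }
}
```

This is only a hint for the user; the reliable signal is still whether the site answers the request with real content or with the challenge page.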

rahadiancs commented 5 years ago

Can we use the alternative domain for it? Currently it's on c32r.com (it's changed regularly, but when it changed, the old domain redirected to the new one). I don't know if it's also using cloudflare, but if it isn't then adding a configuration dialog to enter the alternative domain might be a simpler solution.
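A configurable alternative domain could be as small as swapping the host of each URL before it is fetched. The sketch below shows that idea with the JDK's `java.net.URI`; the class name `DomainOverride` and the sample path are hypothetical, and nothing here reflects how JAVMS actually builds its URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: rewrites a URL to point at a user-configured host.
final class DomainOverride {

    /** Returns a copy of the URI with only the host replaced. */
    static URI withHost(URI original, String configuredHost) {
        try {
            return new URI(
                    original.getScheme(),
                    original.getUserInfo(),
                    configuredHost,
                    original.getPort(),
                    original.getPath(),
                    original.getQuery(),
                    original.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid host: " + configuredHost, e);
        }
    }

    public static void main(String[] args) {
        // The path and query are made-up placeholders.
        URI original = URI.create("https://www.javlibrary.com/en/page?id=1");
        System.out.println(withHost(original, "c32r.com"));
    }
}
```

Because the old domains are said to redirect to the new one, an HTTP client configured to follow redirects would keep working for a while after a domain change, even if the configured value is out of date.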

Wizell commented 5 years ago

c32r.com does not use cloudflare anti-bot protection.

But from a security point of view, I have some concerns, even if the scraper should not execute JavaScript code.

I don't understand why this website exists or what the actual links between it and javlibrary are.

What I quickly found out:

Can you tell me more about what you know of this domain? How did you find out it exists?

rahadiancs commented 5 years ago

Before c32r.com, it was d28k.com (which still works and now redirects to c32r.com). I found it a long time ago through a Google search with the keyword "[any javid] javlibrary" (because I'm too lazy to open javlibrary and then search from there); the alternative domain showed up in the results. Since then I have always used the alternative domain. My user account on javlibrary also works there, so it's either a mirror or simply an alternative domain pointing to the same server.

Wizell commented 5 years ago

Thanks a lot for the domains.

For now, I went the cookie way, as it seems to be the most future-proof one. But if this way turns out to be too complicated or annoying, I might fall back to using those other domains.

See version 0.3.7.

eckozen84 commented 5 years ago

@Wizell Javlibrary still does not seem to be working for me with the new release. Not sure if I'm doing something wrong but when I start the scrape, the scan box appears for 2 seconds and then closes. Running it on Windows 10.