Closed Borisbee1 closed 5 years ago
I would like to add that I am experiencing the same issue.
Yes, same issue here for a while now :(
Cloudflare has already changed the challenge a few times. I tried many things, but the challenge changes too often and requires support for more and more browser functionality.
I'm considering some options. The first is to add a parameter that takes the required cookies and related information. That change would force the user to start a browser, export the data, and inject it into JAVMS.
Another option is for JAVMS to start Firefox with a specific profile, let it handle the challenge, and extract the information from the stored data.
Please feel free to share your opinion on the matter.
I've been playing around with it some myself. There is a working python scraper that can get around cloudflare. https://github.com/Anorov/cloudflare-scrape I was toying around with possibly adding a call to run that to get relevant info and then feed it back to JAVMS. I don't have a ton of experience with programming so it's been slow going for me, but I was able to make a quick Python script to parse in a page from javlibrary and read out the actor names. The harder part is feeding it back into JAVMS.
Any of your options would work as well. I hadn't thought of trying to feed JAVMS the cookie manually from browser.
I quickly read through the code. It also solves the challenge by rebuilding the challenge's JS code, using the same technique as the one currently used in JAVMS (except that it's more up to date and uses Requests and Node.js where JAVMS uses jsoup and Nashorn).
The main shortcoming of this method is resilience: the code can't withstand many changes. In JAVMS I tried to limit the impact by starting to mimic a browser (for example, mimicking the DOM web API). Moreover, for each change the challenge code must be analysed and the solver modified.
If possible I would also avoid adding more dependencies, since on some OSes they would make JAVMS more complex to run (on Windows, for example, you would need two more installers before being able to run JAVMS).
For those reasons I'm considering using a real browser to easily solve the challenge.
Feeding in the cookies (and at least the user agent as well) works, but you need to copy and paste them manually or use a plugin to export them.
ideas so far:
Copying manually is probably the easiest solution that won't break down the line.
Do you know how long the Cloudflare cookies remain valid?
they seem valid for a good while, I haven't timed it though. Probably at least 30 mins
If so, it might be a rather good solution. It would only require adding support for setting those values, plus documentation on how to extract the data from the browser.
It certainly lasts long enough to scan a set of movies so I think it's a good solution.
I just tested and it does work as expected. I will redo the test with the same cookie set in 1 hour and tomorrow in order to have a rough estimate of how painful it will be.
The next question is how to keep the export process from being too complex. May I have your opinion on this one? Would it be better to just explain how to get the values from the browser's web tools, to make a WebExtension that does it, or something else?
I don't know how to get that information myself, but I do know how to look at cookies. If it's not that hard I'd say a simple explanation should suffice. From what I've read browser agent and cookies need to match up in order to verify.
Webextension would certainly make life easier but I don't know how much time that would take to make or how hard and it's also another thing to maintain.
I will say that manually inputting cookies/agent can't be any harder than what I have to do now when trying to manually copy over actors from javlibrary.
Would you mind trying it? You are looking for the "__cfduid" and "cf_clearance" cookies and your browser's user agent.
I must admit I do not know how to look at those without the browser's web tools.
I plan on trying it a little later, yea. Going to take me some time though: I'll have to find my user agent, edit the code accordingly, and make sure everything works correctly.
Thank you. Please report back after you have tried.
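For reference, here is a minimal sketch of what "feeding the cookies and user agent" could look like on the Java side, using only the JDK 11+ `java.net.http` API. The class and method names (`ClearanceRequest`, `cookieHeader`, `build`) are hypothetical and not part of JAVMS, and the values passed in are whatever you copied from your browser. It only builds the request; it does not solve or bypass anything by itself.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not JAVMS code.
final class ClearanceRequest {

    // Joins cookies into one Cookie header value: "name1=value1; name2=value2".
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Builds a GET request carrying the two cookies named in this thread
    // plus the user agent of the browser they were exported from.
    static HttpRequest build(String url, String userAgent,
                             String cfduid, String cfClearance) {
        Map<String, String> cookies = new LinkedHashMap<>(); // keeps insertion order
        cookies.put("__cfduid", cfduid);
        cookies.put("cf_clearance", cfClearance);
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .header("Cookie", cookieHeader(cookies))
                .GET()
                .build();
    }
}
```

The request would then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The user agent has to be the one from the browser that obtained the cookies, since (as noted above) the two apparently need to match.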
You may also try it manually with a command like curl or wget.
My cookies have already expired, so they last less than 30 minutes.
Good news, it works for me. The cookie has also lasted about 30 mins so far, although I did refresh the page midway through so perhaps the cookie is refreshed if you're actively browsing. At least I can now scrape javlibrary manually when I need to by making hardcode changes.
I will note that the browser.addCookie() function in the browserConfigure() function inside JavLibraryParsingProfile.java doesn't seem to work correctly. I had to add the cookies directly into the connection inside DitzyHeadlessBrowser
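Since the observed lifetime is only around 30 minutes, it could help to remember when the cookies were exported and warn before starting a long scan. A small sketch, assuming a 30-minute lifetime taken from the rough timings in this thread (this is not a documented Cloudflare value, and `ClearanceAge` is a hypothetical name, not something in JAVMS):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not JAVMS code.
final class ClearanceAge {

    // Assumption: lifetime guessed from the observations in this thread.
    static final Duration ASSUMED_LIFETIME = Duration.ofMinutes(30);

    // True when the cookies are at least as old as the assumed lifetime.
    static boolean probablyExpired(Instant exportedAt, Instant now) {
        return Duration.between(exportedAt, now).compareTo(ASSUMED_LIFETIME) >= 0;
    }
}
```

Because refreshing the page may extend the cookie, this can only be a hint to re-export, not a reliable expiry check.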
Can we use the alternative domain for it? Currently it's on c32r.com (it's changed regularly, but when it changed, the old domain redirected to the new one). I don't know if it's also using cloudflare, but if it isn't then adding a configuration dialog to enter the alternative domain might be a simpler solution.
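To illustrate the configurable-domain idea, here is a minimal sketch that swaps only the host of a URL and keeps the scheme, path, and query as they are. `MirrorHost` and `withHost` are hypothetical names, the path in the usage note is a placeholder, and it assumes the alternative domain serves the same paths as the main site, which has not been verified here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not JAVMS code.
final class MirrorHost {

    // Returns a copy of the URI with its host replaced by newHost.
    static URI withHost(URI original, String newHost) {
        try {
            return new URI(original.getScheme(), original.getUserInfo(), newHost,
                    original.getPort(), original.getPath(),
                    original.getQuery(), original.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid host: " + newHost, e);
        }
    }
}
```

For example, `MirrorHost.withHost(URI.create("https://www.javlibrary.com/en/page?x=1"), "c32r.com")` gives `https://c32r.com/en/page?x=1`. The host value would come from a user setting, since the domain reportedly changes from time to time.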
c32r.com does not use cloudflare anti-bot protection.
But from a security point of view, I have some concerns, even though the scraper should not execute JavaScript code.
I don't understand why this website exists or what the actual links are between it and javlibrary.
What I quickly found out:
Can you tell me more about what you know about this domain? How did you find out it exists?
Before c32r.com, it was d28k.com (it still works and now redirects to c32r.com). I found it a long time ago through a Google search with the keyword "[any javid] javlibrary" (because I'm too lazy to open javlibrary and search from there); the alternative domain showed up in the results. Since then I have always used the alternative domain. My user account on javlibrary also works there, so it's either a mirror or simply an alternative domain pointing to the same server.
Thanks a lot for the domains.
For now, I went the cookie way as it seems to be the most future-proof one. But if it turns out to be too complicated or annoying, I might switch to using those other domains.
See version 0.3.7.
@Wizell Javlibrary still does not seem to be working for me with the new release. Not sure if I'm doing something wrong but when I start the scrape, the scan box appears for 2 seconds and then closes. Running it on Windows 10.
Describe the bug Looks like javlibrary isn't being scraped. Probably cloudflare issue again.