Alternative approaches to extracting links

Cartmanishere commented 4 years ago

Zippyshare keeps changing their link construction logic.

I am relying on the assumption that they won't make it so hard that I can't solve it with some regex.

The day they decide to crack down on this, we will lose the ability to extract zippy links.

I am creating this issue so that if anyone has an alternative solution/idea on how something like what this repo does can be implemented in a reliable way, I would love to have a discussion around that.

DavidRT commented 4 years ago

Hey there, did you try Selenium, or something else that lets you get the content after all the JavaScript has loaded?

tyeeman commented 4 years ago

How does a browser do it? They never need updating. Forgive me if I'm simplifying too much but please explain what they do.

name01019 commented 4 years ago

Forgive me if I'm oversimplifying

Can we implement a strategy where we grab the href directly from a browser element inspector? <a id="dlbutton" href="***"> <div class="download"></div> </a> The link would be in the ***

Alternatively, if we look at the source of a zippyshare website, we see that the formula they use is pasted directly in there (something like (n + n * 2 + b)). Can the program just directly copy that formula and solve it?

Cartmanishere commented 4 years ago

@DavidRT I have thought about it, but I found using Selenium from Python unwieldy. In general, it can also be quite unreliable on pages where a lot of JavaScript runs.

That said, I haven't exactly applied selenium for this task. It is worth experimenting with.

@tyeeman That's the idea. We want to do something similar to what a browser does instead of using our own logic. But a browser is a very complex piece of software. It has a JavaScript engine in it for running JS code. It is not trivial to programmatically execute a webpage and its JS exactly the way a browser does.

But we only need to execute enough JS code to extract the link. Is that possible? Maybe. We'll have to experiment. One thing we'll definitely need is a JavaScript engine that can execute JS code from Python. There are some options for this, like Selenium and PhantomJS.

One more concern I have is that setting up the dependencies for such a script is not trivial. People with less experience will have trouble setting it up, and the script might also lose the ability to run cross-platform.

One other thing I'm considering is porting the script from Python to Node.js. In Node, we get a built-in V8 engine that works on all platforms, and we can leverage it to execute the webpage's JavaScript using something like jsdom.

@name01019 You are correct. We need to emulate a click on the Download Button and then extract the link that gets populated in the href field.
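
A minimal sketch of reading that href through a real browser, assuming Selenium 4 driving a headless Firefox (a stand-in for illustration, not something this repo ships). The function names, the timeout values, and the assumption that the page fills in `href` on the `dlbutton` element by itself are all hypothetical; only the `dlbutton` id comes from the markup quoted above.

```python
import time


def make_headless_driver():
    """Start a headless Firefox. Requires Selenium and Firefox to be installed."""
    # Imported lazily so extract_download_href() can be tried with any
    # driver-like object, even where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def extract_download_href(driver, page_url, timeout=10.0, poll=0.25):
    """Load page_url and return the href of the element with id 'dlbutton'.

    Polls until the page's own JavaScript has filled in the attribute,
    or raises TimeoutError after `timeout` seconds.
    """
    driver.get(page_url)
    deadline = time.monotonic() + timeout
    while True:
        # "id" is the locator strategy string behind selenium's By.ID.
        href = driver.find_element("id", "dlbutton").get_attribute("href")
        if href:
            return href
        if time.monotonic() >= deadline:
            raise TimeoutError(f"href was not populated within {timeout}s")
        time.sleep(poll)


# Usage (needs a real browser, so it is left commented out):
# driver = make_headless_driver()
# try:
#     print(extract_download_href(driver, "<page URL here>"))
# finally:
#     driver.quit()
```

Because the browser runs whatever script the page sends, nothing here depends on the formula itself. The cost is a browser startup and a full page load per link.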

Currently the script works exactly like you mentioned. It looks at the formula Zippyshare has sent, extracts the values of its variables from the page's code, and then evaluates the formula to get the value. But that leaves us exactly where we are right now: if Zippyshare changes their formula, we have to accommodate one more pattern. We are looking for a solution that is independent of whatever JS logic Zippyshare embeds in their webpage. If it works in the browser, it should work here as well.
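
To make the text-based approach concrete, here is a minimal sketch of a single pattern. The sample script, the regex, and the function names are hypothetical illustrations, not the repo's actual patterns: it pulls an arithmetic expression out of the inline script and evaluates it with a small whitelist of operators instead of `eval`.

```python
import ast
import operator
import re

# Hypothetical sample of an inline script; real pages differ and keep changing.
SAMPLE_SCRIPT = """
document.getElementById('dlbutton').href = "/d/AbCd1234/" + (51 % 7 + 3 * 2 + 10) + "/example.zip";
"""

# One text pattern: "prefix" + (arithmetic) + "suffix"
HREF_PATTERN = re.compile(
    r'''getElementById\('dlbutton'\)\.href\s*=\s*"([^"]*)"\s*\+\s*\(([^;"]+)\)\s*\+\s*"([^"]*)"'''
)

# Only these operators are accepted. Note that Python's % differs from
# JavaScript's for negative operands, so this only suits non-negative values.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Mod: operator.mod,
}


def eval_arithmetic(expr: str) -> int:
    """Evaluate an integer arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr.strip(), mode="eval"))


def extract_link(script: str) -> str:
    """Return the link path built by the script, or raise if the pattern fails."""
    match = HREF_PATTERN.search(script)
    if match is None:
        raise ValueError("pattern did not match; the page logic may have changed")
    prefix, expr, suffix = match.groups()
    return f"{prefix}{eval_arithmetic(expr)}{suffix}"


print(extract_link(SAMPLE_SCRIPT))  # -> /d/AbCd1234/18/example.zip
```

The weakness is visible in `HREF_PATTERN`: as soon as the page introduces variables, function calls, or a different statement shape, the regex stops matching and a new pattern has to be written.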

tyeeman commented 4 years ago

Take a look here - https://github.com/victor-oliveira1/ZippyDown/issues/4

They have the same problem and supposedly have fixed it with Selenium and PhantomJS. Maybe you can do something similar. It's worth a read.

Cartmanishere commented 4 years ago

That looks promising @tyeeman. I will experiment with PhantomJS and Selenium.

Edit: Based on initial experimentation, extracting links using PhantomJS and Selenium works. It is even quite simple to set up and use. But it is 10x slower than the current way.

For a few links, the overhead is not much of a problem. But when the number of links you're trying to download goes into the hundreds, it will slow things down considerably.

I can look at ways to optimize the PhantomJS and Selenium approach. At this point, I think we can include it as a fallback, so that it is used only when all the text-based patterns fail.
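
The fallback idea could look roughly like the sketch below. The names `extract_with_fallback`, `fast_extractors`, and `slow_fallback` are hypothetical, and it assumes each extractor takes a page URL and either returns a link, returns `None`, or raises when its pattern does not match.

```python
from typing import Callable, Iterable, Optional

# An extractor maps a page URL to a download link, or None if it cannot.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    page_url: str,
    fast_extractors: Iterable[Extractor],
    slow_fallback: Extractor,
) -> Optional[str]:
    """Try the cheap text-based extractors first; use the browser only if all fail."""
    for extractor in fast_extractors:
        try:
            link = extractor(page_url)
        except Exception:
            # This pattern no longer matches the page; try the next one.
            continue
        if link:
            return link
    # Every text pattern failed, so pay the cost of the slow path.
    return slow_fallback(page_url)
```

With this ordering, the roughly 10x slower browser path is only paid for links that the text patterns cannot handle, so a large batch stays fast as long as the patterns keep working.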

SebastianMoyano commented 4 years ago

Hi, just to add a little: I tend to download videos from Zippyshare, and on many of those pages the video source appears, so I download it from there. This is my alternative method when I can't get the formula. But they keep changing it, and not all of their videos show the source. At least it is a direct way to download some video files without the formula.

saneFriend commented 4 years ago

Tried this script for the first time today with my own links and the provided example link. All patterns fail to extract the links. Did they perhaps update the patterns?

NextDev65 commented 4 years ago

Yep, it's still not working for me.

Cartmanishere commented 4 years ago

@saneFriend @Matterhorn56 The script seems to be working. If you are facing a problem, can you open a new issue with some example links so we can discuss and resolve it there?

Cartmanishere commented 3 years ago

Added a Python-based JavaScript implementation to execute the JS code instead of using a complete browser.

Cartmanishere / zippyshare-scraper