Closed codeblech closed 2 months ago
After banging my head for 3 days, I have figured out why instaloader doesn't work on the prod server. I recommend getting a bucket of popcorn because this shi- is crazy.
Instaloader is designed to do something that Instagram is designed to stop, which is scraping.
Instaloader can retrieve photos from public accounts, but only up to a limit. That limit is decided by Instagram. After exceeding it, Instagram will ask you to log in to view further posts.
Instaloader is aware of this, and to circumvent it you can add an account to instaloader, which it will use to retrieve posts. All is well until this point.
Shit starts to go down when Instagram detects that you are using bots (instaloader in this case) to retrieve posts and flags your account. When it flags your account, you receive a challenge to solve, and until you solve it you cannot access any posts.
Instaloader is also aware of this and prints the links to these challenges to the console, which an operator can manually open in a browser, after which Instagram will start to behave and you can scrape all you want.
When this scheduled task is run on PythonAnywhere, Instagram immediately flags PythonAnywhere's IP and sends it a challenge, which it cannot solve automatically.
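A sketch of that flow with instaloader's CLI (the account and profile names below are placeholders, and the command is only echoed as a dry run rather than executed):

```shell
#!/bin/sh
# Logging in with a real account makes instaloader save a session file,
# so a challenge solved once in a browser unblocks later runs.
IG_USER="throwaway_account"     # placeholder: account instaloader logs in as
TARGET="some_public_profile"    # placeholder: profile to scrape
CMD="instaloader --login=$IG_USER --fast-update $TARGET"
echo "$CMD"                     # dry run: print the command instead of running it
```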
I tried making an auto-updating GitHub repo to work as an image cache using GitHub Actions. I encountered the same problem:
If I don't add a login, a few posts download at the start, but afterwards they fail like this:
Accessing these links from your laptop doesn't work, as the challenges are tied to the IP of the GitHub Actions servers. So, they need to be accessed through them.
Not much but maybe... idk. Need to rack some brain cells.
We need to figure out a way to solve these challenges through the IP of the hosting servers. Maybe we can create a proxy server through which we can route our machine's traffic to the hosting server. Seems complex but doable.
Or someone needs to sacrifice their laptop as a server (we can figure out hosting through Cloudflare Tunnels and shit)
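One concrete version of the proxy idea: if the hosting box allowed SSH (PythonAnywhere and GitHub Actions runners generally don't, so this is a sketch for a machine you control, like the sacrificed laptop), an SSH SOCKS tunnel would let a browser open the challenge links from that machine's IP. The address is a placeholder and the command is only echoed:

```shell
#!/bin/sh
# Open a SOCKS5 tunnel through the hosting machine, then point the
# browser's SOCKS5 proxy at localhost:$PORT so challenge links are
# visited from that machine's IP, not your laptop's.
HOST="user@hosting-server"   # placeholder address
PORT=1080
CMD="ssh -N -D $PORT $HOST"  # -N: no remote command, -D: dynamic SOCKS forward
echo "$CMD"                  # dry run
```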
As for the storage problem, I have figured out that the GitHub CDN (githubusercontent.com) allows CORS to other websites, so we can make a separate repo which will cache the storage, and the main server just needs to index it.
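A quick way to sanity-check the CORS claim: fetch response headers with `curl -sI` against a file in the cache repo (the path below is hypothetical) and grep for `access-control-allow-origin`. The sample headers are hard-coded so the sketch runs offline; they show the shape raw.githubusercontent.com is expected to return:

```shell
#!/bin/sh
# Against a real file you'd run:
#   curl -sI https://raw.githubusercontent.com/OWNER/REPO/main/img.jpg
# Here we grep a sample response instead, so the check works offline.
HEADERS="HTTP/2 200
access-control-allow-origin: *
content-type: image/jpeg"
if printf '%s\n' "$HEADERS" | grep -qi '^access-control-allow-origin:'; then
  CORS_OK=yes   # header present: browsers on other sites may load the image
else
  CORS_OK=no
fi
echo "$CORS_OK"
```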
Fk Instagram
First of all, I think the GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR! As for the scraper blocking by IG... can't think of a solution rn... maybe we should look at how Google web crawlers do the scraping. 13ft.io pretends to be a Google bot to remove paywalls. Apart from that, can't think of any other solution rn...
> Fk Instagram
fr
> Or someone needs to sacrifice their laptop as a server (we can figure out hosting through Cloudflare Tunnels and shit)
naah bro...we'd rather do it manually, weekly atp
> First of all, I think the GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR!
Then could you make a separate repo, like jpgram-cdn or jpgram-lolz... lolz, or whatever?
yoooo what if we write a bash script to.... 👿 ...do this on our own machines. download everything & push to a repo.
That's what I had in mind.
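A sketch of that script. Every name here (account, profile, repo) is a placeholder, and `run` echoes each command as a dry run; swap its body for `"$@"` to actually execute:

```shell
#!/bin/sh
# Download posts locally with a logged-in session, then commit the
# files into the cache repo and push. All names are placeholders.
run() { echo "$@"; }            # dry run; use run() { "$@"; } for real
TARGET="some_public_profile"
CACHE_REPO="jpgram-cdn"
run instaloader --login=throwaway_account --fast-update "$TARGET"
run cp -r "$TARGET" "$CACHE_REPO"/
run git -C "$CACHE_REPO" add .
run git -C "$CACHE_REPO" commit -m "cache update: $TARGET"
run git -C "$CACHE_REPO" push origin main
```

Run from cron (or by hand, weekly) on whichever machine gets sacrificed.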
> yoooo what if we write a bash script to....
What if.... what if.... (hear me out), what if we write the server in bash? BASH FTW.....
server? you mean like...apache?
I was just kidding lol.
> Then could you make a separate repo, like jpgram-cdn or jpgram-lolz... lolz, or whatever?
This....
> server? you mean like...apache?
> I was just kidding lol.
that'd be crazy tho. chad move. checkmate zuck
> This....
ye ye so... i make a separate repo for the cdn, and make you collaborator...is that alright?
> ye ye so... i make a separate repo for the cdn, and make you collaborator...is that alright?
collaborator is not necessary. PRs are fine.
> 13ft.io pretends to be a Google bot to remove paywalls.
Daaang. That's smart and stupid. So, websites do this for SEO? paywall is just a hoax? damnnn.
> That's smart and stupid.
I mean the websites ofc.
> So, websites do this for SEO? paywall is just a hoax?
ye dude...they let all the web crawlers & creepers touch them however they please
> ye dude...they let all the web crawlers & creepers touch them however they please
freaky
creatively named repo
using the CDN now.
added scheduled tasks on PythonAnywhere.