codeblech / jpgram

zero effort college magazine for jaypee
https://jiit.pythonanywhere.com/
MIT License

add ci/cd : the actual self-updating bit :) #10

Closed codeblech closed 2 months ago

codeblech commented 2 months ago

Added scheduled tasks on PythonAnywhere.

codelif commented 2 months ago

After banging my head for 3 days, I have figured out why Instaloader doesn't work on the prod server. I recommend getting a bucket of popcorn because this shi- is crazy.

Prologue

Instaloader is designed to do something that Instagram is designed to stop, which is, scraping.

What went wrong?

Instaloader can retrieve photos from public accounts, but only up to a limit. That limit is decided by Instagram. After you exceed it, Instagram asks you to log in to view further posts.

Instaloader is aware of this, and to get around it you can add an account to Instaloader, which it will use to retrieve posts. All is well until this point.

Shit starts to go down when Instagram detects that you are using bots (Instaloader in this case) to retrieve posts and flags your account. When it flags your account, you receive a challenge to solve; until you solve it, you cannot access any posts.

What does this mean for Instaloader?

Instaloader is also aware of this and prints the links to these challenges to the console. An operator can manually open them in a browser, after which Instagram starts to behave and you can scrape all you want.
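For an unattended run, one option is to capture the scraper's console output and pull out any challenge links so an operator can be pinged. A minimal sketch, assuming the links are ordinary Instagram URLs with `/challenge/` in the path; the exact text Instaloader prints isn't shown in this thread, so the sample log text and the `find_challenge_links` helper are made up for illustration:

```python
import re

# Assumption: challenge links are instagram.com URLs with "/challenge/" in the path.
CHALLENGE_URL = re.compile(r"https://(?:www\.)?instagram\.com/challenge/[^\s\"'<>]*")


def find_challenge_links(console_output: str) -> list[str]:
    """Return unique challenge URLs in the order they first appear."""
    seen: dict[str, None] = {}
    for match in CHALLENGE_URL.finditer(console_output):
        # Drop trailing punctuation that may follow a URL in a log sentence.
        seen.setdefault(match.group(0).rstrip(".,)"), None)
    return list(seen)


# Made-up log text, only to exercise the helper.
sample = (
    "error: checkpoint required, open https://www.instagram.com/challenge/abc123/ in a browser\n"
    "retrying...\n"
    "error: checkpoint required, open https://www.instagram.com/challenge/abc123/ in a browser\n"
)
print(find_challenge_links(sample))
```

This only finds the links; someone still has to open them from the flagged IP.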

So how does this relate to Instaloader not working on PythonAnywhere?

When this scheduled task runs on PythonAnywhere, Instagram immediately flags PythonAnywhere's IP and sends it a challenge, which it cannot solve automatically.

I tried making an auto-updating GitHub repo to work as an image cache using GitHub Actions. I encountered the same problem: [image]

If I don't add a login, a few posts download at the start, but after that they fail like this: [image]

Accessing these links from your laptop doesn't work, as these challenges are tied to the IP of the GitHub Actions servers. So they need to be accessed through those servers.

What can we do?

Not much but maybe... idk. Need to rack some brain cells.

We need to figure out a way to solve these challenges through the IP of the hosting servers. Maybe we can create a proxy server through which we can route our machine's traffic to the hosting server. Seems complex but doable.
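On the client side, the proxy idea could look roughly like this: a sketch that routes requests through an HTTP proxy using only the standard library. It assumes a proxy is already running on the hosting server and is reachable at `PROXY_URL`, which is a placeholder. Whether PythonAnywhere or GitHub Actions even allow running such a proxy is not established here.

```python
import urllib.request

# Placeholder: address of a proxy assumed to be running on the hosting server
# (for example, exposed locally through an SSH tunnel).
PROXY_URL = "http://127.0.0.1:8080"


def build_proxied_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS requests through proxy_url."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


opener = build_proxied_opener(PROXY_URL)
# opener.open(some_url) would then reach the site from the proxy's IP,
# provided the proxy is actually up; no request is made in this sketch.
```

Since the challenges are meant to be opened in a browser, in practice the operator's browser would be pointed at the same proxy.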

Or someone needs to sacrifice their laptop as a server (we can figure out hosting through Cloudflare tunnels and shit)

As for the storage problem, I have figured out that GitHub's CDN (githubusercontent.com) allows CORS requests from other websites, so we can make a separate repo to act as cached storage, and the main server just needs to index it.
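A rough sketch of the indexing side, assuming the cache repo keeps images in per-account folders and is served from `raw.githubusercontent.com`. The owner, repo, branch, and file paths below are placeholders, not the real layout:

```python
from urllib.parse import quote

RAW_BASE = "https://raw.githubusercontent.com"


def raw_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build the raw-content URL for one file in a GitHub repo."""
    return f"{RAW_BASE}/{owner}/{repo}/{branch}/{quote(path)}"


def build_index(owner: str, repo: str, branch: str, paths: list[str]) -> dict[str, list[str]]:
    """Group image URLs by top-level folder (assumed to be the account name)."""
    index: dict[str, list[str]] = {}
    for path in sorted(paths):
        account = path.split("/", 1)[0]
        index.setdefault(account, []).append(raw_url(owner, repo, branch, path))
    return index


# Placeholder values for illustration only.
index = build_index(
    "example-owner", "example-cache", "main",
    ["club_a/2024-01-02.jpg", "club_b/2024-01-03.jpg", "club_a/2024-01-01.jpg"],
)
print(index["club_a"])
```

The main server would only need the list of paths (for example from a manifest file committed alongside the images) to serve pages without touching Instagram.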

Conclusion

Fk Instagram

codeblech commented 2 months ago

First of all, I think GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR! As for the scraper blocking by IG... can't think of a solution rn... maybe we should look at how Google web crawlers do the scraping. 13ft.io pretends to be a Google bot to remove paywalls. Apart from that, can't think of any other solution rn...
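For reference, "pretending to be a Google bot" usually just means sending Googlebot's User-Agent header. A minimal sketch of that, with no request actually sent. The target URL is a placeholder, and nothing in this thread shows that Instagram treats such requests any differently (sites can verify real Googlebot traffic by IP):

```python
import urllib.request

# One of the User-Agent strings Google documents for Googlebot.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def as_googlebot(url: str) -> urllib.request.Request:
    """Return a request for url that identifies itself with Googlebot's User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


req = as_googlebot("https://example.com/some-article")  # placeholder URL
print(req.get_header("User-agent"))  # urllib stores header names capitalized
```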

codeblech commented 2 months ago

> Fk Instagram

fr

codeblech commented 2 months ago

> Or someone needs to sacrifice their laptop as a server (we can figure out hosting through Cloudflare tunnels and shit)

naah bro...we'd rather do it manually, weekly atp 😭

codelif commented 2 months ago

> First of all, I think GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR!

Then could you make a separate repo like jpgram-cdn or jpgram-lolz lolz or whatever?

codeblech commented 2 months ago

yoooo what if we write a bash script to.... 👿 ...do this on our own machines. download everything & push to a repo.

codelif commented 2 months ago

> yoooo what if we write a bash script to.... 👿 ...do this on our own machines. download everything & push to a repo.

That's what I had in mind
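A possible shape for that script, written in Python rather than bash to match the rest of the project. It only assembles the commands and prints them unless `dry_run=False`. The account names, login user, and checkout path are placeholders, and it assumes the `instaloader` CLI and `git` are installed on the machine:

```python
import subprocess
from pathlib import Path


def build_commands(profiles: list[str], login_user: str, message: str) -> list[list[str]]:
    """Assemble the download and git commands as argument lists."""
    download = ["instaloader", f"--login={login_user}", "--fast-update", *profiles]
    return [
        download,
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def sync(repo_dir: Path, commands: list[list[str]], dry_run: bool = True) -> None:
    """Run each command inside repo_dir, or just print it when dry_run is set."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, cwd=repo_dir, check=True)


# Placeholder values for illustration only.
commands = build_commands(["club_a", "club_b"], "some_user", "update image cache")
sync(Path("path/to/cache-repo"), commands)
```

Note that `git commit` exits non-zero when nothing changed, so a real script would need to handle the "no new posts" case instead of relying on `check=True`.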

codelif commented 2 months ago

> yoooo what if we write a bash script to....

What if.... what if.... (hear me out), what if we write the server in bash? BASH FTW.....

codeblech commented 2 months ago

server? you mean like...apache?

codelif commented 2 months ago

> server? you mean like...apache?

I was just kidding lol.

codelif commented 2 months ago

>> First of all, I think GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR!

> Then could you make a separate repo like jpgram-cdn or jpgram-lolz lolz or whatever?

This....

codeblech commented 2 months ago

>> server? you mean like...apache?

> I was just kidding lol.

that'd be crazy tho. chad move. checkmate zuck

codeblech commented 2 months ago

>>> First of all, I think GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR!

>> Then could you make a separate repo like jpgram-cdn or jpgram-lolz lolz or whatever?

> This....

ye ye so... i make a separate repo for the cdn, and make you collaborator...is that alright?

codelif commented 2 months ago

>>>> First of all, I think GitHub CDN would do the job. In case you wanna implement that, go right ahead with a PR!

>>> Then could you make a separate repo like jpgram-cdn or jpgram-lolz lolz or whatever?

>> This....

> ye ye so... i make a separate repo for the cdn, and make you collaborator...is that alright?

Collaborator is not necessary. PRs are fine.

codelif commented 2 months ago

> how Google web crawlers do the scraping. 13ft.io pretends to be a Google bot to remove paywalls. Apart from that, can't think of any other solution rn...

Daaang. That's smart and stupid. So, websites do this for SEO? paywall is just a hoax? damnnn.

codelif commented 2 months ago

> That's smart and stupid.

I mean the websites ofc.

codeblech commented 2 months ago

>> how Google web crawlers do the scraping. 13ft.io pretends to be a Google bot to remove paywalls. Apart from that, can't think of any other solution rn...

> Daaang. That's smart and stupid. So, websites do this for SEO? paywall is just a hoax? damnnn.

ye dude...they let all the web crawlers & creepers touch them however they please

codeblech commented 2 months ago

https://github.com/codeblech/jpgram-cdn

codelif commented 2 months ago

>>> how Google web crawlers do the scraping. 13ft.io pretends to be a Google bot to remove paywalls. Apart from that, can't think of any other solution rn...

>> Daaang. That's smart and stupid. So, websites do this for SEO? paywall is just a hoax? damnnn.

> ye dude...they let all the web crawlers & creepers touch them however they please

freaky

codeblech commented 2 months ago

creatively named repo

codeblech commented 2 months ago

using cdn now.