Secretchronicles / TSC

An open source two-dimensional platform game.
https://secretchronicles.org/
GNU General Public License v3.0

Site Scrape... #246

Closed datahead8888 closed 9 years ago

datahead8888 commented 10 years ago

It's pretty obvious secretmaryo.org is on a downward trend and is at risk of being shut down (I don't think registration of IDs is even turned on anymore right now). This is a task to obtain a site scrape so that we have a copy of the site for personal use. I've currently been using HTTrack for this.

Items:

I accidentally lost my last site scrape but am currently running it again locally. I may try to set it up on the Alexandria server later if I actually get enough time to read up on the command-line version of HTTrack.
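
For reference, the basic command-line invocation looks roughly like this (the output path and filter below are placeholders, not exactly what I ran):

```
# Mirror the site into a local directory, staying on the secretmaryo.org domain.
# The output path and filter pattern are placeholders.
httrack "http://secretmaryo.org/" -O /home/datahead/smc-scrape "+*secretmaryo.org/*" -v
```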

See also: http://forum.secretchronicles.de/topics/72

Quintus commented 10 years ago

@datahead8888 How is the progress? I just tracked down a spambot with 1000+ posts on the old SMC grounds again, together with several smaller but no less aggressive bots. I basically consider the Off-Topic forum lost; it’s impossible to clear them all out of it. I’m pretty sure that, given how frequently new spambots keep arriving, time is running out. We cannot hold them off much longer, hence we need the backup pretty quickly, before these damned spambots finally win the battle.

Vale, Quintus

datahead8888 commented 10 years ago

@datahead8888 How is the progress?

The site scraper said it was finished on my local machine. I would have double checked it already, but, unfortunately, I've been really under the gun on school research, etc. recently. I have to prioritize school in cases like this.

given how frequently new spambots keep arriving, time is running out.

Yes, this was a concern of mine, though I didn't realize we've got spam bots posting 1000+ posts. I've tried to ban a few posters in the last few weeks but haven't had time to clean stuff up on secretmaryo.org. Aside from the bad scrapes we'd get as a result of those 1000+ posts, FluXy may eventually shut the forum down if it becomes a spam board.

There's a reasonable chance I can double-check my local site scrape this weekend to see whether it was successful. I also don't think I'm going to be able to set it up on the Alexandria server anytime soon due to time constraints. I'm sorry for the trouble.

datahead8888 commented 10 years ago

@Quintus, I just checked over my scrape a bit now.

Most of the links were looking good. I'm guessing it terminated early during my rerun because it didn't actually wipe the entire scrape out.

I noticed some bad links for some of the very old posts in the New Graphics forum section (I think this was the reason I had attempted to rerun the site scrape). Other than that, each link I tried was working.

I'd like to ftp my scrape to the Alexandria server some time soon. What would it take to log into the Alexandria server from Windows - do I need to transfer my public key from Ubuntu Linux to Windows? I usually need Windows for schoolwork, so this would make it easier for me to access the server. I would probably just use PuTTY to connect.

It would be reasonable to try another site scrape to see if we can get a better one, but it's probably going to have to be run from scratch. The Alexandria server would probably be the best place to configure a new one and try again, as you said.

Quintus commented 10 years ago

datahead8888 notifications@github.com writes:

Most of the links were looking good. I'm guessing it terminated early during my rerun because it didn't actually wipe the entire scrape out.

Great!

I'd like to ftp my scrape to the Alexandria server some time soon. What would it take to log into the Alexandria server from Windows - do I need to transfer my public key from Ubuntu Linux to Windows? I usually need Windows for schoolwork, so this would make it easier for me to access the server. I would probably just use PuTTY to connect.

I’ve never used a Windows machine for SSH, so I can’t be of any help here. I can only tell you that for authentication, you of course need your private key or the server won’t let you in. You can also add a second keypair to your ~/.ssh/authorized_keys file on the server if you don’t want to share the private key between Linux and Windows (which I wouldn’t recommend for security reasons) -- just generate a new keypair on Windows, and use your existing private key on Linux to transfer the new public key to your ~/.ssh/authorized_keys.
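
Roughly, it would go something like this (just a sketch; the file name and host below are placeholders):

```
# From the Linux machine: append the new Windows public key (exported in
# OpenSSH format from PuTTYgen) to the server's authorized_keys.
ssh-copy-id -i windows_key.pub datahead@alexandria

# Or, without ssh-copy-id:
cat windows_key.pub | ssh datahead@alexandria 'cat >> ~/.ssh/authorized_keys'
```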

It would be reasonable to try another site scrape to see if we can get a better one, but it's probably going to have to be run from scratch. The Alexandria server would probably be the best place to configure a new one and try again, as you said.

It could even be set up as a Cron job, so it runs automatically every month or so.
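
Something along these lines in the crontab would do, just as a sketch (the schedule, URL, and paths are placeholders):

```
# Re-mirror the site at 03:00 on the 1st of every month, non-interactively (-q).
0 3 1 * * httrack "http://secretmaryo.org/" -O /srv/backups/smc-scrape "+*secretmaryo.org/*" -q
```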

Vale, Quintus

Luiji commented 9 years ago

@datahead8888:

  1. Download PuTTYgen from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  2. Follow these instructions: http://katsande.com/using-puttygen-to-generate-ssh-private-public-keys
  3. Download FileZilla from https://filezilla-project.org/
  4. Follow these instructions: http://tecadmin.net/import-private-key-in-filezilla/
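
If you would rather upload from the Windows command line instead of FileZilla, PuTTY's pscp tool also works; roughly (the key file, host, and target path below are placeholders):

```
rem Upload the tarball using the converted .ppk private key.
pscp -i C:\keys\alexandria.ppk SMCWebSite.tar.gz datahead@alexandria:/home/datahead/
```
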
datahead8888 commented 9 years ago

I've placed SMCWebSite.tar.gz in /home/datahead/SMC-Website-Backup on the Alexandria server.

While testing the website on Ubuntu (rather than Windows as before), Firefox was not recognizing .php files. I can double check this again in Windows later.

As discussed earlier, there should be a small number of dead links in this scrape. We will probably want to try again later and configure scrapes from the Alexandria server. This scrape, however, is much better than having no scrape.

Quintus commented 9 years ago

datahead8888 notifications@github.com writes:

I've placed SMCWebSite.tar.gz in /home/datahead/SMC-Website-Backup on the Alexandria server for now.

Errm, you know that you should keep permissions on your home directory restrictive so that others can’t look into it? Also, that directory will not be found and served by Apache httpd.

While testing the website on Ubuntu (rather than Windows as before), Firefox was not recognizing .php files. I can double check this again in Windows later

There is no PHP even installed on the server. But the site scrape can only have yielded static HTML anyway, so it should just be displayed as-is. Place it in /srv/http/users/datahead and we’ll see further.
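
Something like the following should do it (sketch only; adjust if the tarball’s internal layout differs):

```
# Unpack the backup into the web-served user directory.
mkdir -p /srv/http/users/datahead
tar -xzf /home/datahead/SMC-Website-Backup/SMCWebSite.tar.gz -C /srv/http/users/datahead
```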

Vale, Quintus

datahead8888 commented 9 years ago

Errm, you know that you should keep permissions on your home directory restrictive so that others can’t look into it?

It's just a site scrape file, but sure, I should look into my home directory permissions.
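
If I understand it right, something like this should lock it down (I'll double-check before running it):

```
# Remove group/other access so only the owner can read the home directory.
chmod 700 /home/datahead
```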

Also, that directory will not be found and served by Apache httpd.

I couldn't find the correct directory last night so I dumped it somewhere for now. I will try again later.

datahead8888 commented 9 years ago

The site scrape can be viewed at: http://team.secretchronicles.de/~datahead/MyWebSites/index.html

We will probably want to make a new site scrape run from the Alexandria server later. Also note this one may have some dead links in it if you look hard enough.

I propose we close this task and open a new one in the website repo for an up-to-date scrape run from Alexandria. We can worry later about which directory the final scrape should live in so everyone can access it.