Geeklog-Core / geeklog

Geeklog - The Secure CMS.
https://www.geeklog.net
25 stars 19 forks source link

Converting IP Address Upgrade Process takes to long and times out #1127

Closed eSilverStrike closed 2 years ago

eSilverStrike commented 2 years ago

So the function update_TablesContainingIPAddresses222 times out on my development server (it is not a fast server to begin with) after going through 4000 of 6000 comment records. PHP time out was set to 120 seconds.

This hasn't even gotten to the likes table which has over 90,000 records I believe

So all in all my main Geeklog site will have close to 100,000 records in the ipaddress table when all said and done. Geeklog.net will probably have a similar number.

We need to come up with a way to make it faster to hopefully run on most Geeklog websites. For those sites like Geeklog.net that have been around for a while and has thousands of records I need to figure out a way to at least get the upgrade to finish and then run a script afterwards manually just to finish the IP stuff.

@mystralkk I've come up with a script which I hope will help but need you to look at it since I am wondering about a few things as I am not 100% sure how IP Anonymization functions.

First question - in the gl_ip_addresses should the ipaddress column always be unique and not duplicated? If so should we add a unique index to it?

Will this script work? (I think it mimics what update_TablesContainingIPAddresses222 currently does)

It will only insert an IP address if it doesn't already exist in the gl_ip_addresses table.

ALTER TABLE gl_comments ADD COLUMN seq INT NOT NULL DEFAULT 0;
ALTER TABLE gl_commentsubmissions ADD COLUMN seq INT NOT NULL DEFAULT 0;
ALTER TABLE gl_likes ADD COLUMN seq INT NOT NULL DEFAULT 0;
ALTER TABLE gl_sessions ADD COLUMN seq INT NOT NULL DEFAULT 0;
ALTER TABLE gl_trackback ADD COLUMN seq INT NOT NULL DEFAULT 0;

INSERT INTO gl_ip_addresses (ipaddress, created_at)  
SELECT ipaddress, UNIX_TIMESTAMP() FROM gl_comments c WHERE c.ipaddress NOT IN (SELECT ipaddress FROM gl_ip_addresses i WHERE c.ipaddress = i.ipaddress);

INSERT INTO gl_ip_addresses (ipaddress, created_at)  
SELECT ipaddress, UNIX_TIMESTAMP() FROM gl_commentsubmissions c WHERE c.ipaddress NOT IN (SELECT ipaddress FROM gl_ip_addresses i WHERE c.ipaddress = i.ipaddress);

INSERT INTO gl_ip_addresses (ipaddress, created_at)  
SELECT ipaddress, UNIX_TIMESTAMP() FROM gl_likes l WHERE l.ipaddress NOT IN (SELECT ipaddress FROM gl_ip_addresses i WHERE l.ipaddress = i.ipaddress);

INSERT INTO gl_ip_addresses (ipaddress, created_at)  
SELECT remote_ip, UNIX_TIMESTAMP() FROM gl_sessions s WHERE s.remote_ip NOT IN (SELECT ipaddress FROM gl_ip_addresses i WHERE s.remote_ip = i.ipaddress);

INSERT INTO gl_ip_addresses (ipaddress, created_at)  
SELECT ipaddress, UNIX_TIMESTAMP() FROM gl_trackback t WHERE t.ipaddress NOT IN (SELECT ipaddress FROM gl_ip_addresses i WHERE t.ipaddress = i.ipaddress);

UPDATE gl_comments c SET c.seq = (SELECT seq FROM gl_ip_addresses i WHERE i.ipaddress = c.ipaddress LIMIT 1);
UPDATE gl_commentsubmissions c SET c.seq = (SELECT seq FROM gl_ip_addresses i WHERE i.ipaddress = c.ipaddress LIMIT 1);
UPDATE gl_likes l SET l.seq = (SELECT seq FROM gl_ip_addresses i WHERE i.ipaddress = l.ipaddress LIMIT 1);
UPDATE gl_sessions s SET s.seq = (SELECT seq FROM gl_ip_addresses i WHERE i.ipaddress = s.remote_ip LIMIT 1);
UPDATE gl_trackback t SET t.seq = (SELECT seq FROM gl_ip_addresses i WHERE i.ipaddress = t.ipaddress LIMIT 1);

ALTER TABLE gl_comments DROP COLUMN ipaddress;
ALTER TABLE gl_commentsubmissions DROP COLUMN ipaddress;
ALTER TABLE gl_likes DROP COLUMN ipaddress;
ALTER TABLE gl_sessions DROP COLUMN remote_ip;
ALTER TABLE gl_trackback DROP COLUMN ipaddress;
mystralkk commented 2 years ago

First question - in the gl_ip_addresses should the ipaddress column always be unique and not duplicated? If so should we add a unique index to it?

No, ipaddress column shouldn't be unique at all. If so, I would have made the ipaddress column the primary key.

eSilverStrike commented 2 years ago

Hmmm I thought that might be the case.

My script just assigns all records that have the same IP to the same record in the ip_addresss table. Is this okay since this is when the table was created and only happens at this point? Should everything still work with the IP anonymize feature?

If not, we need to think of a way to speed up this process or upgrade processes that have thousands of records to deal with, will fail.

eSilverStrike commented 2 years ago

@mystralkk regarding:

My script just assigns all records that have the same IP to the same record in the ip_addresss table. Is this okay since this is when the table was created and only happens at this point? Should everything still work with the IP anonymize feature?

Or should each record from gl_comments, gl_comments submissions, etc have its own record in the ip_addresss table?

Once we figure this out I will be running a stagging website for geeklog.net to test out the install with it.

mystralkk commented 2 years ago

My script just assigns all records that have the same IP to the same record in the ip_addresss table. Is this okay since this is when the table was created and only happens at this point? Should everything still work with the IP anonymize feature?

Or should each record from gl_comments, gl_comments submissions, etc have its own record in the ip_addresss table?

I assumed the latter case, but as you explain, the former case should work fine. So go ahead with your script.

eSilverStrike commented 2 years ago

Okay I updated the process. I actually did it the right way (the second option) so each record will have it's own IP record.

I also added a warning before this upgrade.

I got the upgrade process down to about 1:30min (dealing with about 100,000 ips) which is much better than the +10 mins of my first way. I was able to get rid of the sub queries by inserting some temp ids for the records in the IP Address table which is used by the update sql statement. (and then removing the temp ids afterwards).

I'll now have to test it with a copy of geeklog.net

eSilverStrike commented 2 years ago

@mystralkk I guess this ip address table will get quite large at some point and can never be trimmed?

Once sessions are cleared does it also remove the associated records from ip_addresses?

This is on PHP 8.1 (I am assuming the issue is their on PHP 7)

Update So it looks like it does remove the session IP addresses when closed (should probably still be confirmed for all cases)

I was having some issues where they were staying but it looks like it may have been a session issue from multiple testing of the install and upgrade. Once I deleted all the PHP session files from the web server, session records from the Geeklog Database, and the session cookies things seemed to start working correctly.

The only thing I see not working is once the install, upgrade, or migrate is complete and the user is on the success page of the install the IP Address for the session is a blank for some reason. This will remain a blank until that user session logs into Geeklog. If other anonymous users visit from a different browser things work fine.

eSilverStrike commented 2 years ago

BTW it looks like Geeklog.net has over 350,000 records

eSilverStrike commented 2 years ago

Final issue fixed with issue #1130