e107inc / e107

e107 Bootstrap CMS (Content Management System) v2 with PHP, MySQL, HTML5, jQuery and Twitter Bootstrap. Issue Discussion Room: https://gitter.im/e107inc/e107
https://e107.org
GNU General Public License v3.0

[Question]: huge table e107_session #5262

Open ericc-dream opened 1 month ago

ericc-dream commented 1 month ago

Question

Hi

I have a question. I'm moving my site to different hosting and I noticed something annoying. When I exported my database from my old host's server, it was around 8 MB, give or take. I restored everything less than a week ago, and today I tried to export it again: the size is now 25 MB, more than three times larger! More specifically, the table e107_session alone is now more than 17 MB. It used to have 6 rows; it now has 17334, and 17333 of them have "session_user = 0".

I wonder: do you store a session ID for every visitor of the site in this table? That seems like a lot, considering my site has only 2 users, including myself. How can I limit or reduce that?

I see a lot of bots hitting my site and crawling the forum, mainly from "semrush". I guess they are the reason for the explosion of my database. Is there a simple way to block these bots?

A more general question: when I export my database, are there tables I can exclude? If a table doesn't exist, is the code able to recreate it? (I could export the "CREATE" statements but not the content, eventually.)
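For the export question above, a hedged sketch with mysqldump: dump the session table's schema only (so it can be recreated) and exclude its rows from the main dump. The database name and table name here are placeholders; adjust them to your install's actual name and table prefix.

# Keep the CREATE TABLE statement for the session table, but none of its rows:
mysqldump --no-data my_e107_db e107_session > session_schema.sql
# Dump everything else, skipping the session table entirely:
mysqldump --ignore-table=my_e107_db.e107_session my_e107_db > site_data.sql

Restoring session_schema.sql followed by site_data.sql gives you the full site with an empty session table, which is harmless since guest sessions are transient anyway.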

Expected Source

I checked the content of the current DB with the previous export

Additional Context

No response

Jimmi08 commented 1 month ago

Hi, it was discussed on gitter some time ago. A session record is created for bots too. I reported it; I'm not sure if an issue was created. It was around Thu, May 11, 2023; you can search for the word "sessions" there. I am copying the solution by @Deltik:

Although there currently isn't a way to clean up guest sessions sooner, you could create a cron job script like this:

<?php
// Cleanup script for guest sessions in the e107 session table.
// e107_config.php defines $mySQLserver, $mySQLdefaultdb, $mySQLcharset,
// $mySQLuser, $mySQLpassword, and $mySQLprefix.
require_once('e107_config.php');

$gc_maxlifetime = (int) ini_get('session.gc_maxlifetime');
if (!$gc_maxlifetime) {
    $gc_maxlifetime = 3600;
}

// Cutoff: with the default gc_maxlifetime of 3600, this targets sessions
// whose expiry timestamp is already in the past (i.e. about one hour old).
$expiration_time = time() - $gc_maxlifetime + 3600;

$dsn = "mysql:host=$mySQLserver;dbname=$mySQLdefaultdb;charset=$mySQLcharset";
$pdo = new PDO($dsn, $mySQLuser, $mySQLpassword);
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$table_name = $mySQLprefix . "session";

// Delete guest sessions in batches of 1000 to avoid long-running locks.
$query = "DELETE FROM `$table_name` WHERE (session_user = 0 OR session_user IS NULL) AND session_expires < ? LIMIT 1000";
$stmt = $pdo->prepare($query);

$total_deleted = 0;
do {
    $stmt->execute([$expiration_time]);
    $deleted_rows = $stmt->rowCount();
    $total_deleted += $deleted_rows;
} while ($deleted_rows > 0);

echo "Deleted " . $total_deleted . " rows from the $table_name table.\n";

(I haven't tested it myself, but it should delete guest sessions older than 1 hour.)
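To actually run such a script on a schedule, a sketch of a crontab entry; the script path, log path, and hourly frequency are assumptions to adjust for your hosting:

# Run the session cleanup script at the top of every hour (hypothetical paths):
0 * * * * php /path/to/site/session_cleanup.php >> /var/log/session_cleanup.log 2>&1

On shared hosting without shell access, the same script can usually be scheduled through the control panel's cron or "scheduled tasks" feature instead.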

ericc-dream commented 1 month ago

I have blocked them in .htaccess !

# allow all except those indicated here
<Files *>
order allow,deny
allow from all
#  SEMrushBot ban
deny from 46.229.164.98
deny from 46.229.164.100
deny from 46.229.164.101
deny from 85.208.96.192/27
deny from 185.191.171.0/27
</Files>

I don't see them anymore in the admin panel, and the number of entries in the table seems to be decreasing... but that was only 10 minutes ago, so it will take time.

rica-carv commented 1 month ago

@ericc-dream You'll have a lot of IPs to block; that's just a very, very small tip of the iceberg...

I've used some third-party software to block those bots right from the start, and I'll try to use it again when I get my site back up and running, because I had the same issue you have, and my DB was just growing every day...

For me, that software was the best and easiest approach.

Jimmi08 commented 1 month ago

@rica-carv I will add the badbehaviour script; on other CMSes they are very satisfied with it too.

The topic above (at least part of it) was about the lifetime of the guest session; I complained about this a long time ago. They should be deleted after an hour by core. Or you can set the session lifetime to an hour and you will not have this issue.
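The "set the session lifetime to an hour" suggestion maps to PHP's session garbage-collection settings. A minimal sketch for php.ini (or a per-site .user.ini, if the host honors overrides); whether e107's database session handler runs its cleanup from these values depends on its gc implementation, so verify on your own install:

```ini
; Treat idle sessions as garbage after one hour (3600 seconds).
session.gc_maxlifetime = 3600
; Give garbage collection a 1-in-100 chance of running on each request,
; so expired sessions are actually purged rather than only ignored.
session.gc_probability = 1
session.gc_divisor = 100
```

Lowering gc_maxlifetime shortens all sessions, including those of logged-in members, which is why the request below for separate guest and member lifetimes comes up.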

ericc-dream commented 1 month ago

I don't really mind bots... in the end, they can help bring some traffic. But this one, "Semrush", was way too aggressive! I always had at least 15 connections from it. Since I added the range in .htaccess, I haven't seen any, and the size of my DB is decreasing...

But could you share the name of the "third party software" that you used... in case of need?

rica-carv commented 1 month ago

> But if you can share the name of the "third party software" that you used ... in case of need ?

@ericc-dream CIDRAM. But supposedly e107 should have a forum spam blocker... Not the same thing, I know, but who knows if it might help?

Moc commented 1 month ago

The SFS plugin is used on registrations (or on already existing accounts) to check against the stopforumspam database. If a match is found, the registration is denied.

It does not block visitors/guests (bots or not), so it will not help solve this issue. Sessions will still be created.

Alex-e107nl commented 1 month ago

And expanding the .htaccess file, would these lines block bad bots?

# Block bad bots
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(libwww-perl|aesop_com_spiderman) HTTP_SAFE_BADBOT
Deny from env=HTTP_SAFE_BADBOT

Moc commented 1 month ago

You'll never get all bots blocked by using htaccess as you're always one step behind. It has become an entire business model to sell packages of IP ranges for bots to use, or for individuals to buy to then block those same IP addresses.

Additionally, using htaccess to block bots strains server performance and affects loading times.

The solution to this issue is not blocking but implementing proper session handling and cleanup.

Please refrain from further comments on blocking bots in this issue report.

@Deltik do you have some enhancements planned for session handling?

ericc-dream commented 1 month ago

To give an update: after blocking Semrush in .htaccess, the number of sessions dropped quickly. At one point, I deleted all the remaining rows with "user=0". Today, 2 days later, I have 304 rows... which is still high for a site with only 2 members, but a little more acceptable than the 17333 from the first message.

I agree with @Moc, we can't win at this game. htaccess was an emergency measure to block an overly aggressive bot, not a definitive solution.

Is there a way to change this session time from the admin pages? Could we consider having 2 different session times: a short one for non-logged-in visitors and a longer one for members?

And for my personal knowledge: what would be the consequence for a non-logged-in visitor if the session time were, for example, 2 minutes?

CaMer0n commented 3 weeks ago

The method e_user_model::isBot() could help with this.
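A hypothetical sketch of how that check might be used; the call site, the e107::getUser() bootstrap, and the skip-the-row policy are all assumptions for illustration, not core e107 code:

```php
<?php
// Hypothetical: avoid persisting a DB session row for detected crawlers.
// Assumes e107 is bootstrapped and e107::getUser() returns the user model
// exposing the isBot() method mentioned above.
$user = e107::getUser();

if (method_exists($user, 'isBot') && $user->isBot()) {
    // Possible policy: skip creating an e107_session row for bots,
    // or assign them a much shorter lifetime than logged-in members.
    // The exact integration point would have to live in core session handling.
}
```

Guarding with method_exists() keeps the sketch safe on installs where the user model predates isBot().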

Deltik commented 3 weeks ago

@Moc: Last I knew, guest sessions were triggered by the e107 CAPTCHA system, which stores the CAPTCHA answer in a session. I want to eliminate sessions for guests by storing the CAPTCHA answer somewhere else. (More info)