Phase 5: Create Index. `42P01: relation "core.new_search_objects" does not exist`

michelescarlato commented 2 months ago

31/08/2024 : 19:01 :   **** SETUP ****

31/08/2024 : 19:01 :   transfer data =  False
31/08/2024 : 19:01 :   create core =  False
31/08/2024 : 19:01 :   create json =  False
31/08/2024 : 19:01 :   do statistics =  False
31/08/2024 : 19:01 :   do iec =  False
31/08/2024 : 19:01 :   do indices =  True
31/08/2024 : 19:01 :   FTW tables recreated

31/08/2024 : 19:01 :   **** SETTING UP STUDY SEARCH TABLES ****

31/08/2024 : 19:01 :   **** CREATING JSON STUDY DATA ****

31/08/2024 : 19:01 :   **** UNHANDLED EXCEPTION ****

+++++++++++++++++++++++++++++++++++++++
31/08/2024 : 19:01 :   ***ERROR*** MDR_Aggregator application aborted

42P01: relation "core.new_search_objects" does not exist

POSITION: 76

   at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|222_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult()
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior)
   at Npgsql.NpgsqlCommand.ExecuteDbDataReader(CommandBehavior behavior)
   at Dapper.SqlMapper.ExecuteReaderWithFlagsFallback(IDbCommand cmd, Boolean wasClosed, CommandBehavior behavior) in /_/Dapper/SqlMapper.cs:line 1066
   at Dapper.SqlMapper.QueryImpl[T](IDbConnection cnn, CommandDefinition command, Type effectiveType)+MoveNext() in /_/Dapper/SqlMapper.cs:line 1094
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at Dapper.SqlMapper.Query[T](IDbConnection cnn, String sql, Object param, IDbTransaction transaction, Boolean buffered, Nullable`1 commandTimeout, Nullable`1 commandType) in /_/Dapper/SqlMapper.cs:line 734
   at MDR_Aggregator.JSONStudyDataLayer.FetchObjectDetails(Int32 study_id) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\SearchStudyHelpers\JSONStudyDataLayer.cs:line 273
   at MDR_Aggregator.JSONStudyProcessor.CreateStudySearchResult(JSONFullStudy st) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\SearchStudyHelpers\JSONStudyProcessor.cs:line 427
   at MDR_Aggregator.SearchHelperJson.LoopThroughStudyRecords(Int32 offset) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\SearchHelpers\SearchHelperJSON.cs:line 95
   at MDR_Aggregator.CoreSearchBuilder.CreateJSONStudyData(Boolean create_table, Int32 offset) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\SearchHelpers\CoreSearchBuilder.cs:line 35
   at MDR_Aggregator.Aggregator.AggregateData(Options opts) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\TopLevelClasses\Aggregator.cs:line 244
   at Program.<Main>$(String[] args) in C:\Users\ecrinadmin\RiderProjects\MDR_Aggregator\TopLevelClasses\Program.cs:line 62
+++++++++++++++++++++++++++++++++++++++

31/08/2024 : 19:01 :   **** CLOSING LOG ****

michelescarlato commented 2 months ago

The core schema is inside MDR. It contains 28 tables, divided into 11 objects_*, 13 study_*, 1 data_objects, 1 studies, and new_search_studies and new_search_studies_json.

The problem is that the new_search_objects doesn't exist.

Analyzing the code:

The CoreStudyTableBuilder.cs create 12 study_* and 1 studies :

core.studies,
core.study_identifiers,
core.study_titles,
core.study_people,
core.study_organisations,
core.study_topics,
core.study_conditions,
core.study_icd,
core.study_features,
core.study_realationships,
core.study_countries,
core.study_locations,
core.study_search,

Study_object_links is not created by CoreStudyTableBuilder (but CoreObjectTableBuilder creates it).

The CoreBuilder calls the methods from the CoreStudyTableBuilder.

The CoreBuilder class also calls CoreObjectTableBuilder, which creates 11 object_* tables, 1 data_objects table, and 1 study_object_links. Still, there is the study_search create table, but is not used ( Rider marks it as an unused method using the grey color for the method name) :

core.data_objects,
core.object_datasets,
core.object_dates,
core.object_instances,
core.object_people,
core.object_organisations,
core.object_titles,
core.object_topics,
core.object_descriptions,
core.object_identifiers,
core.object_relationships,
core.object_rights,
core.study_object_links,
core.study_search (again..? )

So CoreStudyTableBuilder and CoreObjectTableBuilder are responsible of the creation of 26 tables (13 tables each).

The other two tables present are new_search_studies and new_search_studies_json. These two tables do not look to be created by the Aggregator.

A possible solution could be to import it from an older MDR version.

michelescarlato commented 2 months ago

Looking into the logs, it is noticeable that the table disappeared since 22nd July 2024:

:~/Downloads/postgresql_download$grep -i "new_search_objects" *.log

postgresql-2024-07-22_000000.log:2024-07-22 15:55:01.701 UTC [9748] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-07-22_000000.log:           inner join core.new_search_objects os
postgresql-2024-07-22_000000.log:2024-07-22 15:55:23.468 UTC [6420] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-07-22_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-04_000000.log:2024-08-04 15:00:03.910 UTC [12980] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-04_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-11_000000.log:2024-08-11 15:00:02.695 UTC [5616] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-11_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-25_000000.log:2024-08-25 15:00:02.915 UTC [13340] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-25_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-30_000000.log:2024-08-30 19:15:20.482 UTC [16832] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-30_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-31_000000.log:2024-08-31 17:01:54.277 UTC [6644] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-31_000000.log:           inner join core.new_search_objects os
postgresql-2024-08-31_000000.log:2024-08-31 19:31:06.280 UTC [10684] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-08-31_000000.log:           inner join core.new_search_objects os
postgresql-2024-09-01_000000.log:2024-09-01 15:00:02.870 UTC [4108] ERROR:  relation "core.new_search_objects" does not exist at character 76
postgresql-2024-09-01_000000.log:           inner join core.new_search_objects os

michelescarlato commented 2 months ago

To track the creation and deletion of tables in the future, DDL logging can be enabled in postgresql.conf file. This way, any CREATE, DROP, ALTER operations on tables will be logged.

To do so the line of the postgresql.conf file:

#log_statement = 'none'         # none, ddl, mod, all

should be changed in

log_statement = 'ddl' # Logs all DDL (Data Definition Language) commands, such as CREATE, ALTER, DROP

and the DB needs to be restarted.

@sergiocontrino Sergio, what do you think about it ?

michelescarlato commented 2 months ago

Impact analysis

Enabling log_statement in PostgreSQL can provide valuable information for auditing, debugging, or tracking database activity. However, it also has potential drawbacks depending on the logging level you choose. Let’s walk through the impact of each setting and discuss the potential drawbacks.

`log_statement` Options

none (default): No SQL statements are logged.
ddl: Only logs Data Definition Language (DDL) statements (e.g., CREATE, ALTER, DROP).
mod: Logs all ddl statements plus Data Modification Language (DML) statements (e.g., INSERT, UPDATE, DELETE, TRUNCATE).
all: Logs all SQL statements, including SELECT queries and DDL/DML operations.

Potential Drawbacks for Each Level of `log_statement`

log_statement = 'ddl' (Logs only DDL operations)
- Use Case: This is the most common option when you want to track changes to the database schema, such as CREATE, ALTER, and DROP statements, without too much noise from DML queries.
- Drawbacks:
  - Minimal performance overhead because only DDL statements are logged, which are usually not frequent.
  - Log file growth will be moderate since DDL operations typically don't occur as often as other types of queries.
  - May not capture important DML actions (like INSERT, UPDATE, DELETE) that could affect data integrity or performance.
Recommendation: This is generally safe for most production environments where you want to track changes to database schema without affecting performance significantly.
log_statement = 'mod' (Logs DDL and DML operations)
- Use Case: This is useful if you want to track both schema changes and modifications to the data (e.g., INSERT, UPDATE, DELETE, TRUNCATE).
- Drawbacks:
  - Performance impact: Logging DML statements can introduce a slight performance overhead, especially in write-heavy systems (many INSERT or UPDATE operations).
  - Log file growth: The log files can grow rapidly on databases with frequent data changes, especially if you have high-volume transactions like batch inserts or updates.
  - Disk space: If not managed, excessive logging can quickly consume disk space, leading to potential operational issues.
  - Log noise: In large, active databases, sifting through logs for meaningful information can be challenging.
Recommendation: This is suitable if you need more detailed auditing, but you must have a strategy for managing log size (e.g., log rotation and archiving).
log_statement = 'all' (Logs everything, including SELECT queries)
- Use Case: Useful for auditing, debugging, or performance tuning where you need to log every query, including SELECT statements.
- Drawbacks:
  - Significant performance impact: Logging every single query (including SELECT statements) can considerably affect the performance of the database, especially on read-heavy systems. Every query will be written to the log, including frequent or large SELECT queries.
  - High disk usage: Log files will grow very quickly, especially on databases with a lot of read operations (SELECT). This can result in massive log files that need to be regularly rotated and archived.
  - Storage overhead: If log rotation isn't configured properly, the server could run out of disk space due to excessively large log files.
  - Difficult to parse: With large volumes of log data, it can be difficult to find the specific information you’re looking for (e.g., a CREATE TABLE statement in a sea of SELECT queries).
Recommendation: log_statement = 'all' should be used very carefully in production environments. It’s best suited for debugging or development environments where performance isn’t critical, or for short-term use in production if you are trying to troubleshoot a specific issue.

Key Considerations When Enabling `log_statement`

Performance Overhead:
- Logging every SQL statement introduces overhead because each query must be written to disk. The impact depends on your database’s workload:
  - DDL (minimal impact): DDL changes are usually infrequent and have minimal impact on performance.
  - DML (moderate impact): Logging INSERT, UPDATE, and DELETE statements can slow down databases with frequent data modifications.
  - SELECT (high impact): On read-heavy databases, logging all SELECT statements can add significant overhead and degrade performance.
Disk Space Usage:
- The more you log, the faster your log files will grow. Without proper log rotation and archiving, you may quickly run out of disk space.
- To mitigate this, ensure that log rotation is configured. You can control log file size and retention with the following settings in postgresql.conf:
  - log_rotation_size: Rotate logs when they reach a certain size.
  - log_rotation_age: Rotate logs after a specific time period.
  - log_truncate_on_rotation: Truncate logs when they are rotated instead of appending.
Example settings for log rotation:
```
log_rotation_size = '100MB'          # Rotate logs when they reach 100MB
log_rotation_age = '1d'              # Rotate logs daily
log_truncate_on_rotation = on        # Truncate logs on rotation
```
Log Management:
- Make sure you have an automated log rotation or archiving policy in place to manage disk space and ensure logs don’t overwhelm your system.
- You can also compress old logs or move them to another storage system for long-term retention.
Security Considerations:
- Logging SQL statements may expose sensitive data (e.g., passwords, sensitive user data) if queries include them as literals.
- If security is a concern, avoid logging queries that contain sensitive information, or consider anonymizing certain data before logging.
Audit Requirements:
- For audit trails or compliance, logging all DDL statements is often sufficient (log_statement = 'ddl').
- For tracking data changes, you can extend logging to include DML (log_statement = 'mod').

Recommended Settings for Most Use Cases

Production Environments:
- Minimal overhead, schema change tracking: log_statement = 'ddl' is a good balance between performance and visibility, as it logs all schema changes without logging every query.
- Moderate overhead, track data modifications: log_statement = 'mod' is useful if you want to track data changes but still avoid the overhead of logging SELECT queries.
Development/Debugging:
- If you need to track every query for debugging or performance tuning, you can temporarily enable log_statement = 'all'. Make sure to monitor performance and log file size while doing this, and disable it after you’ve collected enough information.

Summary

ddl: Logs only schema changes. Low overhead and generally recommended for production environments to track changes like CREATE, DROP, and ALTER statements.
mod: Logs schema changes and data modifications (e.g., INSERT, UPDATE, DELETE). Moderate overhead, useful for auditing changes to data.
all: Logs everything, including SELECT statements. High overhead, recommended only for temporary use in debugging scenarios.

To minimize potential performance and storage issues, consider your logging needs carefully, and implement log rotation policies to manage disk space usage. Let me know if you need more specific guidance based on your use case!

michelescarlato commented 2 months ago

I looked into the MDR Test server, but I can see that also new_search_studies and new_search_studies_json are not present (apart from the new_search_objects).

michelescarlato commented 2 months ago

I restored a backup found in MDR_Test, and I got only the following tables:

michelescarlato commented 1 month ago

The issue is that the function that should create the core.new_search_objects , (not hosted in the master branch), is commented

https://github.com/ecrin-github/MDR_Aggregator/blob/feature/bump-packages/TopLevelClasses/Aggregator.cs#L238

if (opts.do_indexes) // `phase 5` -->  `-X`
        {
            // There are two aspects of setting up search data. One is to create searchable
            // tables to respond to queries and filters. The other is to set up
            // suitable JSON fields to return to the UI in response to those queries.
            // The lup schemas from context, as well as the aggs schemas, are required.

            string core_conn_string = _credentials.GetConnectionString("mdr");
            List<string> aggs_schemas = new() { "st", "ob", "nk" };
            _monDatalayer.SetUpTempFTWs(_credentials, core_conn_string, "core", "aggs", aggs_schemas);
            List<string> ctx_schemas = new() { "lup" };
            _monDatalayer.SetUpTempFTWs(_credentials, core_conn_string, "core", "context", ctx_schemas);
            _loggingHelper.LogLine("FTW tables recreated");

            // Initial task is to create JSON versions of the object data (as part of this will be
            // incorporated into study json and tables, from where it can be returned when necessary).
            // Querying and filtering is almost always done against studies rather than objects - the
            // exception being PMIDs and even then it is studies that are returned.
            // Preparing object data is therefore focused on creating the JSON required. The routine
            // below generates both a 'full' JSON image of each object plus a much smaller JSON
            // fragment that will be returned within search results.

            CoreSearchBuilder csb = new CoreSearchBuilder(core_conn_string, _loggingHelper);

            //_loggingHelper.LogHeader("Creating JSON object data");
            // csb.CreateJSONObjectData();  

            // Tables are then created to hold data for querying in various ways

            _loggingHelper.LogHeader("Setting up study search tables");

            //csb.CreateIdentifierSearchDataTable();
            //csb.CreatePMIDSearchDataTable();
            //csb.CreateLexemeSearchDataTable();
            //csb.CreateCountrySearchDataTable();  

            // The study data json objects are then created

            _loggingHelper.LogHeader("Creating JSON study data");
            csb.CreateJSONStudyData();

            _loggingHelper.LogHeader("Creating data in search tables");
            csb.AddStudyJsonToSearchTables();
            csb.SwitchToNewTables();

            // Drop FTW schemas.

            _monDatalayer.DropTempFTWs(core_conn_string, "aggs", aggs_schemas);
            _monDatalayer.DropTempFTWs(core_conn_string, "context", ctx_schemas);
            _loggingHelper.LogLine("FTW tables dropped");
        }

I am trying commenting out:

csb.CreateJSONObjectData();  
csb.CreateIdentifierSearchDataTable();
csb.CreatePMIDSearchDataTable();
csb.CreateLexemeSearchDataTable();
csb.CreateCountrySearchDataTable();

ecrin-github / MDR_Aggregator