Ericsson / CodeCompass

CodeCompass is a software comprehension tool for large scale software written in C/C++ and Java
https://codecompass.net
GNU General Public License v3.0
497 stars 96 forks source link

Multiple CppFunction entities #720

Open mcserep opened 3 months ago

mcserep commented 3 months ago

As @dbukki discovered in https://github.com/Ericsson/CodeCompass/pull/714#issuecomment-1983388202, multiple CppFunction entities are persisted for each function: one for each declaration and one for the definition.

It should be examined whether these multiple entities are truly required or not. Key aspects to focus on during the examination:

dbukki commented 3 months ago

Current state

Depending on the kind of CppEntity being parsed, declarations and instantiations/calls may or may not persist additional (duplicate) entities for the relevant pieces of code into the database. Here are my observations on the current state of our master branch (without any modifications):


Variables

This code produces 3 CppVariable entities, even though they are all technically just declarations:

extern int v0;
extern int v0;
extern int v0;

This code also produces 3 CppVariable entities, even though only one of them is the definition:

extern int v1;
       int v1;
extern int v1;

Note that an AST node is also generated for each of them, so navigation on the UI is working fine.

Records

This code produces no CppRecord entities, as there is no definition:

struct s0;
struct s0;
struct s0;

This code produces 1 CppRecord entity for the one definition:

struct s1;
struct s1 {};
struct s1;

Note that an AST node is still generated for each of them regardless of definition, so navigation on the UI is working fine.

Functions

This code produces 3 CppFunction entities, even though they are all declarations:

void f0();
void f0();
void f0();

This code also produces 3 CppFunction entities, even though only one of them is the definition:

void f1();
void f1() {}
void f1();

Note that an AST node is also generated for each declaration and definition as well, so navigation on the UI is working fine.

This code produces 4 CppFunction entities for f2, and 4 CppVariable entities for p0, p1, p2 and the unnamed parameter.

void f2(int p0);
void f2(int p1) {}
void f2(int p2);
void f2(int);

Interesting side note: When I select either of the f2 variants on the UI, the Info Tree displays p0 as the parameter of f2, regardless of which declaration I select. It seems that parameter info are queried on a "first entity that matches the function hash" basis. This means that getting rid of multiple CppFunction entities would probably still result in a correct match. If we only generated CppVariable entities for the definition, that would also always result in that name being returned which will actually be used in the function's body. This would be more logical here, as the parameter whose value would actually be used is p1 in this case.

This code produces 5 CppFunction entities for f3, and 5 CppVariable entities for p.

void f3(int p);     // in a shared main.h
void f3(int p) {}   // in main.cpp, including main.h
void f3(int p = 0); // in def1.cpp, including main.h
void f3(int p = 1); // in def2.cpp, including main.h
void f3(int p = 2); // in def3.cpp, including main.h

Again, an interesting side note here is that regardless of which declaration I select, the default values are not displayed on the Info Tree anyway, so it probably wouldn't matter if the multiple definitions of p were reduced to just one. (Unless the lack of default argument display is already an issue on its own...)


Templates

A similar pattern can be observed with records and functions when making them into templates.

Template records

The following code produces no CppRecord entities, as there is no definition:

template<typename T> struct ts0;
template<typename T> struct ts0;
template<typename T> struct ts0;

The following code produces 1 CppRecord entity for the one definition:

template<typename T> struct ts1;
template<typename T> struct ts1 { T value; };
template<typename T> struct ts1;

Furthermore, any distinct instantiations of ts1 result in one additional CppRecord entity for the specialized types (3 in this case):

static void use_ts1()
{
    ts1<bool > lv0a;// 1 CppRecord for ts1<bool>
    ts1<bool > lv0b;// ts1<bool> already accounted for; no record
    ts1<long > lv1a;// 1 CppRecord for ts1<long>
    ts1<long > lv1b;// ts1<long> already accounted for; no record
    ts1<float> lv2a;// 1 CppRecord for ts1<float>
    ts1<float> lv2b;// ts1<float> already accounted for; no record
}

Similarly, having different names for template parameters has no effect. The code below still produces 1 CppRecord entity:

template<typename T0> struct ts2;
template<typename T1> struct ts2;
template<typename T2> struct ts2;
template<typename T > struct ts2 { T value; };

And instantiating it produces another 3 CppRecord entities:

static void use_ts2()
{
    ts2<bool > lv0a;// 1 CppRecord for ts2<bool>
    ts2<bool > lv0b;// ts2<bool> already accounted for; no record
    ts2<long > lv1a;// 1 CppRecord for ts2<long>
    ts2<long > lv1b;// ts2<long> already accounted for; no record
    ts2<float> lv2a;// 1 CppRecord for ts2<float>
    ts2<float> lv2b;// ts2<float> already accounted for; no record
}

An interesting side node is that the Info Tree displays ts2 correctly regardless of which declaration I select. Also, when I inspect the type of its value member, it says T. This means that the query (correctly) returns the name corresponding to the definition, and not the alternatives in the declarations (e.g. T0, T1, or T2). This is in contrast with the behavior observed in non-template functions where the first occurrence of the parameter name is returned.

A similar logic applies to having different defaults for the template parameters. The following code produces 5 CppRecord entities: 1 for the definition in main.h and 4 for the instantiations in the cpp-s:

// in a shared main.h header:
template<typename T> struct ts3 { T value; }; // 1 CppRecord for ts3

// in main.cpp, including main.h:
template<typename T> struct ts3; // no CppRecord; declaration only
static void use_ts3()
{
    ts3<int> ta;// 1 CppRecord for ts3<int>
        ts3<int> tb;// ts3<int> already accounted for; no record
}

// in def1.cpp, including main.h:
template<typename T = bool> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
    ts3<> ta;// 1 CppRecord for ts3<bool>
    ts3<> tb;// ts3<bool> already accounted for; no record
}

// in def2.cpp, including main.h:
template<typename T = long> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
    ts3<> ta;// 1 CppRecord for ts3<long>
    ts3<> tb;// ts3<long> already accounted for; no record
}

// in def3.cpp, including main.h:
template<typename T = float> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
    ts3<> ta;// 1 CppRecord for ts3<float>
    ts3<> tb;// ts3<float> already accounted for; no record
}

Template functions

The following code produces 3 CppFunction entities, even though they are all just declarations:

template<typename T> void tf0();
template<typename T> void tf0();
template<typename T> void tf0();

The following code also produces 3 CppFunction entities, even though only one of them is the definition:

template<typename T> void tf1();
template<typename T> void tf1() { T value; }
template<typename T> void tf1();

Furthermore, just like with records, any distinct instantiation of tf1 adds an extra CppFunction (3 in this case):

static void use_tf1()
{
    tf1<bool >();// 1 CppFunction for tf1<bool>
    tf1<bool >();// tf1<bool> already accounted for; no function
    tf1<long >();// 1 CppFunction for tf1<long>
    tf1<long >();// tf1<long> already accounted for; no function
    tf1<float>();// 1 CppFunction for tf1<float>
    tf1<float>();// tf1<float> already accounted for; no function
}

As expected, changing the name of the template parameter does not change the behavior, the code below still produces 4 CppFunction entities:

template<typename T0> void tf2();
template<typename T1> void tf2();
template<typename T2> void tf2();
template<typename T > void tf2() { T value; }

And any distinct instantiations add further CppFunction entities (3 in this case):

static void use_tf2()
{
    tf2<bool >();// 1 CppFunction for tf2<bool>
    tf2<bool >();// tf2<bool> already accounted for; no function
    tf2<long >();// 1 CppFunction for tf2<long>
    tf2<long >();// tf2<long> already accounted for; no function
    tf2<float>();// 1 CppFunction for tf2<float>
    tf2<float>();// tf2<float> already accounted for; no function
}

An interesting side note is that whichever declaration of tf2 I select, the Info Tree displays most of the function correctly. However, there is a difference between selecting the definition and any of the declarations:

Finally, a rather curious thing happens when we have different defaults for the template parameter. The following code produces 6 CppFunction entities: 2 for the definition in main.h (!?) and 1 for the re-declaration in the cpp-s, but no entities for the instantiations/calls (unlike with tf2):

// in a shared main.h header:
template<typename T> void tf3(T&& t) { T value = t; } // this actually produces 2 CppFunction-s for some reason...

// in main.cpp, including main.h:
template<typename T> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
    tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
    tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
}

// in def1.cpp, including main.h:
template<typename T = bool> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<bool>
    tf3(0);// no CppFunction for the instantiation of tf3<bool>
}

// in def2.cpp, including main.h:
template<typename T = long> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<long>
    tf3(0);// no CppFunction for the instantiation of tf3<long>
}

// in def3.cpp, including main.h:
template<typename T = float> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<float>
    tf3(0);// no CppFunction for the instantiation of tf3<float>
}

I do not have a full explanation for why this happens. There might be more to this issue that has to do with the fact that certain re-declarations reside in different compilation units...

dbukki commented 3 months ago

Assumptions on correctness

In VisitRecordDecl, just before the creation of the CppRecord structure that will later be persisted, there is the following check:

if (!rd_->isThisDeclarationADefinition())
    return true;

This is the piece of code responsible for skipping anything that only needs to happen for the definition of a record. (Which is also why for every struct, only one CppRecord is present in the database, regardless of the number of declarations.)

Assuming that this is the correct behavior - which is what I would advocate for at first glance anyway - we could add the same early-return check into VisitFunctionDecl and VisitVarDecl as well, just before the creation of the respective database entities. Doing so would change the above statistics for variables and functions as follows:

Variables

This code now produces no CppVariable entities, (as they are all declarations):

extern int v0;
extern int v0;
extern int v0;

For v0, the Info Tree displays "Variable: int v0", with the Declaration and Usage sections appearing correctly (with correct navigation too). However, the Name, Qualified name and Type fields are not present, as those would apparently need to come from a CppVariable entity, which this variable does not have.

This code now produces only 1 CppVariable entity for the one definition:

extern int v1;
       int v1;
extern int v1;

For v1, the InfoTree displays all information about the variable as expected, with all fields present. Navigation works correctly regardless of whether a declaration or the definition is selected.

Functions

This code now produces no CppFunction entities (as they are all declarations):

void f0();
void f0();
void f0();

Similarly to v0, the Info Tree displays "Function: void f0()", with the Declaration and Usage sections appearing correctly (with correct navigation too). However, the Name, Qualified name and Signature fields are not present, as those would apparently need to come from a CppFunction entity, which this function does not have.

This code now produces 1 CppFunction entity for the one definition:

void f1();
void f1() {}
void f1();

Similarly to v1, the Info Tree displays all information about f1 as expected, with all fields present. Navigation works correctly regardless of whether a declaration or the definition is selected.

This code now also produces 1 CppFunction entity for f2, and 4 CppVariable entities for p0, p1, p2 and the unnamed parameter.

void f2(int p0);
void f2(int p1) {}
void f2(int p2);
void f2(int);

Interesting side note: The Parameters section (showing p1) is only displayed when I select the definition. Apart from this, selecting the definition or a declaration produces the same behavior on the Info Tree.

This code now also produces 1 CppFunction entity for f3, and 5 CppVariable entities for p.

void f3(int p);     // in a shared main.h
void f3(int p) {}   // in main.cpp, including main.h
void f3(int p = 0); // in def1.cpp, including main.h
void f3(int p = 1); // in def2.cpp, including main.h
void f3(int p = 2); // in def3.cpp, including main.h

A similar observation can be made here as with f2: The Parameters section is only displayed when selecting the definition. Apart from that, everything is the same and works just fine. Default values of parameters are still not displayed anywhere, just like in the baseline case (without the modifications in this post).

Template functions

The following code now produces no CppFunction entities (as they are all declarations):

template<typename T> void tf0();
template<typename T> void tf0();
template<typename T> void tf0();

Info Tree behavior is same as with f0.

The following code now produces 1 CppFunction entity for the one definition:

template<typename T> void tf1();
template<typename T> void tf1() { T value; }
template<typename T> void tf1();

Info Tree behavior is same as with f1. Furthermore, just like in the baseline case, any distinct instantiation of tf1 adds an extra CppFunction (3 in this case):

static void use_tf1()
{
    tf1<bool >();// 1 CppFunction for tf1<bool>
    tf1<bool >();// tf1<bool> already accounted for; no function
    tf1<long >();// 1 CppFunction for tf1<long>
    tf1<long >();// tf1<long> already accounted for; no function
    tf1<float>();// 1 CppFunction for tf1<float>
    tf1<float>();// tf1<float> already accounted for; no function
}

As expected, changing the name of the template parameter does not change the behavior, the code below still produces 1 CppFunction entity:

template<typename T0> void tf2();
template<typename T1> void tf2();
template<typename T2> void tf2();
template<typename T > void tf2() { T value; }

And any distinct instantiations add further CppFunction entities (3 in this case):

static void use_tf2()
{
    tf2<bool >();// 1 CppFunction for tf2<bool>
    tf2<bool >();// tf2<bool> already accounted for; no function
    tf2<long >();// 1 CppFunction for tf2<long>
    tf2<long >();// tf2<long> already accounted for; no function
    tf2<float>();// 1 CppFunction for tf2<float>
    tf2<float>();// tf2<float> already accounted for; no function
}

Everything that could be said about the behavior of the Info Tree in the baseline case also applies here:

Finally, the "curious" case for when we have different defaults for the template parameter: The following code produces 2 CppFunction entities or the definition in main.h (!?) and no further entities for the re-declarations in the cpp-s or the instantiations/calls (unlike with tf2):

// in a shared main.h header:
template<typename T> void tf3(T&& t) { T value = t; } // this actually produces 2 CppFunction-s for some reason...

// in main.cpp, including main.h:
template<typename T> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
    tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
    tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
}

// in def1.cpp, including main.h:
template<typename T = bool> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<bool>
    tf3(0);// no CppFunction for the instantiation of tf3<bool>
}

// in def2.cpp, including main.h:
template<typename T = long> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<long>
    tf3(0);// no CppFunction for the instantiation of tf3<long>
}

// in def3.cpp, including main.h:
template<typename T = float> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
    tf3(0);// no CppFunction for the instantiation of tf3<float>
    tf3(0);// no CppFunction for the instantiation of tf3<float>
}

Database size

Apart from the behavioral changes listed above, eliminating duplicate definitions of variables and functions also affects the database size. I did a file size comparison using the SQLite database configuration for TinyXML and Xerces-C before and after applying the changes outlined at the beginning of this post. The results are as follows (sizes are given in bytes):

TinyXML: Original: 8 785 920 Modified: 8 552 448 Redundancy: 233 472 (2.66%)

Xerces-C: Original: 217 526 272 Modified: 210 067 456 Redundancy: 7 458 816 (3.43%)

But perhaps more interesting is the comparison between the sizes of the CppVariable and CppFunction tables themselves (e.g. number of records) in these repositories:

TinyXML: Entity Original Modified Redundancy Redundancy (%)
Variables 3159 3122 37 1.17%
Functions 1615 647 968 59.94%
Xerces-C: Entity Original Modified Redundancy Redundancy (%)
Variables 66051 64947 1104 1.67%
Functions 34858 15717 19141 54.91%

Even though the redundancy is not as pronounced in overall database size as I would have expected, it is definitely significant in the size of the CppFunction table. If applying such simple changes could get rid of over half of the records in this table, then any query iterating over these records would instantly get twice as fast. As the number of metrics added to CodeCompass increases, so does the potential number of queries that might access (iterate or join) the CppFunction table at some point. And then this difference is definitely going to be significant in the runtime of the metrics plugin.

dbukki commented 2 months ago

The above statistics were exported from my diagnostics branch created specifically for this purpose: https://github.com/dbukki/CodeCompass/tree/diagnostics

Note: The changes on this branch are meant to serve diagnostic purposes of hypothetical scenarios only (i.e. what would happen to CodeCompass if...?). As expected, for the reasons already outlined in https://github.com/Ericsson/CodeCompass/pull/714#issuecomment-1983388202, these changes make certain unit tests fail, thus making the CI for the whole branch fail as well.

The sample codes I used to analyze the CC database can be found at https://github.com/dbukki/cc-tests/tree/master/decl_def