Open mcserep opened 8 months ago
Depending on the kind of CppEntity
being parsed, declarations and instantiations/calls may or may not persist additional (duplicate) entities for the relevant pieces of code into the database.
Here are my observations on the current state of our master branch (without any modifications):
This code produces 3 CppVariable
entities, even though they are all technically just declarations:
extern int v0;
extern int v0;
extern int v0;
This code also produces 3 CppVariable
entities, even though only one of them is the definition:
extern int v1;
int v1;
extern int v1;
Note that an AST node is also generated for each of them, so navigation on the UI is working fine.
This code produces no CppRecord
entities, as there is no definition:
struct s0;
struct s0;
struct s0;
This code produces 1 CppRecord
entity for the one definition:
struct s1;
struct s1 {};
struct s1;
Note that an AST node is still generated for each of them regardless of definition, so navigation on the UI is working fine.
This code produces 3 CppFunction
entities, even though they are all declarations:
void f0();
void f0();
void f0();
This code also produces 3 CppFunction
entities, even though only one of them is the definition:
void f1();
void f1() {}
void f1();
Note that an AST node is also generated for each declaration and definition as well, so navigation on the UI is working fine.
This code produces 4 CppFunction
entities for f2
, and 4 CppVariable
entities for p0
, p1
, p2
and the unnamed parameter.
void f2(int p0);
void f2(int p1) {}
void f2(int p2);
void f2(int);
Interesting side note: When I select either of the f2
variants on the UI, the Info Tree displays p0
as the parameter of f2
, regardless of which declaration I select. It seems that parameter info are queried on a "first entity that matches the function hash" basis. This means that getting rid of multiple CppFunction
entities would probably still result in a correct match. If we only generated CppVariable
entities for the definition, that would also always result in that name being returned which will actually be used in the function's body. This would be more logical here, as the parameter whose value would actually be used is p1
in this case.
This code produces 5 CppFunction
entities for f3
, and 5 CppVariable
entities for p
.
void f3(int p); // in a shared main.h
void f3(int p) {} // in main.cpp, including main.h
void f3(int p = 0); // in def1.cpp, including main.h
void f3(int p = 1); // in def2.cpp, including main.h
void f3(int p = 2); // in def3.cpp, including main.h
Again, an interesting side note here is that regardless of which declaration I select, the default values are not displayed on the Info Tree anyway, so it probably wouldn't matter if the multiple definitions of p
were reduced to just one. (Unless the lack of default argument display is already an issue on its own...)
A similar pattern can be observed with records and functions when making them into templates.
The following code produces no CppRecord
entities, as there is no definition:
template<typename T> struct ts0;
template<typename T> struct ts0;
template<typename T> struct ts0;
The following code produces 1 CppRecord
entity for the one definition:
template<typename T> struct ts1;
template<typename T> struct ts1 { T value; };
template<typename T> struct ts1;
Furthermore, any distinct instantiations of ts1
result in one additional CppRecord
entity for the specialized types (3 in this case):
static void use_ts1()
{
ts1<bool > lv0a;// 1 CppRecord for ts1<bool>
ts1<bool > lv0b;// ts1<bool> already accounted for; no record
ts1<long > lv1a;// 1 CppRecord for ts1<long>
ts1<long > lv1b;// ts1<long> already accounted for; no record
ts1<float> lv2a;// 1 CppRecord for ts1<float>
ts1<float> lv2b;// ts1<float> already accounted for; no record
}
Similarly, having different names for template parameters has no effect. The code below still produces 1 CppRecord
entity:
template<typename T0> struct ts2;
template<typename T1> struct ts2;
template<typename T2> struct ts2;
template<typename T > struct ts2 { T value; };
And instantiating it produces another 3 CppRecord
entities:
static void use_ts2()
{
ts2<bool > lv0a;// 1 CppRecord for ts2<bool>
ts2<bool > lv0b;// ts2<bool> already accounted for; no record
ts2<long > lv1a;// 1 CppRecord for ts2<long>
ts2<long > lv1b;// ts2<long> already accounted for; no record
ts2<float> lv2a;// 1 CppRecord for ts2<float>
ts2<float> lv2b;// ts2<float> already accounted for; no record
}
An interesting side node is that the Info Tree displays ts2
correctly regardless of which declaration I select. Also, when I inspect the type of its value
member, it says T
. This means that the query (correctly) returns the name corresponding to the definition, and not the alternatives in the declarations (e.g. T0
, T1
, or T2
). This is in contrast with the behavior observed in non-template functions where the first occurrence of the parameter name is returned.
A similar logic applies to having different defaults for the template parameters.
The following code produces 5 CppRecord
entities: 1 for the definition in main.h and 4 for the instantiations in the cpp-s:
// in a shared main.h header:
template<typename T> struct ts3 { T value; }; // 1 CppRecord for ts3
// in main.cpp, including main.h:
template<typename T> struct ts3; // no CppRecord; declaration only
static void use_ts3()
{
ts3<int> ta;// 1 CppRecord for ts3<int>
ts3<int> tb;// ts3<int> already accounted for; no record
}
// in def1.cpp, including main.h:
template<typename T = bool> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
ts3<> ta;// 1 CppRecord for ts3<bool>
ts3<> tb;// ts3<bool> already accounted for; no record
}
// in def2.cpp, including main.h:
template<typename T = long> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
ts3<> ta;// 1 CppRecord for ts3<long>
ts3<> tb;// ts3<long> already accounted for; no record
}
// in def3.cpp, including main.h:
template<typename T = float> struct ts3;// no CppRecord; declaration only
static void use_ts3()
{
ts3<> ta;// 1 CppRecord for ts3<float>
ts3<> tb;// ts3<float> already accounted for; no record
}
The following code produces 3 CppFunction
entities, even though they are all just declarations:
template<typename T> void tf0();
template<typename T> void tf0();
template<typename T> void tf0();
The following code also produces 3 CppFunction
entities, even though only one of them is the definition:
template<typename T> void tf1();
template<typename T> void tf1() { T value; }
template<typename T> void tf1();
Furthermore, just like with records, any distinct instantiation of tf1
adds an extra CppFunction
(3 in this case):
static void use_tf1()
{
tf1<bool >();// 1 CppFunction for tf1<bool>
tf1<bool >();// tf1<bool> already accounted for; no function
tf1<long >();// 1 CppFunction for tf1<long>
tf1<long >();// tf1<long> already accounted for; no function
tf1<float>();// 1 CppFunction for tf1<float>
tf1<float>();// tf1<float> already accounted for; no function
}
As expected, changing the name of the template parameter does not change the behavior, the code below still produces 4 CppFunction
entities:
template<typename T0> void tf2();
template<typename T1> void tf2();
template<typename T2> void tf2();
template<typename T > void tf2() { T value; }
And any distinct instantiations add further CppFunction
entities (3 in this case):
static void use_tf2()
{
tf2<bool >();// 1 CppFunction for tf2<bool>
tf2<bool >();// tf2<bool> already accounted for; no function
tf2<long >();// 1 CppFunction for tf2<long>
tf2<long >();// tf2<long> already accounted for; no function
tf2<float>();// 1 CppFunction for tf2<float>
tf2<float>();// tf2<float> already accounted for; no function
}
An interesting side note is that whichever declaration of tf2
I select, the Info Tree displays most of the function correctly. However, there is a difference between selecting the definition and any of the declarations:
use_tf2
.use_tf2
when the definition is selected.Finally, a rather curious thing happens when we have different defaults for the template parameter.
The following code produces 6 CppFunction
entities: 2 for the definition in main.h (!?) and 1 for the re-declaration in the cpp-s, but no entities for the instantiations/calls (unlike with tf2
):
// in a shared main.h header:
template<typename T> void tf3(T&& t) { T value = t; } // this actually produces 2 CppFunction-s for some reason...
// in main.cpp, including main.h:
template<typename T> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
}
// in def1.cpp, including main.h:
template<typename T = bool> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<bool>
tf3(0);// no CppFunction for the instantiation of tf3<bool>
}
// in def2.cpp, including main.h:
template<typename T = long> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<long>
tf3(0);// no CppFunction for the instantiation of tf3<long>
}
// in def3.cpp, including main.h:
template<typename T = float> void tf3(T&& t);// 1 CppFunction for the declaration
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<float>
tf3(0);// no CppFunction for the instantiation of tf3<float>
}
I do not have a full explanation for why this happens. There might be more to this issue that has to do with the fact that certain re-declarations reside in different compilation units...
In VisitRecordDecl
, just before the creation of the CppRecord
structure that will later be persisted, there is the following check:
if (!rd_->isThisDeclarationADefinition())
return true;
This is the piece of code responsible for skipping anything that only needs to happen for the definition of a record. (Which is also why for every struct, only one CppRecord
is present in the database, regardless of the number of declarations.)
Assuming that this is the correct behavior - which is what I would advocate for at first glance anyway - we could add the same early-return check into VisitFunctionDecl
and VisitVarDecl
as well, just before the creation of the respective database entities. Doing so would change the above statistics for variables and functions as follows:
This code now produces no CppVariable
entities, (as they are all declarations):
extern int v0;
extern int v0;
extern int v0;
For v0
, the Info Tree displays "Variable: int v0", with the Declaration and Usage sections appearing correctly (with correct navigation too). However, the Name, Qualified name and Type fields are not present, as those would apparently need to come from a CppVariable entity, which this variable does not have.
This code now produces only 1 CppVariable
entity for the one definition:
extern int v1;
int v1;
extern int v1;
For v1
, the InfoTree displays all information about the variable as expected, with all fields present. Navigation works correctly regardless of whether a declaration or the definition is selected.
This code now produces no CppFunction
entities (as they are all declarations):
void f0();
void f0();
void f0();
Similarly to v0
, the Info Tree displays "Function: void f0()", with the Declaration and Usage sections appearing correctly (with correct navigation too). However, the Name, Qualified name and Signature fields are not present, as those would apparently need to come from a CppFunction
entity, which this function does not have.
This code now produces 1 CppFunction
entity for the one definition:
void f1();
void f1() {}
void f1();
Similarly to v1
, the Info Tree displays all information about f1
as expected, with all fields present. Navigation works correctly regardless of whether a declaration or the definition is selected.
This code now also produces 1 CppFunction
entity for f2
, and 4 CppVariable
entities for p0
, p1
, p2
and the unnamed parameter.
void f2(int p0);
void f2(int p1) {}
void f2(int p2);
void f2(int);
Interesting side note: The Parameters section (showing p1
) is only displayed when I select the definition. Apart from this, selecting the definition or a declaration produces the same behavior on the Info Tree.
This code now also produces 1 CppFunction
entity for f3
, and 5 CppVariable
entities for p
.
void f3(int p); // in a shared main.h
void f3(int p) {} // in main.cpp, including main.h
void f3(int p = 0); // in def1.cpp, including main.h
void f3(int p = 1); // in def2.cpp, including main.h
void f3(int p = 2); // in def3.cpp, including main.h
A similar observation can be made here as with f2
: The Parameters section is only displayed when selecting the definition. Apart from that, everything is the same and works just fine. Default values of parameters are still not displayed anywhere, just like in the baseline case (without the modifications in this post).
The following code now produces no CppFunction
entities (as they are all declarations):
template<typename T> void tf0();
template<typename T> void tf0();
template<typename T> void tf0();
Info Tree behavior is same as with f0
.
The following code now produces 1 CppFunction
entity for the one definition:
template<typename T> void tf1();
template<typename T> void tf1() { T value; }
template<typename T> void tf1();
Info Tree behavior is same as with f1
.
Furthermore, just like in the baseline case, any distinct instantiation of tf1
adds an extra CppFunction
(3 in this case):
static void use_tf1()
{
tf1<bool >();// 1 CppFunction for tf1<bool>
tf1<bool >();// tf1<bool> already accounted for; no function
tf1<long >();// 1 CppFunction for tf1<long>
tf1<long >();// tf1<long> already accounted for; no function
tf1<float>();// 1 CppFunction for tf1<float>
tf1<float>();// tf1<float> already accounted for; no function
}
As expected, changing the name of the template parameter does not change the behavior, the code below still produces 1 CppFunction
entity:
template<typename T0> void tf2();
template<typename T1> void tf2();
template<typename T2> void tf2();
template<typename T > void tf2() { T value; }
And any distinct instantiations add further CppFunction
entities (3 in this case):
static void use_tf2()
{
tf2<bool >();// 1 CppFunction for tf2<bool>
tf2<bool >();// tf2<bool> already accounted for; no function
tf2<long >();// 1 CppFunction for tf2<long>
tf2<long >();// tf2<long> already accounted for; no function
tf2<float>();// 1 CppFunction for tf2<float>
tf2<float>();// tf2<float> already accounted for; no function
}
Everything that could be said about the behavior of the Info Tree in the baseline case also applies here:
Finally, the "curious" case for when we have different defaults for the template parameter:
The following code produces 2 CppFunction
entities or the definition in main.h (!?) and no further entities for the re-declarations in the cpp-s or the instantiations/calls (unlike with tf2
):
// in a shared main.h header:
template<typename T> void tf3(T&& t) { T value = t; } // this actually produces 2 CppFunction-s for some reason...
// in main.cpp, including main.h:
template<typename T> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
tf3<int>(0);// no CppFunction for the instantiation of tf3<int>
}
// in def1.cpp, including main.h:
template<typename T = bool> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<bool>
tf3(0);// no CppFunction for the instantiation of tf3<bool>
}
// in def2.cpp, including main.h:
template<typename T = long> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<long>
tf3(0);// no CppFunction for the instantiation of tf3<long>
}
// in def3.cpp, including main.h:
template<typename T = float> void tf3(T&& t);// no CppFunction; declaration only
static void use_tf3()
{
tf3(0);// no CppFunction for the instantiation of tf3<float>
tf3(0);// no CppFunction for the instantiation of tf3<float>
}
Apart from the behavioral changes listed above, eliminating duplicate definitions of variables and functions also affects the database size. I did a file size comparison using the SQLite database configuration for TinyXML and Xerces-C before and after applying the changes outlined at the beginning of this post. The results are as follows (sizes are given in bytes):
TinyXML: Original: 8 785 920 Modified: 8 552 448 Redundancy: 233 472 (2.66%)
Xerces-C: Original: 217 526 272 Modified: 210 067 456 Redundancy: 7 458 816 (3.43%)
But perhaps more interesting is the comparison between the sizes of the CppVariable and CppFunction tables themselves (e.g. number of records) in these repositories:
TinyXML: | Entity | Original | Modified | Redundancy | Redundancy (%) |
---|---|---|---|---|---|
Variables | 3159 | 3122 | 37 | 1.17% | |
Functions | 1615 | 647 | 968 | 59.94% |
Xerces-C: | Entity | Original | Modified | Redundancy | Redundancy (%) |
---|---|---|---|---|---|
Variables | 66051 | 64947 | 1104 | 1.67% | |
Functions | 34858 | 15717 | 19141 | 54.91% |
Even though the redundancy is not as pronounced in overall database size as I would have expected, it is definitely significant in the size of the CppFunction
table. If applying such simple changes could get rid of over half of the records in this table, then any query iterating over these records would instantly get twice as fast. As the number of metrics added to CodeCompass increases, so does the potential number of queries that might access (iterate or join) the CppFunction
table at some point. And then this difference is definitely going to be significant in the runtime of the metrics plugin.
The above statistics were exported from my diagnostics branch created specifically for this purpose: https://github.com/dbukki/CodeCompass/tree/diagnostics
Note: The changes on this branch are meant to serve diagnostic purposes of hypothetical scenarios only (i.e. what would happen to CodeCompass if...?). As expected, for the reasons already outlined in https://github.com/Ericsson/CodeCompass/pull/714#issuecomment-1983388202, these changes make certain unit tests fail, thus making the CI for the whole branch fail as well.
The sample codes I used to analyze the CC database can be found at https://github.com/dbukki/cc-tests/tree/master/decl_def
As @dbukki discovered in https://github.com/Ericsson/CodeCompass/pull/714#issuecomment-1983388202, multiple
CppFunction
entities are persisted for each function: one for each declaration and one for the definition.It should be examined whether these multiple entities are truly required or not. Key aspects to focus on during the examination:
CppFunction
entity for declarations, or maybe is it enough to store one for the definition? What will happen in such a case when a function does not have a definition, but only declaration? Would the Info Tree still be able to show the available information about the function?