Closed mcserep closed 5 months ago
I have a question. How should we handle nested classes?
class A {
void f();
void g();
class B {
void h();
}
};
Should we add the complexity of B
(coming from h
) to A
?
Conceptually, I like the idea to penalize containing classes for each of their inner class since they fall under them as a logical unit. This would also discourage using too much class nesting overall.
However, from a practical standpoint, it might be inconvenient for the user to have complexities counted twice or even more times in case of deeper nesting, e.g. A::B::C
.
@mcserep @intjftw What's your opinion?
@Seeker04 As far as I remember we counted "nested functions" (lamdas) towards a function's McCabe metric in #689, since it increases the complexity of that function to understand. (Should be checked.)
With the same argument a nested type increases the complexity of the container type, therefore I would count it in.
Progress
Running the following query on the parsing result of tinyxml2
select astValue, value from (CppAstNodeMetrics m join CppAstNode n on m.astNodeId = n.id) where type = 3;
yields the following table:
class DepthTracker | 3.0
class DynArray | 13.0
class DynArray | 14.0
class DynArray | 14.0
class DynArray | 14.0
class DynArray | 14.0
class DynArray | 14.0
class DynArray | 18.0
class DynArray | 19.0
class DynArray | 37.0
class MemPool | 4.0
class MemPoolT : public MemPool | 13.0
class MemPoolT : public MemPool | 13.0
class MemPoolT : public MemPool | 13.0
class MemPoolT : public MemPool | 13.0
class MemPoolT : public MemPool | 16.0
class TINYXML2_LIB StrPair | 63.0
class TINYXML2_LIB XMLAttribute | 43.0
class TINYXML2_LIB XMLComment : public XMLNode | 13.0
class TINYXML2_LIB XMLConstHandle | 30.0
class TINYXML2_LIB XMLDeclaration : public XMLNode | 13.0
class TINYXML2_LIB XMLDocument : public XMLNode | 132.0
class TINYXML2_LIB XMLElement : public XMLNode | 169.0
class TINYXML2_LIB XMLHandle | 30.0
class TINYXML2_LIB XMLNode | 140.0
class TINYXML2_LIB XMLPrinter : public XMLVisitor | 86.0
class TINYXML2_LIB XMLText : public XMLNode | 21.0
class TINYXML2_LIB XMLUnknown : public XMLNode | 13.0
class TINYXML2_LIB XMLUtil | 110.0
class TINYXML2_LIB XMLVisitor | 12.0
struct Block | 4.0
struct Block | 4.0
struct Block | 4.0
struct Block | 4.0
struct Entity | 1.0
struct TestUtil: XMLVisitor | 15.0
union Item | 4.0
union Item | 4.0
union Item | 4.0
union Item | 4.0
Most values are correct, however there are a few problems:
DynArray
, MemPoolT
. Each template instantiation results in a class AST node, so it is understandable that we have duplicates with the current implementation as this has no special handling. However, mccabe values between the instantiations differ which is odd, because from the definition of these specific class templates, I don't see how different instantiations could have different mccabe values...Block
, Item
and Entity
are all classes without any methods so they shouldn't appear among these metrics. Block
and Item
are under template classes (that's also the reason they are all duplicated several times), however Entity
is not, so this issue is not exclusive to templates.
Code of Entity
:
struct Entity {
const char* pattern;
int length;
char value;
};
Note: In order to use valgrind, I needed to run ulimit -n 1024
, because the allowed number of open file descriptors was ridiculously high in the docker image by default: 1073741816
This solved the valgrind error:
valgrind: m_libcfile.c:66 (vgPlain_safe_fd): Assertion 'newfd >= VG_(fd_hard_limit)' failed.
Segmentation fault (core dumped)
The image was built using the Docker guide.
The failed assertion occurs because of this:
If the query executed using query_one() or query_value() returns more than one element, then these functions fail with an assertion.
In the current state the following query on tinyxml2:
select astValue, value from (CppAstNodeMetrics m join CppAstNode n on m.astNodeId = n.id) where type = 3;
yields:
class DepthTracker { | 2
class DynArray { | 13
class DynArray { | 14
class DynArray { | 14
class DynArray { | 14
class DynArray { | 14
class DynArray { | 14
class DynArray { | 18
class DynArray { | 19
class DynArray { | 37
class MemPool { | 2
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 16
class TINYXML2_LIB StrPair { | 63
class TINYXML2_LIB XMLAttribute { | 43
class TINYXML2_LIB XMLComment : public XMLNode { | 13
class TINYXML2_LIB XMLConstHandle { | 29
class TINYXML2_LIB XMLDeclaration : public XMLNode { | 13
class TINYXML2_LIB XMLDocument : public XMLNode { | 132
class TINYXML2_LIB XMLElement : public XMLNode { | 169
class TINYXML2_LIB XMLHandle { | 29
class TINYXML2_LIB XMLNode { | 140
class TINYXML2_LIB XMLPrinter : public XMLVisitor { | 86
class TINYXML2_LIB XMLText : public XMLNode { | 21
class TINYXML2_LIB XMLUnknown : public XMLNode { | 13
class TINYXML2_LIB XMLUtil { | 110
class TINYXML2_LIB XMLVisitor { | 9
struct TestUtil: XMLVisitor { | 13
Classes with only implicit methods are no longer present and the metric of the others are also correct now.
Template classes still cause some duplicates. A potential solution to that would be to use getTemplateSpecializationKind in the parser checking the TemplateSpecializationKind. Then a new model::Tag
could be added for entities (classes and perhaps functions too) which are implicit template instantiations (TSK_ImplicitInstantiation
). Then we could simply filter out any such classes from the summation.
As discussed today, first the types of the project should be considered and only then the methods of those types should be queried. This should be also more efficient.
I did some investigation. We see results like this for template classes:
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 13
class MemPoolT : public MemPool { | 16
because some methods are only generated for the original template AST node and not for the template instantiations.
In case of MemPoolT
, there are three such methods, hence the 13 and 16 values:
int tinyxml2::MemPoolT::ItemSize() const
int tinyxml2::MemPoolT<104>::ItemSize() const
int tinyxml2::MemPoolT<112>::ItemSize() const
int tinyxml2::MemPoolT<120>::ItemSize() const
int tinyxml2::MemPoolT<80>::ItemSize() const
tinyxml2::MemPoolT::MemPoolT<ITEM_SIZE>()
tinyxml2::MemPoolT<104>::MemPoolT()
tinyxml2::MemPoolT<112>::MemPoolT()
tinyxml2::MemPoolT<120>::MemPoolT()
tinyxml2::MemPoolT<80>::MemPoolT()
tinyxml2::MemPoolT::~MemPoolT<ITEM_SIZE>()
tinyxml2::MemPoolT<104>::~MemPoolT()
tinyxml2::MemPoolT<112>::~MemPoolT()
tinyxml2::MemPoolT<120>::~MemPoolT()
tinyxml2::MemPoolT<80>::~MemPoolT()
void * tinyxml2::MemPoolT::Alloc()
void * tinyxml2::MemPoolT<104>::Alloc()
void * tinyxml2::MemPoolT<112>::Alloc()
void * tinyxml2::MemPoolT<120>::Alloc()
void * tinyxml2::MemPoolT<80>::Alloc()
void tinyxml2::MemPoolT::Clear()
void tinyxml2::MemPoolT<104>::Clear()
void tinyxml2::MemPoolT<112>::Clear()
void tinyxml2::MemPoolT<120>::Clear()
void tinyxml2::MemPoolT<80>::Clear()
void tinyxml2::MemPoolT::Free(void *)
void tinyxml2::MemPoolT<104>::Free(void *)
void tinyxml2::MemPoolT<112>::Free(void *)
void tinyxml2::MemPoolT<120>::Free(void *)
void tinyxml2::MemPoolT<80>::Free(void *)
void tinyxml2::MemPoolT::SetTracked()
void tinyxml2::MemPoolT<104>::SetTracked()
void tinyxml2::MemPoolT<112>::SetTracked()
void tinyxml2::MemPoolT<120>::SetTracked()
void tinyxml2::MemPoolT<80>::SetTracked()
void tinyxml2::MemPoolT::Trace(const char *)
int tinyxml2::MemPoolT::CurrentAllocs() const
int tinyxml2::MemPoolT::Untracked() const
My first hunch was that these are the ones that does not depend on the template argument, but that's not true, because Trace()
uses the template argument, ITEM_SIZE
.
It looks like the method AST nodes for the template instantiations are only generated on demand, i.e., if they are used in the translation unit. Trace()
, CurrentAllocs()
and Untracked()
are not used anywhere in tinyxml2 (only in debug builds).
This would also explain the even more diverse set of values for DynArray
: 13, 14, 14, 14, 14, 14, 18, 19, 37, where 37 is the correct value. Different template instantiations might use a different set of methods which would require the compiler to generate a different number of method AST nodes for certain instantiations.
This is covered in [temp.inst] in the C++ standard (In C++11, this is §14.7.1/10. In C++14, it is §14.7.1/11, and in C++17 it is §17.7.1/9. Excerpt from C++17 below) - source
An implementation shall not implicitly instantiate a function template, a variable template, a member template, a non-virtual member function, a member class, a static data member of a class template, or a substatement of a constexpr if statement (9.4.1), unless such instantiation is required
The "shall not" part suggests that this is not enforced and is probably compiler dependant. Simpler/older compilers might generate all methods, but g++ 11.4.0 does not.
I still think, the best way to remove these duplicates would be to somehow include information about a record being a template class or template instantiation in the database (by tagging it, for example). Because there is no way to tell the templates and template instantiations apart by observing the current database (i.e., CppAstNode.astValue
, Entity.name
, Entity.qualifiedName
, none of these include the template argument list).
If we could distinguish between the template AST node and its instantiations, then we could simply only include the former in the type McCabe metrics.
The "shall not" part suggests that this is not enforced and is probably compiler dependant. Simpler/older compilers might generate all methods, but g++ 11.4.0 does not.
In programming language standards "shall" is the synonym for "must" and "shall not" is a synonym for "must not". "Should" and "should not" has the meaning of optionality.
(See e.g. ISO TR 10176, Information technology - Guidelines for the preparation of programming language standards for reference.
So it is not compiler dependent (and never was).
I still think, the best way to remove these duplicates would be to somehow include information about a record being a template class or template instantiation in the database (by tagging it, for example).
Let's proceed with that approach then.
Is it valid though that the complexity numbers are so different for these template method instantiations?
Yes, it is valid. The three methods which are not used by the template instantiations of MemPoolT
are the reason for the 13 values opposed to the 16 which appears for the AST node of the original template.
I implemented the logic that marks and then skips template instantiations - other than explicit specializations - during the metric calculation. It's in the latest commit: 0bf929a
Now, only the proper values appear for the template classes DynArray
and MemPoolT
. Also, inner classes of class templates such as Block
and Item
appear only once in the new database as another result.
There are still duplicate entities in xerces-c for the following query:
// Lookup the definition (different AST node if not defined in class body)
const auto methodDef = _ctx.db->query_one<AstNode>(
odb::query<AstNode>::entityHash == methodAstNode->entityHash &&
odb::query<AstNode>::symbolType == AstNode::SymbolType::Function &&
odb::query<AstNode>::astType == AstNode::AstType::Definition);
These duplicates can be extracted with the query:
select astValue, count(*) from CppAstNode
where symbolType = 1 and astType = 3
group by entityHash having count(*) > 1
order by astValue;
which yields the following result:
StrX::StrX(const StrX &) | 20
StrX::StrX(const XMLCh *const) | 21
StrX::~StrX() | 21
XStr::XStr(const XStr &) | 4
XStr::XStr(const char *const) | 4
XStr::~XStr() | 4
const XMLCh * XStr::unicodeForm() const | 4
const char * StrX::localForm() const | 21
decltype(~std::forward<_Tp>(__t)) std::bit_not<void>::operator()(_Tp &&) const | 2
int main() | 5
int main(int, char **) | 26
std::ostream & operator<<(std::ostream &, const StrX &) | 19
void tassert(_Bool, const char *, int) | 3
void usage() | 10
This is not a parsing defect, these functions indeed have the displayed number of definitions in the project. This happens, because they are compiled to different binaries, so ODR is not violated. They go to several test binaries, see tests/CMakeLists.txt.
An interesting observation is that the usages()
function has 10 static and 10 non-static definitions and only the non-static ones appear as duplicates in the query. This means that the static functions have their translation unit mangled to their hash, however non-static ones and member functions do not. The hash is generated in getUSR()
using generateUSRForDecl. Function bodies don't seem to affect the USR, because the main()
functions, for example, have different bodies, but still get the same hash.
On the CodeCompass GUI, if I select a node for usages()
, I get all 10 definitions:
Moreover, right clicking on a reference and hitting "Jump to definition" lists out all of them:
This means that the conventional queries are also not smart enough to find the definition that belongs to the corresponding translation unit. I suspect, this is not possible with the current database. CodeCompass does not handle cases where symbols of the same hash (e.g. same signature) get compiled to different binaries.
Either the hash had to be extended (by adding a mangled translation unit name perhaps?), or we would need some new field to store this information in CppAstNode
.
@mcserep @intjftw We should decide whether this problem should belong to the scope of this issue.
If yes, we should decide how to proceed.
If not, then I can replace the query_one
call with query
and take the first result (if any). This would introduce a potential inaccuracy in this metrics, but it's arguably negligible.
In xerces-c, 14 of the 15886 function definition nodes have such duplicate entities. This is a 0.08% ratio, and it also includes functions that are not methods (like main()
) which mean no problem for us here. Also, most duplicates have the same definition and hence the same McCabe value.
Of course, it is a bit arbitrary to only consider xerces-c. Other projects could potentially be affected more by this issue (though I doubt, this would be too common).
@Seeker04 See #198 about the issue of project-level but not ODR-level (program-level) duplications. That patch stalled, and I highly suspect the main reason is that it was developed in a world prior to incremental analysis (#266) making its way into CodeCompass, and as it stands, #198 is highly incompatible with the latter development.
@whisperity Thank you, this is indeed the information we would need to make this metric more accurate.
@mcserep @intjftw I pushed a workaround that considers the first function definition found in these cases (and also rebased my branch). Now there are no assert failed errors, all CI checks passed. Currently, I don't see a better solution without the linkage information @whisperity mentioned and linked above, so I'm removing the draft status from the PR and adding you as reviewers.
See #682 for Cyclomatic Complexity at Function Level.
Adapted to the object-oriented paradigm, this metric can also be defined for classes as the sum of its methods complexity metric.