citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Long Functions and Long Files #3173

Open SaitTalhaNisanci opened 4 years ago

SaitTalhaNisanci commented 4 years ago

After having had the opportunity to see some parts of the codebase, I would like to make two suggestions regarding long functions and long files. We have some very long functions and files:

Long Functions

We have a style checker, but it doesn't check function complexity. I think it would be nice to have a tool that checks function complexity (for example, flagging deeply nested if/else blocks). When we have long functions, it is hard to follow what a function does at a high level. However, when we replace the lines that make up a logical block with a function call, it is easier to see the high-level flow, and we can go into the function to see the details of what is done.

For example, take a randomly picked method:

old method:

```c
static List *
MapTaskList(MapMergeJob *mapMergeJob, List *filterTaskList)
{
	List *mapTaskList = NIL;
	Query *filterQuery = mapMergeJob->job.jobQuery;
	List *rangeTableList = filterQuery->rtable;
	ListCell *filterTaskCell = NULL;
	Var *partitionColumn = mapMergeJob->partitionColumn;
	Oid partitionColumnType = partitionColumn->vartype;
	char *partitionColumnTypeFullName = format_type_be_qualified(partitionColumnType);
	int32 partitionColumnTypeMod = partitionColumn->vartypmod;
	char *partitionColumnName = NULL;

	List *groupClauseList = filterQuery->groupClause;
	if (groupClauseList != NIL)
	{
		List *targetEntryList = filterQuery->targetList;
		List *groupTargetEntryList = GroupTargetEntryList(groupClauseList,
														  targetEntryList);
		TargetEntry *groupByTargetEntry =
			(TargetEntry *) linitial(groupTargetEntryList);

		partitionColumnName = groupByTargetEntry->resname;
	}
	else
	{
		partitionColumnName = ColumnName(partitionColumn, rangeTableList);
	}

	foreach(filterTaskCell, filterTaskList)
	{
		Task *filterTask = (Task *) lfirst(filterTaskCell);
		uint64 jobId = filterTask->jobId;
		uint32 taskId = filterTask->taskId;
		Task *mapTask = NULL;

		/* wrap repartition query string around filter query string */
		StringInfo mapQueryString = makeStringInfo();
		char *filterQueryString = filterTask->queryString;
		char *filterQueryEscapedText = quote_literal_cstr(filterQueryString);
		PartitionType partitionType = mapMergeJob->partitionType;

		if (partitionType == RANGE_PARTITION_TYPE)
		{
			ShardInterval **intervalArray = mapMergeJob->sortedShardIntervalArray;
			uint32 intervalCount = mapMergeJob->partitionCount;
			ArrayType *splitPointObject = SplitPointObject(intervalArray, intervalCount);
			StringInfo splitPointString = SplitPointArrayString(splitPointObject,
																partitionColumnType,
																partitionColumnTypeMod);

			appendStringInfo(mapQueryString, RANGE_PARTITION_COMMAND, jobId, taskId,
							 filterQueryEscapedText, partitionColumnName,
							 partitionColumnTypeFullName, splitPointString->data);
		}
		else if (partitionType == SINGLE_HASH_PARTITION_TYPE)
		{
			ShardInterval **intervalArray = mapMergeJob->sortedShardIntervalArray;
			uint32 intervalCount = mapMergeJob->partitionCount;
			ArrayType *splitPointObject = SplitPointObject(intervalArray, intervalCount);
			StringInfo splitPointString = SplitPointArrayString(splitPointObject,
																partitionColumnType,
																partitionColumnTypeMod);

			appendStringInfo(mapQueryString, HASH_PARTITION_COMMAND, jobId, taskId,
							 filterQueryEscapedText, partitionColumnName,
							 partitionColumnTypeFullName, splitPointString->data);
		}
		else
		{
			uint32 partitionCount = mapMergeJob->partitionCount;
			ShardInterval **intervalArray = GenerateSyntheticShardIntervalArray(partitionCount);
			ArrayType *splitPointObject = SplitPointObject(intervalArray,
														   mapMergeJob->partitionCount);
			StringInfo splitPointString = SplitPointArrayString(splitPointObject, INT4OID,
																get_typmodin(INT4OID));

			appendStringInfo(mapQueryString, HASH_PARTITION_COMMAND, jobId, taskId,
							 filterQueryEscapedText, partitionColumnName,
							 partitionColumnTypeFullName, splitPointString->data);
		}

		/* convert filter query task into map task */
		mapTask = filterTask;
		mapTask->queryString = mapQueryString->data;
		mapTask->taskType = MAP_TASK;

		mapTaskList = lappend(mapTaskList, mapTask);
	}

	return mapTaskList;
}
```

can be refactored as follows (this is pseudocode):

new method:

```c
static List *
MapTaskList(MapMergeJob *mapMergeJob, List *filterTaskList)
{
	List *mapTaskList = NIL;
	Query *filterQuery = mapMergeJob->job.jobQuery;
	List *rangeTableList = filterQuery->rtable;
	ListCell *filterTaskCell = NULL;
	Var *partitionColumn = mapMergeJob->partitionColumn;
	Oid partitionColumnType = partitionColumn->vartype;
	char *partitionColumnTypeFullName = format_type_be_qualified(partitionColumnType);
	int32 partitionColumnTypeMod = partitionColumn->vartypmod;
	char *partitionColumnName = --FindPartitionColumnName()--;

	foreach(filterTaskCell, filterTaskList)
	{
		Task *filterTask = (Task *) lfirst(filterTaskCell);

		--HandlePartitionType(filterTask, mapMergeJob->partitionType);--

		mapTask = **ConvertFilterQueryIntoMapTask()**;
		mapTaskList = lappend(mapTaskList, mapTask);
	}

	return mapTaskList;
}


static void
HandlePartitionType(Task *filterTask, PartitionType partitionType)
{
	uint64 jobId = filterTask->jobId;
	uint32 taskId = filterTask->taskId;

	/* wrap repartition query string around filter query string */
	StringInfo mapQueryString = makeStringInfo();
	char *filterQueryString = filterTask->queryString;
	char *filterQueryEscapedText = quote_literal_cstr(filterQueryString);

	if (partitionType == RANGE_PARTITION_TYPE)
	{
		--HandleRangePartitionType();--
	}
	else if (partitionType == SINGLE_HASH_PARTITION_TYPE)
	{
		--HandleSingleHashPartitionType();--
	}
	else
	{
		--HandleGenericPartitionType();--
	}
}
```

I don't believe there is a hard limit on how many lines a function may be at maximum, but it should be easy to follow the function, and 100 lines will almost always be too long. When we see a logical block, it is an opportunity to create a function. With more functions we also prevent duplicated code, and the code becomes more self-documenting and easier to follow.
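For instance, the `FindPartitionColumnName()` placeholder in the pseudocode above could simply take over the group-clause branch of the old method. A rough sketch (the signature is my assumption; the body is lifted from the old method as-is):

```c
/*
 * FindPartitionColumnName returns the name of the column the map tasks
 * partition on: the first GROUP BY target if there is one, otherwise the
 * partition column as it appears in the range table.
 */
static char *
FindPartitionColumnName(MapMergeJob *mapMergeJob)
{
	Query *filterQuery = mapMergeJob->job.jobQuery;
	Var *partitionColumn = mapMergeJob->partitionColumn;
	List *groupClauseList = filterQuery->groupClause;

	if (groupClauseList != NIL)
	{
		/* partition on the first GROUP BY target, as the original code does */
		List *targetEntryList = filterQuery->targetList;
		List *groupTargetEntryList = GroupTargetEntryList(groupClauseList,
														  targetEntryList);
		TargetEntry *groupByTargetEntry =
			(TargetEntry *) linitial(groupTargetEntryList);

		return groupByTargetEntry->resname;
	}

	/* otherwise resolve the column name from the range table */
	return ColumnName(partitionColumn, filterQuery->rtable);
}
```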

Long Files

Number of lines sorted in descending order:

```
find . -name "*.c" -type f -exec wc -l {} + | sort -rn | head -20
 126357 total
   7955 ./src/backend/distributed/utils/ruleutils_12.c
   7954 ./src/backend/distributed/utils/ruleutils_11.c
   5641 ./src/backend/distributed/planner/multi_physical_planner.c
   4293 ./src/backend/distributed/planner/multi_logical_optimizer.c
   4077 ./src/backend/distributed/utils/metadata_cache.c
   3729 ./src/backend/distributed/executor/adaptive_executor.c
   3495 ./src/backend/distributed/commands/multi_copy.c
   3204 ./src/backend/distributed/planner/multi_router_planner.c
   3036 ./src/backend/distributed/executor/multi_task_tracker_executor.c
   2306 ./src/backend/distributed/planner/multi_logical_planner.c
   2105 ./src/backend/distributed/executor/multi_router_executor.c
   2014 ./src/backend/distributed/planner/relation_restriction_equivalence.c
   1919 ./src/backend/distributed/planner/query_pushdown_planning.c
   1814 ./src/backend/distributed/utils/node_metadata.c
   1765 ./src/backend/distributed/planner/distributed_planner.c
   1755 ./src/backend/distributed/planner/recursive_planning.c
   1565 ./src/backend/distributed/planner/shard_pruning.c
   1470 ./src/backend/distributed/planner/insert_select_planner.c
   1466 ./src/backend/distributed/master/master_metadata_utility.c
```

With very long files, there is a lot of logic in a single file, so it is harder to follow the code or to find something. I think we should split very long files into multiple files so that it is clearer what each file contains. There is certainly no rule that X lines is too much, but when a file contains tens of functions, it is simply hard to follow.

I think we could also have more folders. For example, each executor could have its own folder:

```
./executor/adaptive_executor/
./executor/adaptive_executor/x.c
./executor/adaptive_executor/y.c
./executor/adaptive_executor/z.c
./executor/task_tracker/
./executor/task_tracker/x.c
./executor/task_tracker/y.c
./executor/task_tracker/z.c
```

I wonder what people think about this?

marcocitus commented 4 years ago

I don't find file size problematic. I like scrolling to quickly see the structure of the code. What does bother me is poor categorization/structure. For example, a ton of widely used planner utility functions sit in multi_physical_planner.c and multi_logical_planner.c, which makes the structure hard to follow. I am also not sure what the division between metadata_cache.c and master_metadata_utility.c is.

adaptive_executor.c is reasonably coherent as it encapsulates all of distributed query execution (incl. data structures), but multi_copy.c handles many subtly different COPY implementations, some of which are highly obscure (e.g. CopyFromWorkerNode). Pulling that one apart would make a lot of sense.

Poor structure correlates with the age of the code: we added small changes over time, or made private functions public and then generalised them into utility functions that became detached from their original file, but we leave them in the same place because then the reviewer can see the changes. The files that bother me the most are also the oldest, including files I originally wrote myself (like multi_copy.c).

For functions, it's also more about structure to me and whether the code helps me understand what it does. I don't find BuildSubPlanResultQuery (~160 lines) hard to follow. I can learn a lot about the data structures just by looking at that function. I do find BuildJobTree (~170 lines) and MapTaskList (~90 lines) very hard to follow. Part of this also comes down to effective use of comments (and alignment).

The most important thing about functions is naming/making the code understandable. HandlePartitionType does not help me understand what the code does. I find not using a function (seeing what the code does) better than a function with a poor name, although a poor name can be forgiven by a good comment. In PostgreSQL code >300 line functions are pretty common, but long functions without comments are very rare.

Part of it also comes down to C. Not being able to return data structures without having to think about memory allocation, and not being able to nicely encapsulate changes to a data structure, often leads to longer functions than in other languages. Another driving factor is short line length combined with descriptive variable names.
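To illustrate that point with a contrived sketch that reuses the split-point code from MapTaskList above (not actual Citus code; the struct and helper are made up): a helper that wants to hand back two related values has no natural return type in C, so it either pallocs a struct or fills a caller-provided out-parameter, and that ceremony often makes keeping the code inline in the caller look simpler.

```c
/* a pair of related results has to become an explicit struct in C */
typedef struct SplitPointData
{
	ArrayType *splitPointObject;
	StringInfo splitPointString;
} SplitPointData;

/*
 * Either this helper allocates SplitPointData and the caller must know who
 * owns it, or (as here) the caller allocates it and passes a pointer in.
 * In languages that can just return a pair, the same extraction is "free".
 */
static void
ComputeSplitPoints(MapMergeJob *mapMergeJob, Oid columnType, int32 columnTypeMod,
				   SplitPointData *result)
{
	ShardInterval **intervalArray = mapMergeJob->sortedShardIntervalArray;
	uint32 intervalCount = mapMergeJob->partitionCount;

	result->splitPointObject = SplitPointObject(intervalArray, intervalCount);
	result->splitPointString = SplitPointArrayString(result->splitPointObject,
													 columnType, columnTypeMod);
}
```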

SaitTalhaNisanci commented 4 years ago

I guess that, regarding files, I agree that structure matters the most. In general, a long file contains many different functions, which can be grouped together in some way (where each group becomes a file).

A function with a poor name is definitely not good (in the example I hadn't thought about the names). I think we should always try to name functions nicely so that it is easy to understand what they do (and not just functions, everything basically). When a function is long, it is hard to name it, because it does many things. Writing a comment that forgives a bad name creates one more mapping in my mind, because I need to hover over the function and read the comment to understand what it does.

I think it is relatively easy to understand BuildSubPlanResultQuery because it doesn't do a lot of things, even though it has many lines. I think another problem with long functions is that I can lose track of what I am looking at. What does the part I am looking at do? If we have a small function, I can just look at its name and understand it (assuming the name is chosen nicely).

I don't understand how a >300 line function is ever useful :)

I would say that small functions with well-chosen names are much easier to follow than long functions. Would you agree with this?

Also, writing small functions is closer to the Single Responsibility Principle. I think that when we create functions, we more easily notice similar parts within long functions, or across functions, which results in a better design with less duplicated code.
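For example, the RANGE and SINGLE_HASH branches in the old MapTaskList above compute the split points in exactly the same way; once that block is extracted, the duplication becomes obvious and disappears. A rough sketch (the helper's name and signature are made up for illustration):

```c
/*
 * BuildSplitPointString computes the split points of the map/merge job and
 * renders them as the array literal that the partition commands expect.
 * This single helper could replace the identical blocks in the RANGE and
 * SINGLE_HASH branches of the original MapTaskList.
 */
static StringInfo
BuildSplitPointString(MapMergeJob *mapMergeJob, Oid partitionColumnType,
					  int32 partitionColumnTypeMod)
{
	ShardInterval **intervalArray = mapMergeJob->sortedShardIntervalArray;
	uint32 intervalCount = mapMergeJob->partitionCount;
	ArrayType *splitPointObject = SplitPointObject(intervalArray, intervalCount);

	return SplitPointArrayString(splitPointObject, partitionColumnType,
								 partitionColumnTypeMod);
}
```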