Open davbeek opened 3 weeks ago
I will set the default to --no-multi-thread for now, because it currently makes no difference and it simplifies testing and debugging.
There was no difference for student repos. However, for repo main, which has the website of 4TC00, setting --multi-thread reduces execution time from 36 s to 20 s.
Not a surprise, I think: getting info from Git is I/O bound. Speed is limited by disk or network bandwidth rather than by CPU data processing.
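To illustrate why threads can help with I/O-bound work, here is a minimal sketch. It is not the tool's actual implementation: `fetch_repo_info` is a hypothetical stand-in that simulates a blocking Git call with `time.sleep`, and the repo names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_repo_info(name: str) -> str:
    """Hypothetical stand-in for an I/O-bound Git call (assumption)."""
    time.sleep(0.05)  # simulate waiting on disk or network
    return f"info:{name}"


def run_sequential(names):
    # One call after the other: total time is the sum of all waits.
    return [fetch_repo_info(n) for n in names]


def run_threaded(names, max_workers=8):
    # Threads overlap the waiting time; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_repo_info, names))


names = [f"repo{i}" for i in range(8)]

start = time.perf_counter()
seq = run_sequential(names)
t_seq = time.perf_counter() - start

start = time.perf_counter()
thr = run_threaded(names)
t_thr = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

Both variants return the same results. Only the waiting overlaps, which is why the gain shows up for I/O-bound work and not for CPU-bound work.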
Indeed. Only with option --blame-history, which generates a separate blame info table per file for each commit that changes the file, can we get huge HTML files, where execution speed is limited by CPU-bound HTML manipulation.
Back when we used Python to generate output, the standard tactic was to build a list of lines and only combine them at the very end, while writing the output file, like:

    for line in lines:
        write(line)
        write("\n")

(i.e., not write("\n".join(lines)), since that builds a huge intermediate string)
Modern Python versions are likely better at avoiding these things, though.
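A runnable version of that tactic might look like the sketch below. The helper name `write_lines` and the sample lines are assumptions for illustration; an `io.StringIO` buffer stands in for the real output file.

```python
import io


def write_lines(out, lines):
    """Write each line separately, so no large joined string is built first."""
    for line in lines:
        out.write(line)
        out.write("\n")


lines = ["<html>", "<body>", "</body>", "</html>"]

# Incremental variant: the file object does the buffering.
buf = io.StringIO()
write_lines(buf, lines)
incremental = buf.getvalue()

# Join variant: builds the complete string in memory before writing.
joined = "\n".join(lines) + "\n"
```

Both variants produce the same text. They differ only in whether the full string exists in memory before it is written.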
Try running a profiler to find the hot spots.
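As a rough sketch of what that could look like with the standard library's `cProfile` and `pstats`: the functions `build_rows` and `render` are hypothetical placeholders for the real HTML generation code.

```python
import cProfile
import io
import pstats


def build_rows(n):
    # Hypothetical CPU-bound work: build many table rows.
    return "".join(f"<tr><td>{i}</td></tr>" for i in range(n))


def render(n):
    return "<table>" + build_rows(n) + "</table>"


profiler = cProfile.Profile()
profiler.enable()
html = render(10_000)
profiler.disable()

# Collect the report in a string; sort by cumulative time to see hot spots.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

For a whole script, `python -m cProfile -s cumulative script.py` gives a similar report without any code changes.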
We use BeautifulSoup (bs4), which processes HTML using its own internal data structures rather than HTML text. Only at the very end is HTML generated, by means of html = str(soup), where soup is the bs4 object that represents the HTML.
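A small sketch of that pattern, assuming bs4 is installed and using Python's built-in `html.parser`. The table contents are made up for illustration and are not the tool's real output.

```python
from bs4 import BeautifulSoup

# Start from a tiny skeleton; all further work happens on the bs4 tree.
soup = BeautifulSoup("<table></table>", "html.parser")
table = soup.table

# Hypothetical data: (file name, line count) pairs.
for name, count in [("a.py", 10), ("b.py", 20)]:
    row = soup.new_tag("tr")
    for value in (name, count):
        cell = soup.new_tag("td")
        cell.string = str(value)
        row.append(cell)
    table.append(row)

# HTML text is produced only here, in a single serialization step.
html = str(soup)
print(html)
```

Because serialization happens once at the end, a profile of a large run should show whether the time goes into tree manipulation or into the final `str(soup)` call.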
In the current implementation, having --multi-thread active or not does not make any relevant difference for student repos.