Option `--multi-thread` does not make any difference for relatively small repos

davbeek / gitinspectorgui


Option `--multi-thread` does not make any difference for relatively small repos #38

Open davbeek opened 3 weeks ago

davbeek commented 3 weeks ago

In the current implementation, enabling or disabling --multi-thread makes no noticeable difference in execution time for student repos.

davbeek commented 3 weeks ago

I will set the default to --no-multi-thread for now, because it currently makes no difference and it simplifies testing and debugging.

davbeek commented 3 weeks ago

There was no difference for student repos. However, for repo main, which has the website of 4TC00, setting --multi-thread reduces execution time from 36s to 20s.

Alberth289346 commented 3 weeks ago

Not a surprise, I think: getting info from Git is I/O bound. Speed is limited by disk or network bandwidth rather than by CPU data processing.
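
To illustrate the I/O-bound point, here is a minimal sketch, not gitinspectorgui code. `fake_git_call` is a hypothetical stand-in that only sleeps to simulate waiting on disk or network. While a thread waits like this, Python releases the GIL, so several waits can overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_git_call(name: str) -> str:
    # Hypothetical stand-in for an I/O-bound Git query: it only waits.
    time.sleep(0.05)
    return f"blame:{name}"


def run_sequential(names):
    # One call after another: total time is roughly the sum of all waits.
    return [fake_git_call(n) for n in names]


def run_threaded(names, workers=4):
    # Waits overlap across worker threads; pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_git_call, names))


files = [f"file{i}.py" for i in range(8)]
```

Both functions return the same list. Only the elapsed time differs, and the gain depends on how much of the real work is actually spent waiting.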

davbeek commented 3 weeks ago

Indeed. Only with the option --blame-history, which generates a separate blame info table per file per commit (for each commit that changes the file), can we get huge html files. In that case, execution speed is limited by CPU-bound html manipulation.

Alberth289346 commented 3 weeks ago

Back when we used Python to generate output, the standard tactic was to build lists of lines and combine them only at the very end, while writing the output file, like this:

for line in lines:
  write(line)
  write("\n")

(i.e., not write("\n".join(lines)), since that builds a huge intermediate string)
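
A runnable sketch of the two variants, assuming the output is any file-like object (`io.StringIO` is used here only so the result can be inspected; the function names are made up for illustration):

```python
import io


def write_lines(out, lines):
    # Streams each line separately; no large intermediate string is built.
    for line in lines:
        out.write(line)
        out.write("\n")


def write_joined(out, lines):
    # Builds the complete text in memory first, then writes it in one call.
    out.write("\n".join(lines))


lines = ["<tr>", "<td>x</td>", "</tr>"]

buf_streamed = io.StringIO()
write_lines(buf_streamed, lines)

buf_joined = io.StringIO()
write_joined(buf_joined, lines)
```

The only difference in the output is that the streamed variant also ends with a trailing newline.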

Alberth289346 commented 3 weeks ago

Modern Python versions are likely better at avoiding these things, though.

Try running a profiler to find the hot spots.
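
For example, a minimal sketch with the standard-library `cProfile` and `pstats` modules. `build_rows` is a hypothetical stand-in for the real html generation, and `profile_call` is just a helper name made up here:

```python
import cProfile
import io
import pstats


def build_rows(n):
    # Hypothetical CPU-bound stand-in for generating html output.
    return "".join(f"<tr><td>{i}</td></tr>" for i in range(n))


def profile_call(func, *args):
    # Run func under the profiler and return its result plus a text report.
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 entries
    return result, stream.getvalue()


result, report = profile_call(build_rows, 1000)
```

Printing `report` shows which functions account for most of the cumulative time. A whole script can also be profiled without code changes via `python -m cProfile -s cumulative your_script.py`.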

davbeek commented 3 weeks ago

We use BeautifulSoup (bs4), which processes the html through its own internal data structures rather than working on html text directly. Only at the very end is the html text generated, by means of html = str(soup), where soup is the bs4 object that represents the html.