I don't really have a big problem with Python. I'll look for ways to speed it up, if that's the blocker. You can only get so far doing that, however, with regard to the current requirements.
@peti2001: thanks for inviting me to give input!
Personally, I didn't have many problems with the Python implementation, even though I don't deal with Python code and dependencies on a daily basis. I'm not sure how big a factor the chosen language is in the perceived slowness.
Thumbs up for the initial definition of "fast" :+1:
As far as recognizing "similar" emails goes, I think git's built-in mailmap feature would be pretty much on-topic for that (and even if it needs to be extended to non-git platforms, the idea and the format are already established, so that might be one less wheel to reinvent). /cc: @JJ
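For reference, a `.mailmap` entry maps alternate commit identities onto one canonical identity; the names and addresses below are purely illustrative:

```
Jane Doe <jane@example.com> <jane@old-host.example>
Jane Doe <jane@example.com> jdoe <jdoe@example.org>
```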
What I'd be happy to see is that the extractor should be able to make a distinction between canonical repos and their forks. I often get credited highly for the forks of my main repos, where I almost never contribute, while I get low points for the repo where 99%+ of my actual work is. I guess the overall total is the same because each commit ID is only credited once, but the ratios between repos are often weird. I believe this might be the case in #132.
I suggest selecting the language after all other questions are clarified. Theoretically, you could choose any language. Golang might be the preferable choice for the use case. But if you can't find the appropriate libraries for the model, or you find poorly maintained libraries, or libraries that don't meet all criteria, it unnecessarily complicates things late in the process. I suggest beginning with the model, then thoroughly screening which libraries are available for which language and which libraries might have to be implemented. This allows for an educated guess of which language might not only be performant but also cost-efficient for implementation and maintenance. Btw, Cython is also a low-level-performant choice.
I think you're looking for a streamable format. It doesn't matter if it's text-based, binary, or compressed; it only has to be streamable, so the server won't have to cache it in memory before processing. Theoretically, this also applies client-side if you can post it to the server during the extraction. There are many streamable formats, and there will be different library choices for each language. The next-best text-based format would be YAML, which allows better streaming due to line separation and a guarantee of completion mid-stream after every line. AFAIK, JS has the most streaming libraries. Golang has native streaming support as well, but not as many libraries.
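As a rough illustration of the streaming idea in Go (a minimal sketch only; the `CommitRecord` type and its fields are made up, not the extractor's real schema): emitting one JSON document per line lets the consumer process records as they arrive instead of loading one huge file.

```go
package main

import (
	"encoding/json"
	"os"
)

// CommitRecord is a hypothetical per-commit record; the field names are
// illustrative only, not the extractor's real schema.
type CommitRecord struct {
	Hash      string   `json:"hash"`
	Author    string   `json:"author"`
	Additions int      `json:"additions"`
	Deletions int      `json:"deletions"`
	Languages []string `json:"languages"`
}

func main() {
	// json.Encoder writes one JSON document per Encode call, followed by '\n',
	// so the output can be consumed line by line without buffering it all.
	enc := json.NewEncoder(os.Stdout)
	records := []CommitRecord{
		{Hash: "abc123", Author: "dev@example.com", Additions: 10, Deletions: 2, Languages: []string{"Go"}},
		{Hash: "def456", Author: "dev@example.com", Additions: 3, Deletions: 1, Languages: []string{"Python"}},
	}
	for _, rec := range records {
		if err := enc.Encode(rec); err != nil {
			panic(err)
		}
	}
}
```

On the receiving side, a `json.Decoder` (or a plain line scanner) can consume the same stream record by record.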
I'd be interested in how you would like to process the git history without a checkout. I agree that a checkout is anything but performant, but a manual emulation means more implementation effort and also maintenance when the git standard changes. Golang also has a library for a virtual file system (vfs) that uses the same interface as the normal file system, but this doesn't work with external binaries; only Golang code can use the interface. Fortunately, there is a native git implementation for Golang (go-git). Theoretically, git could also be processed in a stream. There is an open issue regarding their stream implementation, though.
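For example, go-git can walk the commit history of an existing clone straight from the object database, with no working tree involved (a minimal sketch; the repository path is a placeholder):

```go
package main

import (
	"fmt"

	"github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
)

func main() {
	// Open an existing clone; reading history needs only the .git object
	// database, not a checked-out working tree.
	repo, err := git.PlainOpen("/path/to/repo") // placeholder path
	if err != nil {
		panic(err)
	}

	head, err := repo.Head()
	if err != nil {
		panic(err)
	}

	// Walk the commit graph starting from HEAD.
	iter, err := repo.Log(&git.LogOptions{From: head.Hash()})
	if err != nil {
		panic(err)
	}
	_ = iter.ForEach(func(c *object.Commit) error {
		fmt.Println(c.Hash, c.Author.Email, c.NumParents())
		return nil
	})
}
```

Cloning into `memory.NewStorage()` instead of opening a path on disk would avoid touching the file system entirely.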
Another optimization point I see is caching. The first extraction might take a lot of time, but keeping it up to date shouldn't take the same amount of time. Theoretically, you could cache the stream and limit the checkout history or stream if the history up to that point in time has identical commit metadata. But how do you recognize force-pushes and other changes in the history that trace back to before the date of the last extraction? If done properly, this could save a lot of processing time.
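One way to sketch the incremental case (purely hypothetical, building on the go-git example above): walk the history only until the commit hash cached after the previous run is reached, and fall back to a full re-extraction if it never appears, which covers rewritten history such as a force-push.

```go
package extractor

import (
	"github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing"
	"github.com/go-git/go-git/v5/plumbing/object"
	"github.com/go-git/go-git/v5/plumbing/storer"
)

// commitsSince is a hypothetical incremental walk. lastSeen is the commit hash
// cached after the previous extraction; newCommits are the commits added since
// then. ok is false when lastSeen was never reached (history rewritten, e.g. by
// a force-push), in which case a full re-extraction is needed.
func commitsSince(repo *git.Repository, lastSeen plumbing.Hash) (newCommits []*object.Commit, ok bool, err error) {
	head, err := repo.Head()
	if err != nil {
		return nil, false, err
	}
	iter, err := repo.Log(&git.LogOptions{From: head.Hash()})
	if err != nil {
		return nil, false, err
	}
	err = iter.ForEach(func(c *object.Commit) error {
		if c.Hash == lastSeen {
			ok = true
			return storer.ErrStop // stop walking once the cached commit is reached
		}
		newCommits = append(newCommits, c)
		return nil
	})
	return newCommits, ok, err
}
```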
According to the @peti2001 invite:
> hard to install, requires Python
I disagree: as a user, I don't have any problems with Python (except that I probably would not participate in development with this environment). The "install" part of the readme is great; I always try to use language-agnostic containers, and they work fine for me here.
> too big output, JSON takes too much memory and CPU to process
If I were making a similar app (a repo info extractor, and a scoring algorithm too), I would consider at least four possible ways to refactor:
1. Put up a Docker Compose/stack configuration with a few replicas of the Go/ReactPHP app, with some changes to the output data: it should be rendered in a raw format without JSON decode/encode or object-tree building, just file offsets, for example. That opens the way to concurrent processing by offsets, with no need to perform an additional JSON parsing stage each time. We can also safely cut this data into pieces for further processing. The main problem I see here is properly handling a shared context: if there are requirements/rules in the score calculator (your private algorithm) that affect different repository "pieces" within a single iteration, I don't know the implementation and can't tell much.
Each of these replicas must also take only a part of the input data (a commit range?).
So you can convert the output JSON to a less readable but faster format (offsets, for example) to be able to parallelize it later, using a cluster of low-profile Go/ReactPHP microservices (1 container = 1 instance = 1 thread), and then combine the results into a report for the end user.
2. A single Go app: just input data, goroutines and channels, and let the runtime decide which execution flow to use for the different application parts (for a powerful server, I guess); see the sketch after this list.
3. Socket streaming (for your private environment only): instead of building a fat JSON file, an "adapter" starts to emit data to your other services, and they read and buffer it as they need (memory threshold). ReactPHP example: https://github.com/reactphp/http#requeststreaming (something similar exists in Go).
4. Do not rewrite the code base; create a script that converts the JSON file to a text file :) just to meet the "parsable line by line" requirement.
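A minimal sketch of the single-Go-app idea from point 2 (the `Result` type and `processCommit` function are hypothetical stand-ins for the real extraction work): commits are fanned out to a worker pool over one channel and the results are collected on another.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Result is a hypothetical per-commit result; the real fields would come from the extractor.
type Result struct {
	Hash  string
	Lines int
}

// processCommit stands in for the actual per-commit extraction work.
func processCommit(hash string) Result {
	return Result{Hash: hash, Lines: len(hash)} // placeholder work
}

func main() {
	commits := []string{"abc123", "def456", "0123ab"} // placeholder input

	jobs := make(chan string)
	results := make(chan Result)

	// Start one worker per CPU; each worker pulls commits from the jobs channel.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for hash := range jobs {
				results <- processCommit(hash)
			}
		}()
	}

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Feed the jobs and close the channel so the workers can exit.
	go func() {
		for _, c := range commits {
			jobs <- c
		}
		close(jobs)
	}()

	// Collect results as they become available.
	for r := range results {
		fmt.Println(r.Hash, r.Lines)
	}
}
```

The worker count is tied to `runtime.NumCPU()` here, but it could just as well be a configuration option.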
About overall processing time: I would make a "real" score calculator (which takes time to do its job) and a "fallback" that gives a quick preview / estimated score change before the main one completes the full logic pass.
So, I would start with:
1. An MVP: a single script with the logic "ported" from the Python code (maybe some parts can already be parallelized without any trouble).
2. A benchmark showing how much faster things are than the current Python implementation (personally, I love the built-in testing capabilities in Go, they are really cool).
Thank you for the great suggestions, I'll try to answer all of them :).
@JJ @ferki `.mailmap`: we use this already. We get the list of emails of the repo by `git shortlog -se`, which considers `.mailmap`. When we have this list we find similar emails and names. So if you have a `.mailmap`, it will be more accurate.

Forks are out of the scope. The `repo_info_extractor` just extracts the data; it doesn't know whether it is a fork or not. This logic is on the server side. That is the next refactoring after `repo_info_extractor`.
Many of you mentioned streaming. It is a great idea; we will definitely consider it.
Right now one of the biggest bottlenecks is disk I/O. We go through all the commits, and all the changes are written to disk; for large repos that is a lot of disk I/O, so keeping everything in memory helps a lot. For the script output, we have to know how many lines changed and detect the imported libraries, and for this we don't have to construct the whole file. This is just an idea for now, but I hope we can put together a working prototype.
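To illustrate that last point (a rough sketch, not the actual implementation): counting changed lines and spotting imports only needs the diff text, which can be scanned line by line without ever reconstructing the files on disk. The import pattern below is a simplified Python/Go-style one; a real detector would need per-language rules.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Simplified pattern for added import lines; real detection would be per language.
var importRe = regexp.MustCompile(`^\+\s*(import|from)\s+\S+`)

func main() {
	// diff is a placeholder for a unified diff streamed straight from git,
	// never materialized as checked-out files.
	diff := "+import \"fmt\"\n+fmt.Println(\"hi\")\n-old line\n context"

	added, removed, imports := 0, 0, 0
	scanner := bufio.NewScanner(strings.NewReader(diff))
	for scanner.Scan() {
		line := scanner.Text()
		switch {
		case strings.HasPrefix(line, "+++"), strings.HasPrefix(line, "---"):
			// file headers, not content changes
		case strings.HasPrefix(line, "+"):
			added++
			if importRe.MatchString(line) {
				imports++
			}
		case strings.HasPrefix(line, "-"):
			removed++
		}
	}
	fmt.Println("added:", added, "removed:", removed, "imports:", imports)
}
```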
Which one would be more important for you?
- Faster script 🚀
- GUI 👀
Faster script, definitely.
I think `detect_language.py` should be improved, because I have never used CoffeeScript and I'm in the top 50 from Brazil.
@EduApps-CDG
> I think `detect_language.py` should be improved, because I have never used CoffeeScript and I'm in the top 50 from Brazil.
It looks like in this commit https://github.com/ArttiDev/php-node-project/commit/0b96956ae13ac0c914e81713fb2d179ea4b0fe46 a bunch of files were updated and some of them were CoffeeScript; that is why you got a score for it.
I know these are just libraries but it is hard to detect which code is written by you or just copied from a library.
I understand, in this commit `node_modules` was not in my `.gitignore` file, so that's why I earned the score...
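One partial mitigation (a hypothetical sketch, not something the current tool does) would be to skip well-known vendored directories such as `node_modules` when counting changes; the directory list below is illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// vendoredDirs lists directories that usually contain third-party code.
// The list here is illustrative, not exhaustive.
var vendoredDirs = []string{"node_modules/", "vendor/", "bower_components/", "dist/"}

// isVendoredPath reports whether a changed file path looks like third-party code.
func isVendoredPath(path string) bool {
	for _, dir := range vendoredDirs {
		if strings.HasPrefix(path, dir) || strings.Contains(path, "/"+dir) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"src/app.js", "node_modules/lodash/index.js"} {
		fmt.Println(p, "vendored:", isVendoredPath(p))
	}
}
```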
The current solution has quite a few problems, but most importantly it is hard to use. It is written in Python, and for developers who have no Python experience it is not very convenient.
Problems with the current solution
Requirements:
Nice to have
If you have any suggestions or problems with the current implementation, please share.