Sparsh1212 / gsocanalyzer

A blazingly fast tool to analyze all the selected organizations in Google Summer of Code in the form of graphical analytics.
MIT License

make merge.py and merge 16 orgs into 6 #74

Closed letsintegreat closed 2 years ago

letsintegreat commented 2 years ago

I've added a script (merge.py) (#73) to automate the process of merging duplicate organizations, which are reported manually by users.

While testing the script, I have merged these duplicates using the script and cross-checked the result manually:


And I found that everything is working as expected. :)

And don't worry, I have left some duplicates so that you can review the PR. ;) Here:

Try to merge these ^ duplicates using the script.

How to use the script?

Run the file and enter the number of duplicate entries, including the original one.

Now enter the exact names of the duplicates.

Note that the first entered name will be the name of the final merged organization. If one or more org names cannot be found, the program exits immediately and prints the names of the entered orgs that are not present in finalData.json.
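The name check described above can be sketched roughly as follows. This is only an illustration, not the code in merge.py: the helper name find_missing is hypothetical, and it assumes finalData.json holds a list of records that each have a "name" key.

```python
def find_missing(orgs, names):
    """Return the entered names that match no org record, in input order."""
    known = {org["name"] for org in orgs}
    return [name for name in names if name not in known]


# Stand-in for the list loaded from finalData.json (assumed shape).
orgs = [{"name": "Org A"}, {"name": "Org A Foundation"}]

missing = find_missing(orgs, ["Org A", "Org B"])
if missing:
    # The real script stops at this point; here we only report the names.
    print("Not found in finalData.json:", ", ".join(missing))
```

Collecting every missing name before stopping lets the user fix all typos in one go instead of one per run.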

How is each field handled in the final merged org?

name

Copied from the first entered org.

url

Copied from the first entered org.

cat

Copied from the first entered org.

tech

Union of all the entered orgs.

top

Union of all the entered orgs.

year

Union of all the entered orgs.

project

Extra care is taken when copying the number of projects. For each year, if only one org has a non-zero number of projects, that number is copied. If more than one org has a non-zero number of projects and those numbers are not the same for all the orgs, a conflict is raised and the user is asked to enter the correct number of projects for that year manually. (This is unlikely to happen, since duplicate entries exist because of different names in different years, but just in case.) (None of the 6 orgs that I merged in this PR raised a conflict.)
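The field rules above can be sketched as follows. This is a minimal illustration under assumptions, not the actual merge.py: the function names are hypothetical, tech, top and year are assumed to be lists, and project is assumed to be a dict mapping a year string to a project count (the real layout of finalData.json may differ). The conflict prompt is passed in as a callback so the rule can be exercised without typing input.

```python
def union_in_order(lists):
    """Union of several lists, keeping first-seen order and dropping repeats."""
    seen = []
    for items in lists:
        for item in items:
            if item not in seen:
                seen.append(item)
    return seen


def merge_projects(orgs, resolve):
    """Merge per-year counts; call resolve(year, counts) when orgs disagree."""
    merged = {}
    years = union_in_order([list(org["project"]) for org in orgs])
    for year in sorted(years):
        # Distinct non-zero counts reported for this year.
        counts = {org["project"].get(year, 0) for org in orgs} - {0}
        if len(counts) > 1:
            merged[year] = resolve(year, sorted(counts))
        else:
            merged[year] = counts.pop() if counts else 0
    return merged


def merge_orgs(orgs, resolve):
    """Build the merged record; orgs[0] is the first entered org."""
    first = orgs[0]
    return {
        "name": first["name"],
        "url": first["url"],
        "cat": first["cat"],
        "tech": union_in_order([o["tech"] for o in orgs]),
        "top": union_in_order([o["top"] for o in orgs]),
        "year": sorted(union_in_order([o["year"] for o in orgs])),
        "project": merge_projects(orgs, resolve),
    }


def ask_user(year, counts):
    """Interactive resolver: only called when a conflict is found."""
    return int(input(f"Conflict for {year} {counts}. Enter the correct number: "))


# Made-up records for illustration only.
a = {"name": "Org A", "url": "https://example.org", "cat": "Web",
     "tech": ["python"], "top": ["web"], "year": [2019],
     "project": {"2019": 4, "2020": 0}}
b = {"name": "Org A Foundation", "url": "https://example.com", "cat": "Other",
     "tech": ["python", "django"], "top": ["cloud"], "year": [2020],
     "project": {"2019": 0, "2020": 6}}

merged = merge_orgs([a, b], ask_user)  # no conflict here, so no prompt appears
```

In this example each year has a non-zero count in only one record, so the counts are copied without asking the user.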

letsintegreat commented 2 years ago

@Sparsh1212

Please capitalize the first letter of every input prompt. Ditto for all other input prompts.

Done.

I noticed that a lot of git diff changes are occurring although we are correcting just 5-6 orgs. This is because your script's order of keys of objects is different than the current order of keys.

Yes, that, and also the script was deleting those organizations and appending the new organization at the end of the file. I have now edited the script so that the newly formed organization is placed where the first entered org was located.
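A rough sketch of that placement change, assuming the data is a list of records with a "name" key (the function name replace_in_place is hypothetical and not taken from merge.py):

```python
def replace_in_place(orgs, names, merged):
    """Drop the orgs listed in names and put merged where names[0] was."""
    result = []
    for org in orgs:
        if org["name"] == names[0]:
            result.append(merged)      # merged record takes the first org's slot
        elif org["name"] not in names:
            result.append(org)         # unrelated orgs keep their position
        # any other entered duplicate is dropped
    return result


orgs = [{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]
merged = {"name": "C", "note": "merged from C and A"}
updated = replace_in_place(orgs, ["C", "A"], merged)
```

Because the untouched records keep their relative order, the diff should only move or change the entries that were actually merged.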

I'll suggest keeping it the same so that we do not see unwanted changes.

Actually, before Python 3.6, dictionaries were unordered, which means the order of their keys was not fixed. CPython 3.6 started preserving insertion order as an implementation detail, and Python 3.7 made it a language guarantee, so dicts now remember the order of their keys. (So, when making changes locally, if you want to keep the order of keys the same, please use Python 3.7 or later.) This time, unlike the last time, I have used Python 3.10 to make the changes in finalData.json, so the ordering of keys is how you described.
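A small illustration of the ordering behaviour, using a made-up record rather than real data from finalData.json. On Python 3.7 or later, json.loads keeps the keys in file order and json.dumps writes them back in the same order:

```python
import json

text = '{"name": "Org A", "url": "https://example.org", "cat": "Web"}'

record = json.loads(text)
key_order = list(record)        # keys come back in the order they appear in the text
round_trip = json.dumps(record)

# Rebuilding the dict with keys in another order changes the written output.
reordered = {"url": record["url"], "name": record["name"], "cat": record["cat"]}
reordered_text = json.dumps(reordered)
```

So a merged record only matches the existing layout if its keys are inserted in the same order as in the other records.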

However, even after making these changes, git diff is still showing all those lines, and I don't know why.
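One way to narrow this down, offered only as a sketch: re-dump the loaded data and find the first line that differs from the original text. If a line differs even though no org was changed, the formatting options passed to json.dumps (for example indent, separators or ensure_ascii) may not match the ones the file was originally written with. The helper name first_difference is hypothetical, and the sample text is made up.

```python
import json


def first_difference(original_text, data, **dump_options):
    """Return the first line number where a re-dump differs, or None."""
    rewritten = json.dumps(data, **dump_options)
    old_lines = original_text.splitlines()
    new_lines = rewritten.splitlines()
    for number, (old, new) in enumerate(zip(old_lines, new_lines), start=1):
        if old != new:
            return number
    if len(old_lines) != len(new_lines):
        return min(len(old_lines), len(new_lines)) + 1
    return None


original = '{\n  "name": "Org A"\n}'
data = json.loads(original)

same = first_difference(original, data, indent=2)       # matching options
different = first_difference(original, data, indent=4)  # wider indentation
```

This comparison works line by line, so it does not detect differences in line endings or in a trailing newline; those would need a separate check.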