Bergvca / string_grouper

Super Fast String Matching in Python
MIT License

Group Connectivity Visualization may reveal other possible representatives #36

ParticularMiner closed 3 years ago

ParticularMiner commented 3 years ago

Hi @Bergvca , @justasojourner

I'm no expert in graph visualization, but take a look at these graph drawings of two of string-grouper's groups, taken from the first 50,000 records of the sec__edgar_company_info.csv file:

[Images: group0, group1]

Images were rendered using Gephi 0.9.2.

Another candidate for group representative might be the string with the highest number of matches (in graph theory, the node of highest degree). What do you think?
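To make the idea concrete, here is a minimal sketch of picking the highest-degree string in a group. The `edges` list and the `degrees` helper are made up for illustration and are not part of string_grouper's API.

```python
from collections import defaultdict

# Hypothetical direct matches (undirected edges) among strings in one group.
edges = [
    ("ACME INC", "ACME INC."),
    ("ACME INC", "ACME INCORPORATED"),
    ("ACME INC", "ACME, INC"),
    ("ACME INC.", "ACME, INC"),
]

def degrees(edges):
    """Count direct matches (graph degree) per string, ignoring self-matches."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {node: len(adj) for node, adj in neighbours.items()}

deg = degrees(edges)
# Candidate representative: the string with the most direct matches.
rep = max(deg, key=deg.get)
print(rep, deg[rep])
```

Note that `max` returns the first maximal key it encounters, so when several strings share the top degree the choice depends on insertion order.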

Bergvca commented 3 years ago

Hi @ParticularMiner,

First of all, cool visualisations!

Do you mean the string with the highest number of matches before calling the "connected_components" function? After that, they should all have the same number of matches, right?
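The distinction can be sketched as follows. This is a plain-Python stand-in for a connected-components step, not the function string_grouper actually calls, and the toy `edges` are hypothetical. Direct-match counts differ between nodes, while the number of group-mates is the same for every member of a component.

```python
from collections import defaultdict, deque

# Hypothetical direct matches: a chain A-B-C plus a separate pair D-E.
edges = [("A", "B"), ("B", "C"), ("D", "E")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def components(adj):
    """Return the connected components (sets of nodes), found by BFS."""
    seen, out = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in comp:
                    comp.add(nxt)
                    queue.append(nxt)
        seen |= comp
        out.append(comp)
    return out

groups = components(adj)
# Degree before grouping: varies from node to node.
direct = {n: len(adj[n]) for n in adj}
# Group-mates after grouping: identical for all members of a group.
group_mates = {n: len(g) - 1 for g in groups for n in g}
```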

ParticularMiner commented 3 years ago

Thanks @Bergvca. Gephi turns graph visualization into more of a task in aesthetics than science. Still, I find it quite useful.

Sorry for not being clear: indeed I meant the string with the highest number of direct matches as specified by _matches_list.

Actually, my suggestion of this other group rep is now moot: after symmetrizing the matches-list, many of the strings in one group turned out to have the same number of direct matches (as you probably guessed), which sometimes makes the choice of rep ambiguous. The centroid, on the other hand, almost always removes this ambiguity, making it the natural choice.
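A small sketch of that ambiguity, using made-up data. It assumes "centroid" means the group member with the largest summed similarity to the other members; the `matches` triples are hypothetical and do not reflect the actual layout of `_matches_list`.

```python
from collections import defaultdict

# Hypothetical one-directional matches: (left, right, similarity).
matches = [
    ("A", "B", 0.9),
    ("A", "C", 0.8),
    ("B", "C", 0.95),
]

# Symmetrize: every match also appears in the reverse direction.
sym = {}
for left, right, score in matches:
    sym[(left, right)] = score
    sym[(right, left)] = score

degree = defaultdict(int)
total_similarity = defaultdict(float)
for (left, _right), score in sym.items():
    degree[left] += 1
    total_similarity[left] += score

# After symmetrizing, all three strings have the same degree ...
top_degree = max(degree.values())
tied = sorted(n for n, d in degree.items() if d == top_degree)
# ... while the summed similarity still singles out one member.
centroid = max(total_similarity, key=total_similarity.get)
```

Before symmetrizing, the same data would give "A" two outgoing matches, "B" one and "C" none, which is why the unsymmetrized counts looked more discriminating than they really are.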