Inefficient "partition" variable

skojaku commented 1 year ago

The variable partition is a dict object, with keys and values corresponding to the node ids and group ids, respectively.

https://github.com/BianKov/iterEmb/blob/cbcb87e7ec6ae92c01f91748ced4cf365ee7abbe/iteremb/iteremb/communityDetection.py#L80

A dictionary is an inefficient object for this purpose. A faster, more compact, and more convenient choice is a numpy array, since it is essentially a binary object that can be manipulated with functions implemented in C (with optimizations). So I propose using a numpy array instead of a dict object.
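To make the proposal concrete, here is a minimal sketch of the same partition stored both ways. The toy data and the names `membership`, `group_sizes`, and `nodes_in_group_1` are illustrative assumptions, not code from the repository, and the sketch assumes node ids are already the integers 0 to n-1.

```python
import numpy as np

# Hypothetical toy partition in the dict form described above:
# node id -> group id (both are integers in this example).
partition = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

# Array form: position i holds the group id of node i.
# This assumes the node ids are exactly 0, 1, ..., n_nodes - 1.
n_nodes = len(partition)
membership = np.empty(n_nodes, dtype=np.int64)
for node, group in partition.items():
    membership[node] = group

# Vectorized operations replace Python-level loops over the dict.
group_sizes = np.bincount(membership)                # number of nodes per group
nodes_in_group_1 = np.flatnonzero(membership == 1)   # node ids belonging to group 1
```

With the array form, per-group statistics and lookups become single numpy calls rather than loops over dictionary items.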

BianKov commented 1 year ago

Okay, we can use numpy arrays. The reason I always use dictionaries (not only for storing the partitions, but also for storing the coordinate arrays in the embeddings) is that this way I can assign the community names (which can then be strings, not only numbers) and the position vectors directly to the node names (which can also be strings), instead of introducing integer node indexes and assigning everything to those. For a while I used node indexes and numpy arrays, but at some point I always found it too complicated to keep track everywhere of the conversion between node name (possibly a string) and node index (an integer from 0 to numOfNodes-1). But if you can easily make numpy arrays work even when the user starts from a graph whose nodes have string names and wants to access the communities/position vectors by node name rather than by node index, that's totally fine.
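One possible way to keep name-based access while storing the partition as an array is sketched below. The node names, community labels, and the helper `community_of` are hypothetical and only illustrate the name-to-index conversion discussed above.

```python
import numpy as np

# Hypothetical input: string node names and string community labels.
partition = {"alice": "red", "bob": "blue", "carol": "red"}

# Fix a node order once; integer index i refers to node_names[i].
node_names = np.array(list(partition))
name_to_index = {name: i for i, name in enumerate(node_names)}

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# an integer code per node that points into that label array.
group_labels, membership = np.unique(
    [partition[name] for name in node_names], return_inverse=True
)

def community_of(name):
    """Look up the community label of a node by its (string) name."""
    return group_labels[membership[name_to_index[name]]]
```

The conversion happens in one place: all computation can run on `membership`, and `community_of("alice")` translates back to the original names only when a user asks for them.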

skojaku commented 1 year ago

Yes, it is indeed inconvenient. But this little inconvenience pays off when the data is big and not clean. So I'd like to encourage creating a master table that contains the node index together with metadata such as node names (not necessarily in this project, since it is already past its early stage and we have results).

Why is the index important? Not everything can be an "index". For instance, an important requirement is that an index has to be unique. Node names can be duplicated simply by mistake, and our program can easily break if that happens, so we have to make sure that every node has a unique index. I know that there are some network datasets in which node names are used as the index, but this should be avoided from the perspective of quality assurance.
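As a rough illustration of such a master table with a uniqueness check, the sketch below pairs each node name with an integer index and fails early on duplicated names. The function `build_node_table` is a hypothetical helper, not part of the project.

```python
import numpy as np

def build_node_table(raw_names):
    """Return (names, index), where index[i] is the integer id of names[i].

    Raises ValueError if any node name appears more than once.
    """
    names = np.asarray(raw_names)
    unique_names, counts = np.unique(names, return_counts=True)
    duplicated = unique_names[counts > 1]
    if duplicated.size > 0:
        raise ValueError(f"duplicated node names: {duplicated.tolist()}")
    # Indexes 0 .. n-1 are unique by construction.
    return names, np.arange(names.size)

# Usage with a clean, hypothetical list of names.
names, index = build_node_table(["a", "b", "c"])
```

Checking uniqueness once, when the table is built, means a duplicated name is reported at load time instead of silently overwriting an entry later.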

Furthermore, strings should be avoided as much as possible in computation. Integers are much more lightweight and easier for computers to handle, and this little coding effort yields a big speed-up in the end (just like the one I got with TREXPIC).
