TuGraph-family / tugraph-analytics

TuGraph Analytics is a distributed graph compute engine.
https://tugraph-analytics.github.io
Apache License 2.0
634 stars 71 forks source link

Implement the Jaccard Similarity algorithm #42

Open Loognqiang opened 1 year ago

Loognqiang commented 1 year ago

Currently, GeaFlow has implemented many graph computing algorithms, but there are still some graph algorithms that have not been implemented. Please implement the Jaccard Similarity algorithm.

kaori-seasons commented 1 year ago

@pengzhiwei2018 @Loognqiang Hi, community, by searching for information, I found that there are many derivative versions of Jaccard Similarity algorithm, such as LSH algorithm (pairwise calculation of Jaccard similarity), and capturing Jaccard similarity (FuzzyWuzzy string matching). Is it possible to communicate further, explaining the scenario and purpose of applying this algorithm?

Loognqiang commented 1 year ago

@complone Hi, Glad to see you're interested in this algorithm. The Jaccard similarity measures the similarity and differences between limited sample sets using the Jaccard coefficient (or Jaccard index). The higher the Jaccard coefficient, the higher the similarity between the samples. It's similar to the LSH algorithm you described. The input of algorithm is two vertices of graph, and the output is Jaccard coefficient.

kaori-seasons commented 1 year ago

@complone Hi, Glad to see you're interested in this algorithm. The Jaccard similarity measures the similarity and differences between limited sample sets using the Jaccard coefficient (or Jaccard index). The higher the Jaccard coefficient, the higher the similarity between the samples. It's similar to the LSH algorithm you described. The input of algorithm is two vertices of graph, and the output is Jaccard coefficient.

Ok thanks for your reply. Since I am not yet fully familiar with this algorithm, I plan to spend some time sorting out the scheme first, and will submit a PR if there is progress

Loognqiang commented 1 year ago

Thanks for your efforts, I am looking forward to your PR submission!

kaori-seasons commented 1 year ago

The simple implementation of the current algorithm is as follows, the similarity between two nodes can be calculated by calling the apply method

public Double apply(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        return Math.round(calculateJaccardSimilarity(left, right) * 100d) / 100d;
    }

    /**
     * Calculates Jaccard Similarity of two character sequences passed as
     * input. Does the calculation by identifying the union (characters in at
     * least one of the two sets) of the two sets and intersection (characters
     * which are present in set one which are present in set two)
     * 
     * @param left first character sequence
     * @param right second character sequence
     * @return index
     */
    private Double calculateJaccardSimilarity(CharSequence left, CharSequence right) {
        Set<String> intersectionSet = new HashSet<String>();
        Set<String> unionSet = new HashSet<String>();
        boolean unionFilled = false;
        int leftLength = left.length();
        int rightLength = right.length();
        if (leftLength == 0 || rightLength == 0) {
            return 0d;
        }

        for (int leftIndex = 0; leftIndex < leftLength; leftIndex++) {
            unionSet.add(String.valueOf(left.charAt(leftIndex)));
            for (int rightIndex = 0; rightIndex < rightLength; rightIndex++) {
                if (!unionFilled) {
                    unionSet.add(String.valueOf(right.charAt(rightIndex)));
                }
                if (left.charAt(leftIndex) == right.charAt(rightIndex)) {
                    intersectionSet.add(String.valueOf(left.charAt(leftIndex)));
                }
            }
            unionFilled = true;
        }
        return Double.valueOf(intersectionSet.size()) / Double.valueOf(unionSet.size());
    }

The following algorithm can be used to calculate the similarity of multi-segment strings between two vertices:

 public static double computeJaccardsCoefficient(Point p1, Point p2) {
        Collection<String> tweet1 = new TreeSet<String>(Arrays.asList(p1
                .getTweet().split(" ")));

        Collection<String> tweet2 = new TreeSet<String>(Arrays.asList(p2
                .getTweet().split(" ")));

        Collection<String> intersectionOfTweets = new TreeSet<String>(
                tweet1);
        intersectionOfTweets.retainAll(tweet2);

        Collection<String> unionOfTweets = new TreeSet<String>(tweet1);
        unionOfTweets.addAll(tweet2);

        double jaccardsCoefficient = (double) intersectionOfTweets.size()
                / (double) unionOfTweets.size();
        return jaccardsCoefficient;
    }

@Loognqiang However, there is still a problem. In the current udag calculation paradigm, since the flow graph calculation paradigm must be followed, after each algorithm iteration, the values ​​​​of the current node and surrounding nodes need to be updated.

So I understand whether the current requirement is to calculate the node similarity between the current node and the surrounding nodes, and then update the calculated values ​​​​to the surrounding nodes to complete this iteration?

KevinLi724 commented 1 year ago

@kaori-seasons Very glad to see your submission. The core idea of jaccard Similarity algorithm is the same, but there are some differences in application. In the application of graph calculation, jaccard similarity is characterized by calculating the number of all common neighbor nodes between two vertexs. It is required to traverse the entire graph to find the common neighbor nodes of two target points. However, in your submitted algorithm, the jaccard similarity of the two charsequences is calculated, and the scene in the graph calculation is different. About the application of graph algorithm, we have many implementation examples. I hope we can have further discussion !

kaori-seasons commented 1 year ago

@KevinLi724 According to the information you gave, after collecting relevant information, I sorted out an example of the application of graph algorithms.

graph

As shown in the figure, the graph is a directed graph, and the similarity between vertices can be calculated according to the direction.

Calculate the similarity between vertex 1 and vertex 2:

out: V(1)={2,4,5,6},V(2)={3,4}

image

in:V(1)={4},V(2)={1,4}

image

both:V(1)={2,4,5,6},V(2)={1,3,4} image

The above measures the similarity of the specified two vertices for adjacent points in different directions. In graph data calculation, generally speaking, only vertices under a certain label are used for similarity measurement. At this time, we can choose to measure all dimensions, or select specified dimensions for comparison (that is, filter by edge type). Meet the business scenario requirements we need.