SOM-Research / metaScience

Online service for analyzing research profiles of scientists and conferences
12 stars 1 forks source link

Potential clashing problem in VenueAuthorCollaborationServlet #51

Closed jlcanovas closed 8 years ago

jlcanovas commented 8 years ago

The query used in the class VenueAuthorCollaborationServlet to get the data may drive to clashing problem. The following screenshot shows the code of the query involved in the issue (see line 9 in the example):

clash

Coauthors are identified (and then grouped) according to the multiplication of their author identifiers. Given this situation, the coauthors identified by "2" and "3" would be identified by "6", which would be the same as the coauthors identified by "1" and "6".

The real code line is located here:

https://github.com/SOM-Research/metaScience/blob/5f1c2e1b8e6b67238001c2ed71289ec5d9285391/web-server/metaScience/src/som/metascience/VenueAuthorCollaborationServlet.java#L82

valeriocos commented 8 years ago

It's a known problem. It may happen in theory, but (IMHO) it's really unlikely to happen in practice. Anyway, I can try to find another solution in SQL (I hope the calculation time won't increase), otherwise the alternative is do it in Java.

jlcanovas commented 8 years ago

As we need to identify uniquely each pair of authors, I'd propose to rely on string-based identifiers instead of number-based ones.

For instance, connection_id could be calculated as a string composed of the string representation of each author id (ordered from greater to lower) and separated by a hyphen. This CONCAT command could do it:

CONCAT(GREATEST(source_authors.author_id, target_authors.author_id), "-", LEAST(source_authors.author_id, target_authors.author_id)) as connection_id

This way, the connection for authors identified by 2 and 3 will be "3-2" (for the two directions). As author identifiers are unique, the concatenation (with an hyphen) should also be unique.

I've done some test queries and the calculation time seems to be the same.

I can take care of this, change the WAR and re-deploy.

valeriocos commented 8 years ago

Grouping on an int is supposed to be faster than grouping on a varchar. What I would propose is to change the query by adding another connection_id (and keeping the previous one):

...
            SELECT source_authors.author AS source_author_name, source_authors.author_id AS source_author_id,
                        target_authors.author AS target_author_name, target_authors.author_id AS target_author_id,
                        COUNT(*) AS relation_strength, source_authors.author_id * target_authors.author_id AS connection_id1,
                        CONCAT(GREATEST(source_authors.author_id, target_authors.author_id), LEAST(source_authors.author_id, target_authors.author_id)) as connection_id2
            FROM (
                SELECT pub.id AS pub, author, author_id
                FROM dblp_pub_new pub
                JOIN dblp_authorid_ref_new airn
                ON pub.id = airn.id
                WHERE source = 'icse'
                AND pub.type = 'inproceedings'
            ) AS source_authors
            JOIN (
                SELECT pub.id AS pub, author, author_id
                FROM dblp_pub_new pub
                JOIN dblp_authorid_ref_new airn
                ON pub.id = airn.id
                WHERE source = 'icse'
                AND pub.type = 'inproceedings'
            ) AS target_authors
            ON source_authors.pub = target_authors.pub 
            AND source_authors.author_id <> target_authors.author_id
            GROUP BY source_authors.author_id, target_authors.author_id
        ) AS x
        GROUP BY connection_id1, connection_id2
    ) AS connections
...
jlcanovas commented 8 years ago

Yes, it's faster but the problem may appear. I've been doing some tests comparing int-based and string-based identifiers (each individual time is calculated as the average of five consecutives calls) and the time difference is around 0.013 secs. I'd say we can assume that delay to avoid the problem.

comparison

Your proposal for connection_id2 concats two numbers but does not include a separator, which may drive to a clashing problem as well (e.g., authors with "1" and "23" will be represented as "123", which would be the same as for the authors "12" and "3"). The meaning of the separator is to avoid this situation.

valeriocos commented 8 years ago

I think the clashing problem is really unlikely to happen, however I also think that we have spent enough time on this issue. Feel free to take the decision you want.

jlcanovas commented 8 years ago

Ok, I'll deploy the String-based version