Closed jlcanovas closed 8 years ago
It's a known problem. It may happen in theory, but (IMHO) it's really unlikely to happen in practice. Anyway, I can try to find another solution in SQL (I hope the calculation time won't increase), otherwise the alternative is do it in Java.
As we need to identify uniquely each pair of authors, I'd propose to rely on string-based identifiers instead of number-based ones.
For instance, connection_id
could be calculated as a string composed of the string representation of each author id (ordered from greater to lower) and separated by a hyphen. This CONCAT command could do it:
CONCAT(GREATEST(source_authors.author_id, target_authors.author_id), "-", LEAST(source_authors.author_id, target_authors.author_id)) as connection_id
This way, the connection for authors identified by 2 and 3 will be "3-2" (for the two directions). As author identifiers are unique, the concatenation (with an hyphen) should also be unique.
I've done some test queries and the calculation time seems to be the same.
I can take care of this, change the WAR and re-deploy.
Grouping on an int is supposed to be faster than grouping on a varchar. What I would propose is to change the query by adding another connection_id (and keeping the previous one):
...
SELECT source_authors.author AS source_author_name, source_authors.author_id AS source_author_id,
target_authors.author AS target_author_name, target_authors.author_id AS target_author_id,
COUNT(*) AS relation_strength, source_authors.author_id * target_authors.author_id AS connection_id1,
CONCAT(GREATEST(source_authors.author_id, target_authors.author_id), LEAST(source_authors.author_id, target_authors.author_id)) as connection_id2
FROM (
SELECT pub.id AS pub, author, author_id
FROM dblp_pub_new pub
JOIN dblp_authorid_ref_new airn
ON pub.id = airn.id
WHERE source = 'icse'
AND pub.type = 'inproceedings'
) AS source_authors
JOIN (
SELECT pub.id AS pub, author, author_id
FROM dblp_pub_new pub
JOIN dblp_authorid_ref_new airn
ON pub.id = airn.id
WHERE source = 'icse'
AND pub.type = 'inproceedings'
) AS target_authors
ON source_authors.pub = target_authors.pub
AND source_authors.author_id <> target_authors.author_id
GROUP BY source_authors.author_id, target_authors.author_id
) AS x
GROUP BY connection_id1, connection_id2
) AS connections
...
Yes, it's faster but the problem may appear. I've been doing some tests comparing int-based and string-based identifiers (each individual time is calculated as the average of five consecutives calls) and the time difference is around 0.013 secs. I'd say we can assume that delay to avoid the problem.
Your proposal for connection_id2
concats two numbers but does not include a separator, which may drive to a clashing problem as well (e.g., authors with "1" and "23" will be represented as "123", which would be the same as for the authors "12" and "3"). The meaning of the separator is to avoid this situation.
I think the clashing problem is really unlikely to happen, however I also think that we have spent enough time on this issue. Feel free to take the decision you want.
Ok, I'll deploy the String-based version
The query used in the class VenueAuthorCollaborationServlet to get the data may drive to clashing problem. The following screenshot shows the code of the query involved in the issue (see line 9 in the example):
Coauthors are identified (and then grouped) according to the multiplication of their author identifiers. Given this situation, the coauthors identified by "2" and "3" would be identified by "6", which would be the same as the coauthors identified by "1" and "6".
The real code line is located here:
https://github.com/SOM-Research/metaScience/blob/5f1c2e1b8e6b67238001c2ed71289ec5d9285391/web-server/metaScience/src/som/metascience/VenueAuthorCollaborationServlet.java#L82