Faster way of performing AUC Evaluations on larger datasets.

Description:

While working with the Beeline datasets as part of GSoC, I had difficulty running the evaluation pipeline to generate AUC scores. The file in question is computeDGAUC.py in the BLEval folder. I've therefore implemented an optimized version of the computeScores function that significantly improves performance and efficiency, especially for large genetic networks. Here's a comparison of the old and new implementations:

Previous Implementation:

Used nested loops and DataFrame operations for edge lookups
Initialized dictionaries with all possible edges before filling them
Relied on DataFrame filtering for each edge check
Separate logic for directed and undirected cases

# Directed case: an edge counts only in the Gene1 -> Gene2 orientation
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[(trueEdgesDF['Gene1'] == key.split('|')[0]) &
                           (trueEdgesDF['Gene2'] == key.split('|')[1])]) > 0:
        TrueEdgeDict[key] = 1

# Undirected case: an edge counts in either orientation
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[((trueEdgesDF['Gene1'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene2'] == key.split('|')[1])) |
                           ((trueEdgesDF['Gene2'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene1'] == key.split('|')[1]))]) > 0:
        TrueEdgeDict[key] = 1

New Implementation:

Converts DataFrames to sets and dictionaries for faster lookups
Creates dictionaries on-the-fly while iterating through possible edges
Uses set membership and dictionary lookups instead of DataFrame filtering
Unifies logic for directed and undirected cases

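# Build the lookup set once; each membership test is then O(1) on average
true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
# edge_generator yields a (gene1, gene2) tuple for every candidate edge
for edge in edge_generator:
    key = '|'.join(edge)
    # For undirected networks, also accept the reversed pair
    TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))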
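For anyone who wants to try the idea outside the pipeline, here is a minimal self-contained sketch. The helper name `build_edge_dicts`, the toy gene names, the `EdgeWeight` column, and the choice to take absolute weights (and the larger of the two orientations in the undirected case) are assumptions made for illustration. This is not the exact code in computeDGAUC.py.

```python
from itertools import combinations, permutations

import pandas as pd


def build_edge_dicts(true_df, pred_df, genes, directed=True):
    """Label each candidate edge as true (1) or not (0) and attach a predicted weight.

    Hypothetical helper: column names 'Gene1', 'Gene2', 'EdgeWeight' are assumed.
    """
    # One pass over each DataFrame builds hash-based lookup structures.
    true_edges = set(zip(true_df["Gene1"], true_df["Gene2"]))
    pred_weights = {
        (g1, g2): abs(w)
        for g1, g2, w in zip(pred_df["Gene1"], pred_df["Gene2"], pred_df["EdgeWeight"])
    }

    # Ordered pairs for directed networks, unordered pairs otherwise
    # (self-edges are left out of this sketch).
    edge_generator = permutations(genes, 2) if directed else combinations(genes, 2)

    true_dict, pred_dict = {}, {}
    for edge in edge_generator:
        key = "|".join(edge)
        rev = edge[::-1]
        true_dict[key] = int(edge in true_edges or (not directed and rev in true_edges))
        weight = pred_weights.get(edge, 0.0)
        if not directed:
            # Assumption: an undirected edge takes the larger of its two orientations.
            weight = max(weight, pred_weights.get(rev, 0.0))
        pred_dict[key] = weight
    return true_dict, pred_dict


# Toy data, invented for the example.
true_df = pd.DataFrame({"Gene1": ["A", "C"], "Gene2": ["B", "A"]})
pred_df = pd.DataFrame({"Gene1": ["A", "B"], "Gene2": ["B", "C"], "EdgeWeight": [0.9, -0.4]})
genes = ["A", "B", "C"]

directed_true, directed_pred = build_edge_dicts(true_df, pred_df, genes, directed=True)
undirected_true, undirected_pred = build_edge_dicts(true_df, pred_df, genes, directed=False)
```

In the directed run only `A|B` and `C|A` are labelled true. In the undirected run `A|C` is also labelled true because the reversed pair `(C, A)` is in the set.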

Key Improvements:

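Performance: The old version runs a boolean-mask filter over the entire true-edge DataFrame for every candidate edge, so its cost grows with (number of candidate edges) × (number of true edges). The new version builds a set once and then does constant-time (on average) membership checks, which makes it significantly faster, especially for large datasets.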
Scalability: Performance gains become more pronounced as the size of the input data increases, making it better suited for large-scale genetic network analyses.
Code Readability: The new version is more concise with less repeated code, improving maintainability.
Memory Usage: While it might use slightly more memory upfront, this trade-off results in substantial runtime performance benefits.
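The performance claim can be checked locally with a rough timing sketch like the one below. The synthetic network size, the gene names, and the two helper functions are assumptions for illustration. No timings are quoted here because they depend on hardware and data.

```python
import random
import timeit
from itertools import permutations

import pandas as pd

random.seed(0)
genes = [f"G{i}" for i in range(40)]      # synthetic gene names (assumed)
all_edges = list(permutations(genes, 2))  # every directed candidate edge
true_df = pd.DataFrame(random.sample(all_edges, 200), columns=["Gene1", "Gene2"])


def label_with_filtering():
    # Old style: one boolean-mask scan of the DataFrame per candidate edge.
    return {
        f"{g1}|{g2}": int(
            len(true_df.loc[(true_df["Gene1"] == g1) & (true_df["Gene2"] == g2)]) > 0
        )
        for g1, g2 in all_edges
    }


def label_with_set():
    # New style: build the set once, then hash-based membership tests.
    true_edges = set(zip(true_df["Gene1"], true_df["Gene2"]))
    return {f"{g1}|{g2}": int((g1, g2) in true_edges) for g1, g2 in all_edges}


t_filter = timeit.timeit(label_with_filtering, number=1)
t_set = timeit.timeit(label_with_set, number=1)
print(f"DataFrame filtering: {t_filter:.3f}s, set lookup: {t_set:.3f}s")
```

Both functions return the same labels, so the comparison measures only the lookup strategy. Increasing the number of genes makes the gap easier to see.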

Why It's Better:

Faster execution times, especially crucial for large genetic networks
More efficient handling of edge lookups and checks
Better scalability for growing datasets
Improved code structure for easier maintenance and future enhancements

These optimizations maintain the same functionality while providing substantial performance enhancements, making our genetic network analysis more efficient and capable of handling larger datasets.

Murali-group / Beeline

Faster way of performing AUC Evaluations on larger datasets. #126
