Murali-group / Beeline

BEELINE: evaluation of algorithms for gene regulatory network inference
GNU General Public License v3.0

Faster way of performing AUC Evaluations on larger datasets. #126

Open FaizalJnu opened 3 months ago

FaizalJnu commented 3 months ago

Description:

While working with the BEELINE datasets as part of GSoC, I had difficulty running the evaluation pipeline to generate AUC scores. The file in question is computeDGAUC.py in the BLEval folder. I have therefore implemented an optimized version of the computeScores function that runs significantly faster, especially on large gene regulatory networks. Here is a comparison of the old and new implementations:

Previous Implementation:

# Directed case: an edge counts only in the Gene1 -> Gene2 orientation.
# Every iteration filters the whole trueEdgesDF.
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[(trueEdgesDF['Gene1'] == key.split('|')[0]) &
                           (trueEdgesDF['Gene2'] == key.split('|')[1])]) > 0:
        TrueEdgeDict[key] = 1

# Undirected case: an edge counts in either orientation.
for key in TrueEdgeDict.keys():
    if len(trueEdgesDF.loc[((trueEdgesDF['Gene1'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene2'] == key.split('|')[1])) |
                           ((trueEdgesDF['Gene2'] == key.split('|')[0]) &
                            (trueEdgesDF['Gene1'] == key.split('|')[1]))]) > 0:
        TrueEdgeDict[key] = 1

New Implementation:

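# Build the ground-truth edge set once; membership tests are then hash lookups.
true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
for edge in edge_generator:  # each edge is a (Gene1, Gene2) tuple
    key = '|'.join(edge)
    # Undirected evaluation also accepts the reversed pair.
    TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))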
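For anyone who wants to try the snippet in isolation, here is a minimal self-contained sketch. The function name build_true_edge_dict, the toy ground truth, and the way edge_generator is built (itertools.permutations for directed, itertools.combinations for undirected) are my assumptions for illustration; they are not taken from computeDGAUC.py, which may construct its candidate edges differently.

```python
from itertools import combinations, permutations

import pandas as pd


def build_true_edge_dict(trueEdgesDF, genes, directed=True):
    """Label each candidate edge 1 if it is in the ground truth, else 0.

    Hypothetical wrapper around the set-based loop; the candidate edges
    below are an assumed stand-in for `edge_generator`.
    """
    # Assumption: ordered pairs when directed, unordered pairs otherwise.
    edge_generator = permutations(genes, 2) if directed else combinations(genes, 2)

    # One pass over the ground truth to build a set of (Gene1, Gene2) tuples.
    true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))

    TrueEdgeDict = {}
    for edge in edge_generator:
        key = '|'.join(edge)
        # Undirected evaluation also accepts the reversed pair.
        TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))
    return TrueEdgeDict


# Toy ground truth: A -> B and C -> A.
toy_truth = pd.DataFrame({'Gene1': ['A', 'C'], 'Gene2': ['B', 'A']})
toy_genes = ['A', 'B', 'C']

directed_labels = build_true_edge_dict(toy_truth, toy_genes, directed=True)
undirected_labels = build_true_edge_dict(toy_truth, toy_genes, directed=False)
```

In the directed call only A|B and C|A are labelled 1. In the undirected call A|C is also labelled 1, because the reversed pair (C, A) is in the ground truth.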

Key Improvements:

Why It's Better:

These optimizations preserve the original behavior while replacing a full DataFrame scan per candidate edge with a single set lookup, which makes the AUC evaluation practical on larger gene regulatory networks.
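A quick way to check that the two approaches agree is to run both on the same toy input and compare the resulting dictionaries. The sketch below is only an illustration: the function names, gene names, and edges are made up, and the old approach is rewritten with a boolean mask and .any() rather than copied verbatim.

```python
from itertools import permutations

import pandas as pd


def label_by_dataframe_filter(trueEdgesDF, keys, directed):
    """Previous approach: one boolean-mask scan of the DataFrame per key."""
    labels = {}
    for key in keys:
        g1, g2 = key.split('|')
        mask = (trueEdgesDF['Gene1'] == g1) & (trueEdgesDF['Gene2'] == g2)
        if not directed:
            # Undirected: also match the reversed orientation.
            mask = mask | ((trueEdgesDF['Gene1'] == g2) & (trueEdgesDF['Gene2'] == g1))
        labels[key] = int(mask.any())
    return labels


def label_by_set_lookup(trueEdgesDF, keys, directed):
    """New approach: build the edge set once, then do hash lookups."""
    true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
    labels = {}
    for key in keys:
        edge = tuple(key.split('|'))
        labels[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))
    return labels


# Hypothetical toy network: six genes, three ground-truth edges.
genes = [f'G{i}' for i in range(6)]
toy_truth = pd.DataFrame({'Gene1': ['G0', 'G1', 'G3'],
                          'Gene2': ['G1', 'G2', 'G0']})
keys = ['|'.join(pair) for pair in permutations(genes, 2)]

for directed in (True, False):
    old = label_by_dataframe_filter(toy_truth, keys, directed)
    new = label_by_set_lookup(toy_truth, keys, directed)
    assert old == new, f'mismatch for directed={directed}'
```

A check like this covers only the inputs it is given, so it is a sanity test rather than a proof of equivalence.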