Refactor Feateng1 to see the feature dependency easier.

jylee-bcm commented 1 month ago

Previously, the code of feature engineering part was written in a single function with lots of dependency. This PR split the function into several pieces so that you can see which the feature dependency easier.

Output Test

I validated with the demo data that the output is identical.

Speed Test

I concerned the speed could be degenerated due to the increased number of steps But it was not substantial. (2m 15s -> 2m 25s)

Evident Dependency

We can see simpler dependency than before.

def f1(row):
    return getAnnotateInfoRow_3_1(row, genomeRef)

def f2(row):
    if "curate" not in moduleList:
        return row
    return getAnnotateInfoRow_3_2(row, decipherSortedDf)

def f3(row):
    if "conserve" not in moduleList:
        return row
    return getAnnotateInfoRow_3_3(row, gnomadMetricsGeneSortedDf)

def f4(row):
    if "curate" not in moduleList:
        return row
    return getAnnotateInfoRow_3_4(row, omimGeneSortedDf)

def f5(row):
    if "curate" not in moduleList:
        return row
    return getAnnotateInfoRow_3_5(row, clinvarGeneDf, clinvarAlleleDf)

def f6(row):
    if "curate" not in moduleList:
        return row
    return getAnnotateInfoRow_3_6(
        row, hgmdHPOScoreDf
    )

annotateInfoDf = vepDf.apply(f1, axis=1, result_type='expand')
df = annotateInfoDf.apply(f2, axis=1, result_type='expand')
annotateInfoDf[df.columns] = df
df = annotateInfoDf.apply(f3, axis=1, result_type='expand')
annotateInfoDf[df.columns] = df
df = annotateInfoDf.apply(f4, axis=1, result_type='expand')
annotateInfoDf[df.columns] = df
df = annotateInfoDf.apply(f5, axis=1, result_type='expand')
annotateInfoDf[df.columns] = df
df = annotateInfoDf.apply(f6, axis=1, result_type='expand')
annotateInfoDf[df.columns] = df
return annotateInfoDf

jylee-bcm commented 1 month ago

You can see the diff better here:

https://www.diffchecker.com/F6kKbny3/

hyunhwan-bcm commented 1 month ago

Running speed might be not a big problem, but I wonder that the memory usage has a drastically change (in a good way)

hyunhwan-bcm commented 1 month ago

You can see the diff better here:

https://www.diffchecker.com/F6kKbny3/

Sorry I don't think this diff is much better

jylee-bcm commented 1 month ago

Running speed might be not a big problem, but I wonder that the memory usage has a drastically change (in a good way)

Sorry I already cleaned the trace. But it is not expected to improve any performance, because this change was for readability, not the performance.

hyunhwan-bcm commented 1 month ago

Running speed might be not a big problem, but I wonder that the memory usage has a drastically change (in a good way)

Sorry I already cleaned the trace.

But it is not expected to improve any performance, because this change was for readability, not the performance.

That was what you first mentioned - even a simple but different loop structure it could optimize memory usage. You stated this multiple times, not one so I believe checking the memory should happen here.

LiuzLab / AI_MARRVEL