LiuzLab / AI_MARRVEL

AI-MARRVEL (AIM) is an AI system for rare genetic disorder diagnosis
GNU General Public License v3.0

Optimize bin/feature.py for memory usage and processing time #54

Closed hyunhwan-bcm closed 2 months ago

hyunhwan-bcm commented 3 months ago

Description

The bin/feature.py script currently consumes substantial memory and runs slowly. We need to optimize it, focusing on pandas usage and on reducing the processing time of specific functions.

Current Performance Issues

Memory Usage

  1. Pandas consumes more memory than the file size after pd.read_csv operations.
  2. Significant memory increments observed for various dataframe operations.
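A common first step for the read_csv overhead is dtype tuning: downcasting numeric columns and converting low-cardinality string columns to `category`. The helper below is a hedged sketch, not code from `bin/feature.py`; the function name `load_compact` and the cardinality threshold are illustrative assumptions.

```python
import pandas as pd


def load_compact(path) -> pd.DataFrame:
    """Load a CSV and shrink its in-memory footprint (illustrative sketch)."""
    df = pd.read_csv(path)

    # Downcast floats (float64 -> float32 where values allow it).
    for col in df.select_dtypes(include="float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    # Downcast integers (int64 -> int8/int16/... where values allow it).
    for col in df.select_dtypes(include="integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    # Categorize string columns only when they are repetitive; the 50%
    # uniqueness cutoff here is an arbitrary illustrative choice.
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique(dropna=True) < 0.5 * len(df):
            df[col] = df[col].astype("category")

    return df
```

Object-dtype string columns are usually the biggest contributor to pandas using more memory than the on-disk file, so categorizing them tends to give the largest single win.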

Time Consumption

  1. The apply function on varDf takes 27.9% of the total time.
  2. The hgmdSymMatch function consumes 65.0% of the total time.
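Both hot spots are per-row Python calls (`apply` with `axis=1`, and `hgmdSymMatch` invoked 55k+ times). The usual pandas remedy is to replace the row-wise loop with one vectorized merge. The sketch below uses hypothetical column names (`varId`, `geneSym`, `hgmdScore`); the real signatures of `hgmdSymMatch`/`omimSymMatch` are not shown in this issue.

```python
import pandas as pd

# Hypothetical shapes: variants carry a gene symbol, and the HGMD HPO
# score table maps gene symbols to a precomputed phenotype score.
variants = pd.DataFrame({"varId": [1, 2, 3],
                         "geneSym": ["BRCA1", "TP53", "BRCA1"]})
hgmd_scores = pd.DataFrame({"geneSym": ["BRCA1", "TP53"],
                            "hgmdScore": [0.9, 0.7]})

# One left merge scores every variant at once, instead of 55k+
# Python-level calls to a matching function.
scored = variants.merge(hgmd_scores, on="geneSym", how="left")
```

If the matching logic is more complex than an exact key join (e.g. symbol aliases), precomputing an alias-to-score lookup table once and then merging still avoids the per-row function-call overhead.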
hyunhwan-bcm commented 3 months ago

Annotation file size

[ 204]  .
├── [ 272]  anno_hg19
│   ├── [4.7M]  decipher.csv
│   ├── [164M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
├── [ 272]  anno_hg38
│   ├── [4.7M]  decipher.csv
│   ├── [ 38M]  dgv.csv
│   ├── [1.8M]  gene_clinvar.csv
│   ├── [ 43M]  gene_omim.json
│   ├── [ 13M]  gnomad.v2.1.1.lof_metrics.by_gene.txt
│   └── [4.4M]  omim_alleric_variants.json
└── [8.8K]  feature_stats.csv

The running time profile

Timer unit: 1 s

Total time: 1169.54 s
File: /Users/hyun-hwanjeong/Workspaces/AI_MARRVEL/bin/feature.py
Function: main at line 47

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

   316         1        326.6    326.6     27.9          annotateInfoDf = varDf.apply(f, axis=1, result_type='expand')

   352         1         28.2     28.2      2.4              resDf = annotateInfoDf.apply(f, axis=1, result_type='expand')

   360     55531         27.5      0.0      2.3              omimSymMatch(varObj, omimHPOScoreDf, args.inFileType)

   361     55531        760.6      0.0     65.0              hgmdSymMatch(varObj, hgmdHPOScoreDf)

   428         1          1.2      1.2      0.1      score.to_csv("scores.csv", index=False)

The memory profile

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================

   108  121.605 MiB   35.859 MiB           1       gnomadMetricsGeneDf = pd.read_csv(fileName, sep="\t")

   136  127.469 MiB    5.863 MiB           1           omimHPOScoreDf = pd.read_csv(fileName, sep="\t")

   140  179.078 MiB   51.609 MiB           1           hgmdHPOScoreDf = pd.read_csv(fileName, sep="\t")

   153  185.281 MiB    6.203 MiB           1           clinvarGeneDf = pd.read_csv(fileName, sep=",")

   164  237.117 MiB   50.961 MiB           1               omimGeneList = json.load(f)

   207  604.121 MiB  356.621 MiB           1           dgvDf = pd.read_csv(fileName, sep=",")

   273 1221.754 MiB  661.543 MiB           2           varDf = pd.read_csv(

   304 1703.629 MiB  315.082 MiB       55532           def f(row):

   423 1538.184 MiB   27.973 MiB           1       score = load_raw_matrix(annotateInfoDf)

   425 1537.270 MiB   25.773 MiB           1       score = hgmdCurate(score)
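The per-line increments above can be cross-checked with pandas' own accounting: `memory_usage(deep=True)` counts the Python string objects behind object-dtype columns, which is where most of the gap between file size and resident memory hides. A generic illustration (not the actual annotation data):

```python
import pandas as pd

# A repetitive string column, typical of chromosome/gene-symbol fields.
df = pd.DataFrame({"chrom": ["chr1"] * 100_000, "pos": range(100_000)})

# deep=True includes the heap size of each Python string object.
obj_bytes = df.memory_usage(deep=True).sum()

# Converting the repetitive column to category shrinks it sharply.
df["chrom"] = df["chrom"].astype("category")
cat_bytes = df.memory_usage(deep=True).sum()
```

Running this comparison on `varDf` and `dgvDf` (the two largest increments above) would show how much of the 661 MiB and 356 MiB is recoverable by dtype conversion alone.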
jylee-bcm commented 2 months ago

Can I get an update regarding this issue? Did the recent PR #61 improve the memory usage and processing time?