[ 204] .
├── [ 272] anno_hg19
│ ├── [4.7M] decipher.csv
│ ├── [164M] dgv.csv
│ ├── [1.8M] gene_clinvar.csv
│ ├── [ 43M] gene_omim.json
│ ├── [ 13M] gnomad.v2.1.1.lof_metrics.by_gene.txt
│ └── [4.4M] omim_alleric_variants.json
├── [ 272] anno_hg38
│ ├── [4.7M] decipher.csv
│ ├── [ 38M] dgv.csv
│ ├── [1.8M] gene_clinvar.csv
│ ├── [ 43M] gene_omim.json
│ ├── [ 13M] gnomad.v2.1.1.lof_metrics.by_gene.txt
│ └── [4.4M] omim_alleric_variants.json
└── [8.8K] feature_stats.csv
Timer unit: 1 s
Total time: 1169.54 s
File: /Users/hyun-hwanjeong/Workspaces/AI_MARRVEL/bin/feature.py
Function: main at line 47
Line # Hits Time Per Hit % Time Line Contents
==============================================================
316 1 326.6 326.6 27.9 annotateInfoDf = varDf.apply(f, axis=1, result_type='expand')
352 1 28.2 28.2 2.4 resDf = annotateInfoDf.apply(f, axis=1, result_type='expand')
360 55531 27.5 0.0 2.3 omimSymMatch(varObj, omimHPOScoreDf, args.inFileType)
361 55531 760.6 0.0 65.0 hgmdSymMatch(varObj, hgmdHPOScoreDf)
428 1 1.2 1.2 0.1 score.to_csv("scores.csv", index=False)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
108 121.605 MiB 35.859 MiB 1 gnomadMetricsGeneDf = pd.read_csv(fileName, sep="\t")
136 127.469 MiB 5.863 MiB 1 omimHPOScoreDf = pd.read_csv(fileName, sep="\t")
140 179.078 MiB 51.609 MiB 1 hgmdHPOScoreDf = pd.read_csv(fileName, sep="\t")
153 185.281 MiB 6.203 MiB 1 clinvarGeneDf = pd.read_csv(fileName, sep=",")
164 237.117 MiB 50.961 MiB 1 omimGeneList = json.load(f)
207 604.121 MiB 356.621 MiB 1 dgvDf = pd.read_csv(fileName, sep=",")
273 1221.754 MiB 661.543 MiB 2 varDf = pd.read_csv(
304 1703.629 MiB 315.082 MiB 55532 def f(row):
423 1538.184 MiB 27.973 MiB 1 score = load_raw_matrix(annotateInfoDf)
425 1537.270 MiB 25.773 MiB 1 score = hgmdCurate(score)
Can I get an update regarding this issue? Did the recent PR #61 improve the memory usage and processing time?
Description

The `bin/feature.py` script currently consumes substantial memory and is slow to run. We need to optimize this script, focusing on pandas usage and on reducing the processing time of specific functions.

Current Performance Issues

Memory Usage
- Significant memory increments come from the `pd.read_csv` operations (see the first sketch below).

Time Consumption
- The `apply` call on `varDf` takes 27.9% of the total time.
- The `hgmdSymMatch` function consumes 65.0% of the total time (see the second sketch below).
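For the `pd.read_csv` increments, one direction worth trying is restricting columns and dtypes at read time. A minimal sketch, assuming hypothetical column names (`chr`, `start`, `end`, `gene`) and the hg38 DGV path; the real schema used by `feature.py` would need to be checked first:

```python
import pandas as pd

# Hypothetical sketch: cut pd.read_csv memory by loading only the needed
# columns and forcing compact dtypes. The path and the column/dtype names
# below are placeholders, not the actual schema in bin/feature.py.
fileName = "anno_hg38/dgv.csv"

usecols = ["chr", "start", "end", "gene"]   # keep only columns feature.py actually reads
dtypes = {
    "chr": "category",                      # few distinct values -> category
    "start": "int32",                       # avoid the default int64
    "end": "int32",
    "gene": "category",
}

dgvDf = pd.read_csv(fileName, sep=",", usecols=usecols, dtype=dtypes)
print(dgvDf.memory_usage(deep=True).sum() / 1e6, "MB")
```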
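For the time side, the profile suggests the per-row calls to `omimSymMatch`/`hgmdSymMatch` inside the `apply` loop dominate. A rough sketch of how such a per-row lookup could become a single merge, assuming hypothetical column names (`geneSymbol`, `hgmdGeneSym`, `hgmdSymbolScore`); the actual matching logic in `hgmdSymMatch` may be more involved:

```python
import pandas as pd

def hgmd_sym_scores(varDf: pd.DataFrame, hgmdHPOScoreDf: pd.DataFrame) -> pd.DataFrame:
    """Sketch: replace ~55k per-row scans of hgmdHPOScoreDf with one vectorized join."""
    # Collapse the HGMD HPO score table to one best score per gene symbol up front,
    # instead of filtering it again for every variant inside apply().
    best = (
        hgmdHPOScoreDf
        .groupby("hgmdGeneSym", as_index=False)["hgmdSymbolScore"]
        .max()
    )
    # One left join attaches the score to every variant row at once.
    out = varDf.merge(best, left_on="geneSymbol", right_on="hgmdGeneSym", how="left")
    out["hgmdSymbolScore"] = out["hgmdSymbolScore"].fillna(0.0)  # assumed no-match default
    return out
```

The same idea, pre-aggregating the lookup table and merging once, may also help the `apply` at line 316 that builds `annotateInfoDf`, if its per-row work is mostly table lookups.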