colomemaria / epiScanpy

Episcanpy: Epigenomics Single Cell Analysis in Python
BSD 3-Clause "New" or "Revised" License

Extremely long runtime bld_mtx_fly() #57

Closed le-ander closed 4 years ago

le-ander commented 4 years ago

Hey,

I am having some problems with the extremely long runtime of the bld_mtx_fly() function on my scATAC dataset. I have a 10x dataset with 2.5k cells and am trying to build a feature matrix from the fragments.tsv file using a geneid annotation BED file with 33k features.

I have added a tqdm progress bar to get an idea of the performance.

After 10 minutes, this is what I see:

0%|          | 11/32774 [09:37<617:20:22, 67.83s/it]

This is telling me an expected runtime of more than 3 weeks, which seems far too long considering this statement in the function docstring: "Expected running time for 10k cells X 100k features on a personal computer ~65min".
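
For reference, the estimate can be reproduced from the numbers in the progress bar. This is only a quick sanity check, assuming the remaining time is the number of remaining features multiplied by the displayed seconds per iteration:

```python
# Values copied from the progress bar above.
total_features = 32774
done_features = 11
seconds_per_feature = 67.83  # rounded value displayed by tqdm

# Remaining time = remaining iterations * time per iteration.
remaining_seconds = (total_features - done_features) * seconds_per_feature
remaining_hours = remaining_seconds / 3600
remaining_days = remaining_hours / 24

print(f"{remaining_hours:.0f} hours, about {remaining_days:.1f} days")
```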

When I use 10kb windows from the make_windows() function, I get over 2 million features and an expected runtime of several years:

0%|          | 111/2399976 [55:19<46456:43:07, 69.69s/it]  

Any suggestions on how I could still use this function and get started with my ATAC analysis?

Thanks a lot and best regards, Leander

DaneseAnna commented 4 years ago

This is way too long. I have no clue why right now, let me think about it.

le-ander commented 4 years ago

I just copied the file to the local disk of the server I am running this on, instead of continuously fetching it from a network drive. It helped a bit, but not very much.

0%|          | 11/32774 [07:51<398:27:50, 43.78s/it]

le-ander commented 4 years ago

I can't seem to find a way to speed it up right now. Any ideas? 🤔

DaneseAnna commented 4 years ago

I am working on it right now. I ran the same script on my computer and the server. For some reason it finished on my computer but not on the server. So I am trying to figure out what is going wrong.

le-ander commented 4 years ago

I just tried it on my laptop as well and it does not seem to be any faster than on the server:

0%|          | 11/32774 [24:08<1558:32:58, 171.25s/it]

My fragments.tsv file has around 140 million lines.

le-ander commented 4 years ago

I think I found the problem: https://github.com/colomemaria/epiScanpy/blob/70e568e1cf1ca247168a57f0058d60627bc28b8b/episcanpy/count_matrix/_bld_atac_mtx.py#L86 This line goes through the whole barcode list for every single fragment. Avoiding this repeated iteration speeds up matrix building approximately 10,000-fold.

I have rewritten the function and will open a pull request shortly.
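
To illustrate the idea, here is a minimal sketch. It is not the actual epiScanpy code or the rewritten function; the helper names and the toy data are hypothetical. The slow variant scans the barcode list once per fragment, while the fast variant builds a barcode-to-row dictionary once and then does a constant-time lookup per fragment:

```python
from typing import Dict, Iterable, List, Tuple

# A fragment is (chrom, start, end, barcode), as in a 10x fragments.tsv row.
Fragment = Tuple[str, int, int, str]


def count_slow(fragments: Iterable[Fragment], barcodes: List[str]) -> List[int]:
    """Hypothetical slow variant: scans the whole barcode list per fragment."""
    counts = [0] * len(barcodes)
    for _chrom, _start, _end, barcode in fragments:
        for row, known in enumerate(barcodes):  # O(number of barcodes) each time
            if known == barcode:
                counts[row] += 1
    return counts


def count_fast(fragments: Iterable[Fragment], barcodes: List[str]) -> List[int]:
    """Hypothetical fast variant: one dictionary lookup per fragment."""
    barcode_to_row: Dict[str, int] = {bc: i for i, bc in enumerate(barcodes)}
    counts = [0] * len(barcodes)
    for _chrom, _start, _end, barcode in fragments:
        row = barcode_to_row.get(barcode)  # O(1) on average
        if row is not None:  # skip fragments from barcodes not in the list
            counts[row] += 1
    return counts


# Toy data for illustration only.
barcodes = ["AAAC", "AAAG", "AAAT"]
fragments = [
    ("chr1", 100, 250, "AAAC"),
    ("chr1", 300, 420, "AAAT"),
    ("chr2", 50, 180, "AAAC"),
    ("chr2", 500, 640, "CCCC"),  # barcode not in the list, ignored
]

slow_counts = count_slow(fragments, barcodes)
fast_counts = count_fast(fragments, barcodes)
```

Both variants return the same counts. With around 140 million fragments and 2.5k barcodes, the inner loop of the slow variant would run on the order of 10^11 times, which is where the dictionary lookup saves most of the time.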

DaneseAnna commented 4 years ago

Okay, thanks!