billgreenwald / pgltools

Paired Genomic Loci Tool Suite
BSD 3-Clause "New" or "Revised" License
29 stars 5 forks source link

performance for pgltools merge #3

Closed crazyhottommy closed 2 years ago

crazyhottommy commented 6 years ago

I have a pgl file with over 1.5 million rows, pgltools merge is taking very long time to run. Any way to speed it up?

Thanks for this useful tool!

Tommy

billgreenwald commented 6 years ago

I have a few ideas, but wont be able to get to it for a little while. If you are willing could you tell me :

1) what kind of data are you looking at 2) how many items do you expect to get merged together on average (ie every 4 entries will end up as 1 after merging)

crazyhottommy commented 6 years ago

thanks for replying.

  1. those are chromatin interaction data. In Reconstruction of enhancer–target networks in 935 samples of human primary cells, tissues and cell lines. They have 127 interaction data from ENCODE. I was trying to merge all 127 files together to make a super-set

you can download the data:

wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip

#unzip and test one file
head encoderoadmap_lasso.1.csv 
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55

## convert to pgl format, add 2kb flanking for the promoter site

cat   encoderoadmap_lasso.1.csv | tr "$" ":" |  sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe |  sed 's/[ \t]*$//'> encoderoadmap_lasso1.bedpe

If I merge all 127 bedpe together, total 3049673 rows.

## takes very long time, let me try something else..., 3 million rows!
cat *bedpe | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe
  1. I expect to see some overlaps between different files, but not sure how frequent are they, if they are from the same cell type, most of them will be overlapping since enhancers are cell type specific.

Or I can merge the files recursively. 1.bedpe merge with 2.bedpe, and then the resulted bedpe merge with 3.bedpe...continue...

To implement this recursive function in python, I need the pygl.sort() API to work.( in this issue https://github.com/billgreenwald/pgltools/issues/4)

Thanks! Tommy

billgreenwald commented 6 years ago

A few quick things that may help, since I won't be able to do much change to the implementation for about 2.5 weeks:

1) If you expect the same regions popping up, and you simply want to get a merged set with no duplicates, would running the sorted file through the unix "uniq" command help? 2) Are you running this through Cpython or through pypy? Pypy would help speed up the process a ton 3) The reason that the merge command is slow is, since the pgl format has 2 discrete entries, merging two items in the list can cause a previous item to be now be mergable, even though at first it was not (quick sketch can be found below). This requires going through the list of the entries that have been merged together each iteration, which greatly boosts run time. If you only want to do an immediate merge, I could quickly add an option for this. Otherwise, there is probably an intelligent way to hash the data that could speed this up, but I don't have one in mind right now.

Also, just to put it here as well, the sort function in the python implementation is pygl.pyglSort()

Sketch: Start with 3 entries:

#####----------#####--------
--#####--------------#####--
------#####--------#########

1A overlaps 2A, but not 3A. 1B overlaps 3B, but not 2B. 2 and 3 overlap. Entry 2 and 3 get merged, resulting in:

#####----------#####--------
--#########--------#########

Now entries 1 and 2 can be merged. This requires re-iterating through the list, which boosts run time

crazyhottommy commented 6 years ago

Thanks for the illustration!

  1. Man! this is smart, I should have done this first! initially I thought they will not be exactly the same regions...but some bases off, apparently, I was wrong:
 cat *bedpe | wc -l
3049673
cat *bedpe | cut -f1-6 | sort | uniq | wc -l
667490

This reduced the region number a lot!

  1. I do not know much python :) will check Pypy

  2. I now understand the situation much better (thanks for the visual presentation). I think a immediate merge will help if it speed up a lot. then I can do multiple immediate merge to get the final merged files if necessary.

Learned a lot!

Best, Tommy

crazyhottommy commented 6 years ago

will report back how long it takes for the 60k regions. memory is not an issue though?

billgreenwald commented 6 years ago

How much memory are you working with? Pypy will drop memory consumption as well, though it may hit a problem depending on how much you have. I haven't practically tried running merge on a file with hundreds of thousands of entries, but I have run coverage on files with millions of entries. it took a while, but it did use quite a bit of RAM.

crazyhottommy commented 6 years ago

I have 35G RAM. Thanks, memory seems to be fine now. I have an idea of speeding up. one can split the files by chromosome, and run pgltools merge by chromosome in parallel and then merge the final list. assume only intra-chromsomomal interactions are in the data. (similar to calling mutations)

billgreenwald commented 6 years ago

Thats a good idea; will take a bit of time to implement. I will give it go in a few weeks

crazyhottommy commented 6 years ago

splitting by chr speeds up!! takes only several minutes to merge. Thanks again! I had a write-up here http://crazyhottommy.blogspot.com/2017/12/merge-enhancer-promoter-interaction.html

billgreenwald commented 6 years ago

I'm going to close this and mark as a feature improvement for now.

StayHungryStayFool commented 5 years ago

I also meet some problem when I use pgltools merge. input file like this: chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1 chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4 chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1 chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2 chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1 chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2 chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1

my command line: pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 > test.10.merge.pgl

output file like this:

chrA startA endA chrB startB endB collapse_of_7 distinct_of_8

chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4 my question is why pgtools merge two loops when just one archor loci was overlapping. I think that is reasonable merging loop which both anchor were overlaping.

billgreenwald commented 5 years ago

That would be because I think I let the edges count as overlapping when they are exactly the same. ie, bp 1000 as a start overlaps bp 1000 as an end -- I can and should change that. I'll put a fix out this week on it.

Or if you have time, fell free to change it and put in a pull request.

StayHungryStayFool commented 5 years ago

thanks for replying I think I have no confidence to finish this work, but I will try. I will be very grateful if you can fix this. Best wish

billgreenwald commented 5 years ago

Don't worry about it if you can wait a few days.

On Sat, Jul 13, 2019, 7:14 PM LiZhe notifications@github.com wrote:

I also meet some problem when I use pgltools merge. input file like this: chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1 chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2 chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4 chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1 chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2 chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1 chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2 chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1

my command line: pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 > test.10.merge.pgl

output file like this:

chrA startA endA chrB startB endB collapse_of_7 distinct_of_8

chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4 my question is why pgtools merge two loops when just one archor loci was overlapping. I think that is reasonable merging loop which both anchor were overlaping.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/billgreenwald/pgltools/issues/3?email_source=notifications&email_token=ADDS5EI7XAI4BA725V552BLP7KDZ5A5CNFSM4EIZWYU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ34XIA#issuecomment-511167392, or mute the thread https://github.com/notifications/unsubscribe-auth/ADDS5EKF7YIAX43ZA44XBXTP7KDZ5ANCNFSM4EIZWYUQ .

StayHungryStayFool commented 5 years ago

That's all right. Please tell me when you finish, please. Thanks very much. Best whish

billgreenwald commented 5 years ago

I pushed a new change (version 2.1.5). Could you pull it down and test it and let me know if it works as intended?

I only updated merge right now, but if it runs as it should now, I will add the fix to all the scripts and then push another new version.

Thanks!

StayHungryStayFool commented 5 years ago

I have tested the new version, and it works as intended. But I find another quetion about parameter collapse and distinct in merge. like this: chr10 690000 700000 chr10 1100000 1110000 B,B,A,B 4_Strong_Enhancer_16337,PRR26_1,WDR37_1,WDR37_3 The relationship between collapse and distinct , I find PRR26_1 should be in A, but in B in the result . I am confused. Tanks!

StayHungryStayFool commented 5 years ago

I have tested the new version, and it works as intended. But I find another quetion about parameter collapse and distinct in merge. like this: chr10 690000 700000 chr10 1100000 1110000 B,B,A,B 4_Strong_Enhancer_16337,PRR26_1,WDR37_1,WDR37_3 The relationship between collapse and distinct , I find PRR26_1 should be in A, but in B in the result . I am confused. Tanks!

I find the sequence between collapse and distinct is reversed.

billgreenwald commented 5 years ago

To clarify first: "collapse" return all annotations in their original order. "distinct" returns a unique set of original annotations, as close to the original order as possible.

So, if four loops are merged, and their annotations originally were A, B, C, D, then collapse and distinct will both return A, B, C, D.

However, if the annotations were A, A, B, A, then collapse will return A, A, B, A, and distinct will return A,B.

I had an ordering function being called but it wasn't doing what I wanted. I changed it and pushed a new version (2.2.0) so that the behavior should be as above; can you let me know if it works?

I also updated all functions to include the edge case you found earlier.