I have a few ideas, but won't be able to get to it for a little while. If you are willing, could you tell me:
1) What kind of data are you looking at?
2) How many items do you expect to get merged together on average (i.e., every 4 entries end up as 1 after merging)?
Thanks for replying.
You can download the data:
wget http://yiplab.cse.cuhk.edu.hk/jeme/encoderoadmap_lasso.zip
# unzip and test one file
head encoderoadmap_lasso.1.csv
chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49
chrX:100046800-100048000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.37
chrX:100128800-100130000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.39
chrX:99749000-99751000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.47
chrX:99851000-99853000,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99854000-99856200,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.68
chrX:99858800-99859600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.59
chrX:99863600-99865600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.57
chrX:99866800-99867800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
chrX:99868400-99868600,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.55
## convert to pgl format, add 2kb flanking for the promoter site
cat encoderoadmap_lasso.1.csv | tr "$" ":" | sed -E 's/(chr[0-9XY]+:[0-9]+-[0-9]+),.+(chr[0-9XY]+:[0-9]+):.+/\1,\2/' | tr ":,-" "\t" | awk -v OFS="\t" '{print $1,$2,$3,$4,$5-2000,$5+2000}' | pgltools formatbedpe | sed 's/[ \t]*$//' > encoderoadmap_lasso1.bedpe
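For reference, here is what that one-liner does to a single CSV record, sketched in Python (an illustration only; the shell pipeline above is what was actually run):

# Sketch of the transformation applied to one record of the JEME csv.
line = "chrX:100040000-100041800,ENSG00000000003.10$TSPAN6$chrX$99894988$-,0.49"
enhancer, gene, score = line.split(",")       # the score column is dropped, as in the awk step
chromA, span = enhancer.split(":")
startA, endA = span.split("-")
_, _, chromB, tss, _ = gene.split("$")
tss = int(tss)
# anchor B is the TSS with 2 kb of flanking sequence on each side
print("\t".join([chromA, startA, endA, chromB, str(tss - 2000), str(tss + 2000)]))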
If I merge all 127 bedpe files together, there are 3049673 rows in total.
## takes a very long time with 3 million rows, let me try something else...
cat *bedpe | pgltools sort | pgltools merge -stdInA > ENCODE_merged_interaction.bedpe
Or I can merge the files recursively: merge 1.bedpe with 2.bedpe, then merge the resulting bedpe with 3.bedpe, and so on; a sketch is below. To implement this recursive function in Python, I need the pygl.sort() API to work (see issue https://github.com/billgreenwald/pgltools/issues/4).
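A rough Python sketch of that pairwise fold, driving the same pgltools CLI calls shown above (assumes pgltools is on PATH and accepts stdin as in the pipeline; the file naming pattern follows the conversion step):

import glob
import subprocess
from pathlib import Path

# Fold the per-file bedpe sets into one merged set, one file at a time.
files = sorted(glob.glob("encoderoadmap_lasso*.bedpe"))
merged = Path(files[0]).read_text()
for nxt in files[1:]:
    combined = merged + Path(nxt).read_text()
    srt = subprocess.run(["pgltools", "sort"], input=combined,
                         capture_output=True, text=True, check=True)
    mrg = subprocess.run(["pgltools", "merge", "-stdInA"], input=srt.stdout,
                         capture_output=True, text=True, check=True)
    merged = mrg.stdout
Path("ENCODE_merged_interaction.bedpe").write_text(merged)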
Thanks! Tommy
A few quick things that may help, since I won't be able to make many changes to the implementation for about 2.5 weeks:
1) If you expect the same regions to pop up repeatedly, and you simply want a merged set with no duplicates, would running the sorted file through the Unix "uniq" command help?
2) Are you running this through CPython or through PyPy? PyPy would speed up the process a ton.
3) The reason the merge command is slow: since the pgl format has 2 discrete entries, merging two items in the list can cause a previous item to now be mergeable, even though at first it was not (a quick sketch can be found below). This requires going through the list of entries that have been merged together on each iteration, which greatly increases run time. If you only want to do an immediate merge, I could quickly add an option for this. Otherwise, there is probably an intelligent way to hash the data that could speed this up, but I don't have one in mind right now.
Also, just to put it here as well: the sort function in the Python implementation is pygl.pyglSort().
Sketch: Start with 3 entries:
#####----------#####--------
--#####--------------#####--
------#####--------#########
1A overlaps 2A, but not 3A. 1B overlaps 3B, but not 2B. Both anchors of entries 2 and 3 overlap, so entries 2 and 3 get merged, resulting in:
#####----------#####--------
--#########--------#########
Now entries 1 and 2 can be merged. This requires re-iterating through the list, which increases run time.
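In code, that re-iteration is essentially a fixed-point loop. A minimal pure-Python sketch of the idea (an illustration only, not the library's actual implementation; anchors are treated as closed intervals so the shared cells in the drawing count as overlap):

# Each entry is a pair of anchors: ((startA, endA), (startB, endB)).
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]        # closed-interval overlap

def mergeable(x, y):
    # both anchors must overlap for two entries to merge
    return overlaps(x[0], y[0]) and overlaps(x[1], y[1])

def merge_two(x, y):
    return tuple((min(p[0], q[0]), max(p[1], q[1])) for p, q in zip(x, y))

def merge_all(entries):
    entries = list(entries)
    changed = True
    while changed:                              # repeat until a fixed point
        changed = False
        out = []
        for e in entries:
            for i, o in enumerate(out):
                if mergeable(e, o):
                    out[i] = merge_two(e, o)    # widening may enable new merges
                    changed = True
                    break
            else:
                out.append(e)
        entries = out
    return entries

# The three entries from the sketch: 2 and 3 merge on the first pass,
# which then lets entry 1 merge in on the second pass.
e1 = ((0, 4), (15, 19))
e2 = ((2, 6), (21, 25))
e3 = ((6, 10), (19, 27))
print(merge_all([e1, e2, e3]))                  # [((0, 10), (15, 27))]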
Thanks for the illustration!
cat *bedpe | wc -l
3049673
cat *bedpe | cut -f1-6 | sort | uniq | wc -l
667490
This reduced the number of regions a lot!
I do not know much Python :) will check out PyPy.
I now understand the situation much better (thanks for the visual presentation). I think an immediate merge will help if it speeds things up a lot; then I can do multiple immediate merges to get the final merged file if necessary.
Learned a lot!
Best, Tommy
Will report back how long it takes for the ~670k regions. Memory is not an issue, though, right?
How much memory are you working with? PyPy will drop memory consumption as well, though it may hit a problem depending on how much you have. I haven't actually tried running merge on a file with hundreds of thousands of entries, but I have run coverage on files with millions of entries; it took a while, and it did use quite a bit of RAM.
I have 35G of RAM. Thanks, memory seems to be fine for now. I have an idea for speeding this up: one can split the file by chromosome, run pgltools merge per chromosome in parallel, and then concatenate the merged lists, assuming only intra-chromosomal interactions are in the data (similar to parallelizing mutation calling). A sketch is below.
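A rough Python sketch of that split-and-merge-per-chromosome approach, driving the same pgltools CLI calls as above (the input and output file names are assumptions):

import subprocess
from collections import defaultdict
from multiprocessing import Pool
from pathlib import Path

def merge_one(path):
    # sort, then merge, one chromosome at a time via the CLI used above
    data = Path(path).read_text()
    srt = subprocess.run(["pgltools", "sort"], input=data,
                         capture_output=True, text=True, check=True)
    mrg = subprocess.run(["pgltools", "merge", "-stdInA"], input=srt.stdout,
                         capture_output=True, text=True, check=True)
    return mrg.stdout

if __name__ == "__main__":
    # Bucket entries by the chromosome of the first anchor; with only
    # intra-chromosomal interactions, entries on different chromosomes
    # can never merge, so each bucket is independent.
    by_chrom = defaultdict(list)
    with open("all_interactions.bedpe") as fh:   # hypothetical combined file
        for line in fh:
            by_chrom[line.split("\t")[0]].append(line)
    paths = []
    for chrom, lines in by_chrom.items():
        path = "split.%s.bedpe" % chrom
        Path(path).write_text("".join(lines))
        paths.append(path)
    with Pool() as pool:                         # one worker per CPU core
        results = pool.map(merge_one, paths)
    Path("merged_all.bedpe").write_text("".join(results))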
That's a good idea; it will take a bit of time to implement. I will give it a go in a few weeks.
Splitting by chromosome speeds it up!! It takes only several minutes to merge. Thanks again! I have a write-up here: http://crazyhottommy.blogspot.com/2017/12/merge-enhancer-promoter-interaction.html
I'm going to close this and mark as a feature improvement for now.
I also ran into a problem when using pgltools merge. The input file looks like this:
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_1
chr8 37590000 37600000 chr8 38320000 38330000 A ERLIN2_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_1
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_2
chr8 37590000 37600000 chr8 38320000 38330000 B FGFR1_4
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_1
chr8 37590000 37600000 chr8 38330000 38340000 A ERLIN2_2
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_1
chr8 37590000 37600000 chr8 38340000 38350000 A ERLIN2_2
chr8 37590000 37600000 chr8 38350000 38360000 A ERLIN2_1
my command line: pgltools merge -a test.10.pgl -o collapse,distinct -c 7,8 > test.10.merge.pgl
The output file looks like this:
chrA startA endA chrB startB endB collapse_of_7 distinct_of_8
chr8 37590000 37600000 chr8 38320000 38360000 A,A,A,A,B,B,B,A,A,A ERLIN2_1,ERLIN2_2,FGFR1_1,FGFR1_2,FGFR1_4
My question is: why does pgltools merge two loops when only one anchor locus overlaps? I think it is only reasonable to merge loops in which both anchors overlap.
That would be because I think I let the edges count as overlapping when they are exactly the same, i.e., bp 1000 as a start overlaps bp 1000 as an end -- I can and should change that. I'll put a fix out this week.
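The difference is just the comparison in the anchor-overlap test; a toy illustration using the bins from the example above (not the library's actual code):

def overlaps_touching(a, b):
    return a[0] <= b[1] and b[0] <= a[1]   # old behavior: a shared edge counts

def overlaps_strict(a, b):
    return a[0] < b[1] and b[0] < a[1]     # fixed: a shared edge does not count

bin1 = (38320000, 38330000)
bin2 = (38330000, 38340000)
print(overlaps_touching(bin1, bin2))       # True  -> adjacent bins chain-merged
print(overlaps_strict(bin1, bin2))         # False -> adjacent bins stay separate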
Or if you have time, feel free to change it and put in a pull request.
Thanks for replying. I don't think I have the confidence to finish this work, but I will try. I will be very grateful if you can fix this. Best wishes
Don't worry about it if you can wait a few days.
That's all right. Please tell me when you finish. Thanks very much. Best wishes
I pushed a new change (version 2.1.5). Could you pull it down and test it and let me know if it works as intended?
I only updated merge for now, but if it runs as it should, I will add the fix to all the scripts and push another new version.
Thanks!
I have tested the new version, and it works as intended. But I found another question about the collapse and distinct parameters in merge. For example:
chr10 690000 700000 chr10 1100000 1110000 B,B,A,B 4_Strong_Enhancer_16337,PRR26_1,WDR37_1,WDR37_3
Looking at the relationship between collapse and distinct, I find that PRR26_1 should correspond to A, but it lines up with B in the result. I am confused. Thanks!
I find the ordering between the collapse and distinct columns is reversed.
To clarify first: "collapse" returns all annotations in their original order. "distinct" returns a unique set of the original annotations, as close to the original order as possible.
So, if four loops are merged, and their annotations originally were A, B, C, D, then collapse and distinct will both return A, B, C, D.
However, if the annotations were A, A, B, A, then collapse will return A, A, B, A, and distinct will return A,B.
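A toy sketch of those intended semantics (illustration only, not pgltools' implementation):

def collapse(values):
    return ",".join(values)            # every annotation, original order

def distinct(values):
    seen, out = set(), []
    for v in values:                   # first occurrence keeps its position
        if v not in seen:
            seen.add(v)
            out.append(v)
    return ",".join(out)

anns = ["A", "A", "B", "A"]
print(collapse(anns))                  # A,A,B,A
print(distinct(anns))                  # A,B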
I had an ordering function being called, but it wasn't doing what I wanted. I changed it and pushed a new version (2.2.0), so the behavior should now be as described above; can you let me know if it works?
I also updated all functions to handle the edge case you found earlier.
I have a pgl file with over 1.5 million rows, and pgltools merge is taking a very long time to run. Any way to speed it up?
Thanks for this useful tool!
Tommy