Closed: jessakay closed this issue 4 years ago
Everything seems to work nicely with typically sized Hi-C datasets, but when attempting to run on something larger (e.g., ~4e9 contacts genome-wide) with

    -eps 5000,10000 -minPts 50,100 -hic

the following sort of issue pops up:

    Clustering chr8 and chr8 finished. Estimated 43365022 self-ligation reads and 5506751 inter-ligation reads
    Traceback (most recent call last):
      File "/local/anaconda3/envs/cloops/bin/cLoops", line 8, in <module>
        sys.exit(main())
      File "/local/anaconda3/envs/cloops/lib/python2.7/site-packages/cLoops/pipe.py", line 352, in main
        hic, op.washU, op.juice, op.cut, op.plot, op.max_cut)
      File "/local/anaconda3/envs/cloops/lib/python2.7/site-packages/cLoops/pipe.py", line 250, in pipe
        dataI_2, dataS_2, dis_2, dss_2 = runDBSCAN(cfs, ep, m, cut, cpu)
      File "/local/anaconda3/envs/cloops/lib/python2.7/site-packages/cLoops/pipe.py", line 118, in runDBSCAN
        for f in fs)
      File "/local/anaconda3/envs/cloops/lib/python2.7/site-packages/joblib/parallel.py", line 789, in __call__
        self.retrieve()
      File "/local/anaconda3/envs/cloops/lib/python2.7/site-packages/joblib/parallel.py", line 699, in retrieve
        self._output.extend(job.get(timeout=self.timeout))
      File "/local/anaconda3/envs/cloops/lib/python2.7/multiprocessing/pool.py", line 572, in get
        raise self._value
    multiprocessing.pool.MaybeEncodingError: Error sending result: '[(('chr8', 'chr8'), 'hic/chr8-chr8.jd', ...

Based on scikit-learn/scikit-learn#8920, I wrapped all the Parallel() calls in pipe.py inside with-blocks using the "threading" back-end, and that seems to have gotten around the error. My question is whether this is the right way to go about the problem, given the parallel computing bugs mentioned in the README.

Dear User, I am glad to know cLoops can work nicely with typically sized Hi-C datasets for you. Sorry for the potential problem. Here are some points I want to mention.
Thank you for the suggestions. I went back to check and indeed there was an issue with memory usage: leaving joblib's back-end unchanged resulted in >700 GB of memory use (far in excess of the system limit), but only 125 GB after the change.
I've been processing each chromosome individually (i.e., splitting the genome-wide bedpe by chromosome), but this shouldn't affect the results, right?
Processing each chromosome individually will not affect the results; only the estimation of the self-ligation and inter-ligation cutoff will be different, and if you always set the distance cutoff to 0 the results will be exactly the same. If there is still a memory issue, maybe -cut 10000 can be used to remove some close PETs before calling loops. As you mentioned that changing joblib's back-end reduces memory a lot, could you please show me several lines of example code? Maybe I can implement your solution. Thank you. Best, Yaqiang
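For reference, a hypothetical sketch of one way the per-chromosome split mentioned above could be done; it assumes a plain tab-separated BEDPE whose first and fourth fields are the chromosomes of the two ends, and it skips inter-chromosomal pairs, which may differ from what was actually done here.

    # Hypothetical sketch, not the script actually used: split a genome-wide,
    # tab-separated BEDPE into per-chromosome files, assuming fields 1 and 4
    # hold the chromosome of each end; inter-chromosomal pairs are skipped.
    import sys

    def split_bedpe_by_chrom(path, prefix):
        handles = {}
        with open(path) as bedpe:
            for line in bedpe:
                fields = line.rstrip("\n").split("\t")
                chrom1, chrom2 = fields[0], fields[3]
                if chrom1 != chrom2:
                    continue  # keep only intra-chromosomal PETs
                if chrom1 not in handles:
                    handles[chrom1] = open("%s_%s.bedpe" % (prefix, chrom1), "w")
                handles[chrom1].write(line)
        for h in handles.values():
            h.close()

    if __name__ == "__main__":
        split_bedpe_by_chrom(sys.argv[1], sys.argv[2])

Each per-chromosome file can then be run through cLoops separately; as noted above, only the self-ligation/inter-ligation cutoff estimation differs unless a fixed distance cutoff is used.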
It involved just changing Parallel(n_jobs=cpu) to Parallel(n_jobs=cpu, backend='threading'), but the runtime seems to be a bit longer, though I haven't done any extensive testing.
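Since example code was requested above, here is a minimal sketch of that change, assuming joblib's Parallel/delayed interface as called from pipe.py; the worker function and its arguments below are hypothetical stand-ins, not the actual cLoops clustering code.

    # Hypothetical stand-in for the per-chromosome clustering step in cLoops;
    # the real call in pipe.py's runDBSCAN takes different arguments.
    from joblib import Parallel, delayed

    def cluster_one_file(f, eps, minPts):
        # placeholder worker: the real code runs DBSCAN on the PETs in file f
        return (f, eps, minPts)

    def run_all(fs, eps, minPts, cpu):
        # before: ds = Parallel(n_jobs=cpu)(delayed(cluster_one_file)(f, eps, minPts) for f in fs)
        # after: run Parallel as a context manager with the "threading" back-end
        with Parallel(n_jobs=cpu, backend="threading") as parallel:
            ds = parallel(delayed(cluster_one_file)(f, eps, minPts) for f in fs)
        return ds

    if __name__ == "__main__":
        print(run_all(["chr8-chr8.jd", "chr9-chr9.jd"], 5000, 50, cpu=2))

With the threading back-end, results stay in the parent process instead of being pickled back from worker processes, which fits the lower memory usage reported above; since Python threads share the GIL, CPU-bound clustering may run somewhat slower, which would also be consistent with the longer runtime observed.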
I will try it. Thank you. Yaqiang