cerebis / bin3C

Extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
GNU Affero General Public License v3.0

Likely "out of memory" exception when analysing a large problem space. #17

Closed orctyr closed 5 years ago

orctyr commented 5 years ago

Hi,

Log file:

DEBUG    | 2019-07-16 12:39:38,312 |    main | bin3C v0.1.1
DEBUG    | 2019-07-16 12:39:38,313 |    main | 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 19:04:19)  [GCC 7.3.0]
DEBUG    | 2019-07-16 12:39:38,313 |    main | Command line: /lustre/sdb/zengl/software/bin3C/bin3C.py cluster --no-spades -v contact_map.p.gz bin3c_clust
INFO     | 2019-07-16 12:39:38,313 |    main | Generated random seed: 9935159
INFO     | 2019-07-16 12:39:38,313 |    main | Loading existing contact map from: contact_map.p.gz
9935159
bin3c_clust
<mzd.contact_map.ContactMap instance at 0x1521978ad440>
DEBUG    | 2019-07-16 12:40:23,768 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5
DEBUG    | 2019-07-16 12:40:23,781 | mzd.contact_map | Using existing mask
INFO     | 2019-07-16 12:40:23,782 | mzd.contact_map | Preparing sequence map with full dimensions: (856104, 856104)
DEBUG    | 2019-07-16 12:40:25,946 | mzd.contact_map | Doing site based normalisation
DEBUG    | 2019-07-16 12:40:26,538 | mzd.contact_map | Map normalized
DEBUG    | 2019-07-16 12:40:26,538 | mzd.contact_map | Balancing contact map
WARNING  | 2019-07-16 12:40:30,344 | mzd.sparse_utils | treating 19 zeros on diagonal as ones
Traceback (most recent call last):
  File "/lustre/sdb/zengl/software/bin3C/bin3C.py", line 205, in <module>
    clustering = cluster_map(cm, method='infomap', seed=args.seed, work_dir=args.OUTDIR)
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 156, in cluster_map
    g = to_graph(contact_map, min_len=min_len, min_sig=min_sig, norm=True, bisto=True, scale=True)
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 309, in to_graph
    contact_map.prepare_seq_map(norm=norm, bisto=bisto)
  File "/lustre/sdb/zengl/software/bin3C/mzd/contact_map.py", line 914, in prepare_seq_map
    _map, scl = self._bisto_seq(_map)
  File "/lustre/sdb/zengl/software/bin3C/mzd/contact_map.py", line 1075, in _bisto_seq
    _map, scl = sparse_utils.kr_biostochastic(_map)
  File "/lustre/sdb/zengl/software/bin3C/mzd/sparse_utils.py", line 124, in kr_biostochastic
    if not is_hermitian(m, tol):
  File "/lustre/sdb/zengl/software/bin3C/mzd/sparse_utils.py", line 21, in is_hermitian
    print (np.abs(m - m.H) >= tol).todense()
  File "/lustre/sdb/taoye/miniconda3/envs/py2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 846, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/lustre/sdb/taoye/miniconda3/envs/py2.7/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "/lustre/sdb/taoye/miniconda3/envs/py2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
cerebis commented 5 years ago

Hi orctyr, thanks for taking the time to report the error. It is quite late in Sydney, so I’ll have to look at this tomorrow. It would help greatly if you could provide me with a test case that produces the error you are seeing.

orctyr commented 5 years ago

"contact_map.p.gz", generated in step 1, is ready, but it is quite large (35 MB). Please give me an e-mail address and I will send it to you directly.

cerebis commented 5 years ago

Neither of my email accounts will accept attachments that large, but perhaps that is not what you're proposing.

We could use something such as Dropbox or Drive, or perhaps wetransfer.com.

matt.demaere@gmail.com

cerebis commented 5 years ago

Now that I've had a chance to look over this exception trace (what's happening under the hood in numpy) I believe you are either running out of memory or running up against a technical addressing limit.
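To make the failure concrete: the last frames of the trace show `todense()` ending in `np.zeros(self.shape, ...)`, that is, a dense array covering the full map. The sketch below estimates the size of that allocation for the dimensions in the log, and shows one way a symmetry test can stay in sparse storage. It is not bin3C's code; `dense_bytes` and `is_symmetric_sparse` are hypothetical helper names, the matrix is assumed to be real-valued (so `.T` stands in for `.H`), and `tol` is assumed to be positive.

```python
import numpy as np
import scipy.sparse as sp


def dense_bytes(shape, dtype):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    n_rows, n_cols = shape
    return n_rows * n_cols * np.dtype(dtype).itemsize


def is_symmetric_sparse(m, tol=1e-6):
    """Symmetry test that never leaves sparse storage.

    Counts the stored entries of |m - m.T| that are at or above tol,
    instead of converting the comparison result to a dense matrix.
    tol must be positive, otherwise the comparison itself becomes dense.
    """
    diff = abs(m - m.T)
    return (diff >= tol).nnz == 0


# Dimensions reported in the log above.
shape = (856104, 856104)
print("dense bool:    %.0f GB" % (dense_bytes(shape, np.bool_) / 1e9))
print("dense float64: %.0f GB" % (dense_bytes(shape, np.float64) / 1e9))

# Small demonstration of the sparse-only check on random data.
a = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
print(is_symmetric_sparse(a + a.T))  # symmetric by construction
print(is_symmetric_sparse(a))        # random matrix, not symmetric
```

Even at one byte per element the dense array comes to roughly 733 GB, so the `MemoryError` is expected on almost any machine regardless of how sparse the map is.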

cerebis commented 5 years ago

Hi orctyr,

I have a few recommendations for you, in hopes that it improves your results using bin3C.

  1. First, after looking at the distribution of contigs lengths in your map, I suggest you try raising the minimum contig length to 2000 and see if you can successfully process your dataset. This would nearly halve the size of your problem space.

    With >856,000 sequences, even with sparse matrices you will need a very large amount of memory to run bin3C. For example, if only 1% of matrix elements are non-zero, you would need ~58GB of memory for just the raw values. Overhead within bin3C will mean much more than this minimum is needed, unfortunately.

  2. You have set the minimum signal level to 1. The default is 5 as sufficient signal is necessary for reliable clustering results in the face of noise. I would recommend you try higher values than 1, which will have two benefits. a. the support for the association between contigs will improve. b. the size of your problem space will decrease.

  3. You are using the master branch, but the latest codebase for bin3C is actually on the pgtk branch. Though it is not necessary, a major change in this branch is that it now builds native binaries for external C/C++ tools which are used (such as the clustering tool infomap).

    If you are interested, it can be installed directly from github.

    Assuming pip points to your Python 2 environment:

    pip uninstall -y bin3C && pip install cython && pip install git+https://github.com/cerebis/bin3C@pgtk
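
The memory figure in recommendation 1 can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: `sparse_value_bytes` is a hypothetical helper, 8-byte values are assumed, and the index overhead assumes a CSR-style layout with 8-byte column indices, which is itself an assumption about how the map is stored.

```python
def sparse_value_bytes(n_seqs, density, bytes_per_value=8):
    """Memory for the stored values alone of an n_seqs x n_seqs sparse matrix."""
    n_nonzero = n_seqs * n_seqs * density
    return n_nonzero * bytes_per_value


n_seqs = 856104  # matrix dimension from the log
values_gb = sparse_value_bytes(n_seqs, density=0.01) / 1e9
print("values only: %.1f GB" % values_gb)

# A sparse format also stores where each value lives. Assuming one 8-byte
# column index per stored value (an assumption, not a measured figure):
index_gb = sparse_value_bytes(n_seqs, density=0.01, bytes_per_value=8) / 1e9
print("values + column indices: %.1f GB" % (values_gb + index_gb))
```

The first figure comes to about 58.6 GB, in line with the ~58GB quoted above. Because the count of non-zero elements scales with the square of the number of sequences, halving the number of sequences cuts this estimate to roughly a quarter at the same density.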
orctyr commented 5 years ago

Thanks @cerebis! I used a large-memory machine and changed the minimum signal level to 5. It works now. The cause was indeed the memory limit. Thanks again.

cerebis commented 5 years ago

Great to hear.

orctyr commented 5 years ago

Another question, about the output. I only get the mcl file; the png and the cluster information file were not produced.

Part of log file:

inspecting clusters: 97%|█████████▋| 38887/40192 [01:53<00:00, 1388.88it/s]
inspecting clusters: 99%|█████████▊| 39594/40192 [01:53<00:00, 1388.90it/s]
inspecting clusters: 100%|██████████| 40192/40192 [01:54<00:00, 352.16it/s]
9479794
bin3c_clust_t1
<mzd.contact_map.ContactMap instance at 0x14b7a1410560>
[[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
Traceback (most recent call last):
  File "/lustre/sdb/zengl/software/bin3C/bin3C.py", line 215, in <module>
    write_report(os.path.join(args.OUTDIR, 'cluster_report.csv'), clustering)
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 490, in write_report
    _n50(sr['length']),
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 479, in _n50
    return x[x.cumsum() > x.sum() / 2][0]
IndexError: index 0 is out of bounds for axis 0 with size 0

cerebis commented 5 years ago

When parsing the clusters, bin3C also looks at the FastA sequence. The sequence data is not stored within the contact_map.p.gz object, only the path to the original file.

If you move the contact map, you need to reference the FastA on the command line using the --fasta [path-to-file] option.

orctyr commented 5 years ago

Hi, @cerebis

I added the --fasta option, but the error remains...

inspecting clusters: 100%|██████████| 39141/39141 [02:07<00:00, 306.66it/s]
Traceback (most recent call last):
  File "/lustre/sdb/zengl/software/bin3C/bin3C.py", line 215, in <module>
    write_report(os.path.join(args.OUTDIR, 'cluster_report.csv'), clustering)
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 490, in write_report
    _n50(sr['length']),
  File "/lustre/sdb/zengl/software/bin3C/mzd/cluster.py", line 479, in _n50
    return x[x.cumsum() > x.sum() / 2][0]
IndexError: index 0 is out of bounds for axis 0 with size 0
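
The `IndexError` in this trace comes from indexing element 0 of an empty selection: `x[x.cumsum() > x.sum() / 2]` selects nothing when the array of lengths is empty or sums to zero. The sketch below is a guarded N50 under those assumptions. It is not the fix that was applied in bin3C; the function name `n50` and the choice to return `None` are illustrative only, and the descending sort is the usual N50 convention rather than something shown in the trace.

```python
import numpy as np


def n50(lengths):
    """N50 of a set of sequence lengths, or None if there is nothing to measure."""
    x = np.sort(np.asarray(lengths, dtype=np.int64))[::-1]  # longest first
    total = x.sum()
    if x.size == 0 or total == 0:
        # The unguarded expression x[x.cumsum() > x.sum() / 2][0] raises
        # IndexError here, because the boolean mask selects nothing.
        return None
    # First length at which the running total reaches half of the sum.
    return int(x[x.cumsum() * 2 >= total][0])


print(n50([100, 200, 300, 400]))
print(n50([]))
```

Returning `None` lets a report writer emit an empty field for a cluster whose sequence lengths are unavailable, rather than aborting the whole report.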

cerebis commented 5 years ago

Ok, sorry you're encountering problems. It is possible to disable some of the output steps, though this is obviously not ideal.

--no-report (skip generating the report)
--only-large (only output cluster sequences for clusters >50kb)
--no-plot (don't create a heatmap)

I would really prefer to resolve this issue. To do so, I believe I would need both your pickled contact map and the reference fasta sequence.

orctyr commented 5 years ago

@cerebis

It would be much better to get all outputs for further analysis, especially cluster_report.csv and the plot png. Right now I only get the mcl file, and of course I can extract sequences from the fasta and mcl files. But the relationship between contigs within a cluster cannot currently be obtained, which is a pity.

cerebis commented 5 years ago

If you could provide me with the map and fasta, I expect I can resolve this issue.

If not, could you describe the workflow you have used? The problem might be more universal, and a test case might show up the same problem.

orctyr commented 5 years ago

The input and temp files are really large. For now I will modify the assembly file and change the sequence IDs to SPAdes-like ones. I will let you know, @cerebis, once I get the results.

cerebis commented 5 years ago

I believe I have found the problem and I’m just testing the changes. You won’t need to send me anything. :-)

cerebis commented 5 years ago

You should now be able to run bin3C with only the --no-spades flag enabled.

cerebis commented 5 years ago

I'm closing this issue now; please reopen it if you continue to have the same problem.

davidcalfran commented 4 years ago

Dear cerebis, I am having a similar problem. I managed to generate the contact map, but it is 58 MB, and when I run it in VirtualBox (Ubuntu) I get the same error: MemoryError.

Could you specify how to decrease the size of the ContactMap? During the assembly by itself or by adding any specific flag in the commands?

Thank you very much for your attention.
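
For context on this question: the recommendations earlier in the thread (a higher minimum contig length and a higher minimum signal) reduce the dimension of the matrix that gets processed, not the size of the contact map file. A minimal sketch of that kind of masking follows. It is not bin3C's implementation: `filter_map` is a hypothetical function, and "signal" is assumed here to mean a contig's total contacts with other contigs.

```python
import numpy as np
import scipy.sparse as sp


def filter_map(m, lengths, min_len=1000, min_sig=5):
    """Drop contigs that are too short or have too little Hi-C signal.

    m        -- square sparse matrix of raw contact counts
    lengths  -- contig lengths, one per row of m
    Returns the reduced matrix and the boolean acceptance mask.
    """
    m = sp.csr_matrix(m)
    lengths = np.asarray(lengths)
    # Assumed definition of signal: contacts with *other* contigs only.
    off_diag = np.asarray(m.sum(axis=1)).ravel() - m.diagonal()
    mask = (lengths >= min_len) & (off_diag >= min_sig)
    keep = np.flatnonzero(mask)
    return m[keep][:, keep], mask


# Toy map: 4 contigs, symmetric counts.
counts = sp.csr_matrix(np.array([[9, 6, 0, 1],
                                 [6, 8, 0, 0],
                                 [0, 0, 7, 0],
                                 [1, 0, 0, 5]]))
lengths = [5000, 2500, 3000, 800]

reduced, mask = filter_map(counts, lengths, min_len=1000, min_sig=5)
print(mask)           # which contigs survive
print(reduced.shape)  # dimension of the reduced problem
```

In the toy example the third contig is long enough but has no contacts with any other contig, and the fourth is too short, so the working matrix shrinks from 4 x 4 to 2 x 2.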