chriscainx / mnnpy

An implementation of MNN (mutual nearest neighbors) batch correction in Python.
BSD 3-Clause "New" or "Revised" License

Error during cosine normalization step #4

Open LucasESBS opened 6 years ago

LucasESBS commented 6 years ago

Hello,

First, thank you for this awesome implementation of MNN in Python, great work! I was looking forward to trying it and comparing it with Seurat's CCA on my 5 batches, but I got the following error when running the mnn_correct function:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-35-cf9c40cb7f9e> in <module>()
     11 sc.pp.log1p(week11_LV)
     12 
---> 13 corrected = mnnpy.mnn_correct(week11_IVS, week11_LA, week11_RA, week11_RV, week11_LV,var_subset=subset_genes, batch_categories = ["EF11W4D_IVS", "EF11W4D_LA", "EF11W4D_RA","EF11W4D_RV","EF11W4D_LV"])
     14 adata = corrected[0]

~/anaconda3/lib/python3.6/site-packages/mnnpy/mnn.py in mnn_correct(var_index, var_subset, batch_key, index_unique, batch_categories, k, sigma, cos_norm_in, cos_norm_out, svd_dim, var_adj, compute_angle, mnn_order, svd_mode, do_concatenate, save_raw, n_jobs, *datas, **kwargs)
    124                                 cos_norm_out=cos_norm_out, svd_dim=svd_dim, var_adj=var_adj,
    125                                 compute_angle=compute_angle, mnn_order=mnn_order,
--> 126                                 svd_mode=svd_mode, do_concatenate=do_concatenate, **kwargs)
    127         print('Packing AnnData object...')
    128         if do_concatenate:

~/anaconda3/lib/python3.6/site-packages/mnnpy/mnn.py in mnn_correct(var_index, var_subset, batch_key, index_unique, batch_categories, k, sigma, cos_norm_in, cos_norm_out, svd_dim, var_adj, compute_angle, mnn_order, svd_mode, do_concatenate, save_raw, n_jobs, *datas, **kwargs)
    155     in_batches, out_batches, var_subset, same_set = transform_input_data(datas, cos_norm_in,
    156                                                                          cos_norm_out, var_index,
--> 157                                                                          var_subset, n_jobs)
    158     if mnn_order is None:
    159         mnn_order = list(range(n_batch))

~/anaconda3/lib/python3.6/site-packages/mnnpy/utils.py in transform_input_data(datas, cos_norm_in, cos_norm_out, var_index, var_subset, n_jobs)
     59         else:
     60             with Pool(n_jobs) as p_n:
---> 61                 out_scaling = p_n.map(l2_norm, datas)
     62                 out_scaling = [scaling[:, None] for scaling in out_scaling]
     63                 out_batches = p_n.starmap(scale_rows, zip(datas, out_scaling))

~/anaconda3/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    264         in a list that is returned.
    265         '''
--> 266         return self._map_async(func, iterable, mapstar, chunksize).get()
    267 
    268     def starmap(self, func, iterable, chunksize=None):

~/anaconda3/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
    642             return self._value
    643         else:
--> 644             raise self._value
    645 
    646     def _set(self, i, obj):

~/anaconda3/lib/python3.6/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    422                         break
    423                     try:
--> 424                         put(task)
    425                     except Exception as e:
    426                         job, idx = task[:2]

~/anaconda3/lib/python3.6/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(_ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

~/anaconda3/lib/python3.6/multiprocessing/connection.py in _send_bytes(self, buf)
    391         n = len(buf)
    392         # For wire compatibility with 3.2 and lower
--> 393         header = struct.pack("!i", n)
    394         if n > 16384:
    395             # The payload is large so Nagle's algorithm won't be triggered

error: 'i' format requires -2147483648 <= number <= 2147483647 

I used 500 highly variable genes; is that too many? Could the error be caused by something else? In general, how does the computational time scale with the number of genes selected? It would be nice to have an idea of a good number of HVGs to include for a decent running time without losing too much information.

Thank you, and once again congrats for your work!

chriscainx commented 6 years ago

Thanks for the encouragement Lucas!

I didn't encounter this problem during my tests, which looks odd... It is definitely not related to the number of HVGs, though; in my case I used >5000 HVGs and no error occurred. This looks more like a Python internals issue, maybe something to do with the Numba JIT system. I will look into this and keep you posted.

chriscainx commented 6 years ago

Regarding time consumption: if you choose to input all the genes (g) but subset with the HVGs (h), the algorithm effectively runs twice (m*n*g + m*n*h, with m and n the numbers of cells in batch 1 and batch 2), because the whole gene space needs to be projected using the MNNs calculated on h. Finding the MNNs uses only h and is fast, so when h is small the majority of the time is spent on the all-gene computation, and changing h will not affect the running time very much.
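As a rough back-of-the-envelope illustration of that scaling (all sizes below are made up, not taken from this issue):

```python
# Hypothetical batch/gene sizes, only to illustrate the m*n*g + m*n*h argument above.
m, n = 4000, 5000              # cells in batch 1 and batch 2
g = 20000                      # genes carried through the correction
h_small, h_large = 500, 2000   # two possible HVG subset sizes

def cost(m, n, g, h):
    # Cost model from the comment above: all-gene correction plus MNN search on the HVGs.
    return m * n * g + m * n * h

print(cost(m, n, g, h_small) / cost(m, n, g, h_large))
# ~0.93 -> quadrupling h changes the total work by only a few percent
```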

If you want to speed it up, a good plan would be to subset the genes to the HVGs first and then not use var_subset at all, since the other genes are typically unnecessary in later steps (PCA, clustering) anyway... In that case the time will be m*n*h, which I think is the most efficient.
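A minimal sketch of that plan with scanpy (the variable names and the HVG-selection call are illustrative, not prescribed by mnnpy):

```python
import scanpy as sc
import mnnpy

# adata1, adata2: per-batch AnnData objects, already normalized and log-transformed.
combined = adata1.concatenate(adata2)
sc.pp.highly_variable_genes(combined, n_top_genes=2000)
hvgs = combined.var_names[combined.var["highly_variable"]]

# Subset each batch to the HVGs *before* correction and pass no var_subset,
# so everything runs in the h-dimensional space (cost ~ m*n*h).
corrected = mnnpy.mnn_correct(adata1[:, hvgs].copy(),
                              adata2[:, hvgs].copy(),
                              batch_categories=["batch1", "batch2"])
adata = corrected[0]
```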

Actually, what are the specs of your machine? It's possible that the problem has something to do with them.

chriscainx commented 6 years ago

Hi Lucas, could you update to 0.1.9.3 and check if the error occurs? Thank you!

LucasESBS commented 6 years ago

Thank you for your reply! Sorry for not getting back to you earlier. I tried updating to the newest version, but the error is unfortunately still there. The specs of my machine are:

- OS: Linux x86_64 GNU/Linux
- MemTotal: 527818508 kB
- CPU: Intel(R) Xeon(R) Gold 6132 @ 2.60GHz (56)

Hope that helps. Thanks again for your help!

cfriedline commented 6 years ago

I ran into this as well. It appears to be a multiprocessing issue. If I set mnnpy.settings.normalization = "seq", things proceed past the cosine normalization step.
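For reference, the workaround is just one setting before the call; a minimal sketch (the AnnData variable names are placeholders):

```python
import mnnpy

# Run cosine normalization sequentially instead of through a multiprocessing Pool,
# so the large batch matrices never have to be pickled and pushed through a pipe
# (the struct.pack("!i", n) overflow in the traceback happens during that send step).
mnnpy.settings.normalization = "seq"

# adata_a, adata_b, adata_c: placeholder names for the per-batch AnnData objects.
corrected = mnnpy.mnn_correct(adata_a, adata_b, adata_c,
                              var_subset=hvg_list,
                              batch_categories=["a", "b", "c"])
adata = corrected[0]
```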

I have 3 AnnData objects:

AnnData object with n_obs × n_vars = 15992 × 41861
AnnData object with n_obs × n_vars = 13998 × 41861
AnnData object with n_obs × n_vars = 13325 × 41861

I can get it to run in parallel if I subset the AnnData objects:

corrected = mnnpy.mnn_correct(a2data[0:1000], b2data[0:1000], b3data[0:1000], var_subset=hvg, batch_categories=["a2", "b2", "c2"])
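That is consistent with the error itself: struct.pack("!i", n) in Python 3.6's multiprocessing writes a signed 32-bit length header, so sending any single pickled payload of roughly 2 GiB or more fails with exactly this message. A quick diagnostic sketch (not part of mnnpy) to check whether a full batch is over that limit:

```python
import pickle

def pickled_gib(x):
    # Rough proxy for the payload multiprocessing would have to send for one batch.
    return len(pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)) / 2**30

for name, batch in [("a2", a2data), ("b2", b2data), ("c2", b3data)]:
    # Payloads of ~2 GiB or more overflow the 32-bit length header in Python 3.6.
    print(name, round(pickled_gib(batch.X), 2), "GiB")
```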