KevinMenden / scaden

Deep Learning based cell composition analysis with Scaden.
https://scaden.readthedocs.io
MIT License

Simulate slow #90

Open · khkk378 opened this issue 3 years ago

khkk378 commented 3 years ago

Hi! I'm running simulate on 12 datasets and generating 250 pseudobulk samples, and I find it surprisingly slow: it takes on the order of 10 hours and about 50 GB of memory. Is this to be expected? I was looking into the code and found this section:

    # For each available cell type, draw samp_fracs[i] cells at random
    # (with replacement), collect them, then sum expression per gene.
    artificial_samples = []
    for i in range(no_avail_cts):
        ct = available_celltypes[i]
        cells_sub = x.loc[np.array(y["Celltype"] == ct), :]
        cells_fraction = np.random.randint(0, cells_sub.shape[0], samp_fracs[i])
        cells_sub = cells_sub.iloc[cells_fraction, :]
        artificial_samples.append(cells_sub)

    df_samp = pd.concat(artificial_samples, axis=0)
    df_samp = df_samp.sum(axis=0)

Could you not just keep a running sum instead of appending, concatenating and summing?
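
Roughly like this (untested sketch, using the same variables as in the snippet above):

    # Untested: accumulate the per-cell-type sums directly instead of
    # collecting DataFrames and concatenating them at the end.
    df_samp = 0
    for i in range(no_avail_cts):
        ct = available_celltypes[i]
        cells_sub = x.loc[np.array(y["Celltype"] == ct), :]
        cells_fraction = np.random.randint(0, cells_sub.shape[0], samp_fracs[i])
        df_samp = df_samp + cells_sub.iloc[cells_fraction, :].sum(axis=0)

That would also avoid holding all the drawn cells in memory at once.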

Cheers, Rasmus

KevinMenden commented 3 years ago

Hi,

yeah, it can be rather slow, although this seems like an extreme case. I would not expect it to take that long, honestly ... :thinking: Not quite sure what the issue is. Are those datasets large?

Good point, a running sum would work here too. I'm not sure how much it will speed up the simulation, though ... but maybe a little bit.

The data simulation can be made to use multiple cores relatively easily; that has been in my backlog for some time now, and I wanted to add it for the next release. It should speed up the simulation significantly. I just haven't found the time to do it yet ... maybe this weekend.
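
For the record, the rough shape of what I have in mind is something like the sketch below; simulate_sample is a placeholder here, not an actual scaden function:

    # Hypothetical sketch: build pseudobulk samples on several cores.
    # Each sample is independent, so they can be simulated in parallel.
    import multiprocessing as mp

    def simulate_sample(i):
        # placeholder for the code that builds one pseudobulk sample
        ...

    def simulate_parallel(n_samples, n_workers=4):
        with mp.Pool(processes=n_workers) as pool:
            return pool.map(simulate_sample, range(n_samples))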

If you want to make a PR to implement the running sum code, that would be highly appreciated :)

KevinMenden commented 3 years ago

#82