Open PeterMakus opened 2 years ago
Dear Peter,
Apologies for answering only now, I didn't receive a notification about your open issue.
I suspect that you encounter an OutOfMemoryError because you are using the default settings of the Sequencer, which include several different distance estimators and scales. For each distance estimator, the Sequencer constructs several distance matrices, one per segment, as defined by the scale you are interested in. This results in a large number of 100,000*100,000 matrices, which can cause an OutOfMemoryError.
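To get a sense of the scale involved (my own back-of-the-envelope numbers, not from the thread): a single dense 100,000*100,000 matrix of float64 values already occupies about 80 GB on its own:

```python
import numpy as np

n_objects = 100_000
bytes_per_float = np.dtype(np.float64).itemsize  # 8 bytes per float64

# Memory footprint of one dense pairwise-distance matrix:
matrix_bytes = n_objects * n_objects * bytes_per_float
print(matrix_bytes / 1e9)  # ~80 GB per matrix
```

With several estimators and several scales, multiple matrices of this size have to coexist in memory, which can exceed even a large-memory node.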
I suggest that you start with a simpler run, where you apply the Sequencer assuming a single distance metric and a single scale, e.g.:
estimator_list = ['L2']
scale_list = [[1]]
seq = sequencer.Sequencer(grid, objects_list, estimator_list, scale_list)
In this example, you apply the Sequencer using only the Euclidean distance, and using a single scale that does not break your object into segments. In this case, the code will construct a single 100,000*100,000 matrix, so there should not be any memory issues.
If this works, then you will need to identify the best distance metrics and scales for your data before applying the Sequencer to the entire data. My suggestion would be to select a random subset of your data (make sure that you sample it randomly in order to avoid various biases), e.g., 5000 objects out of the 100,000. Apply the Sequencer to this subset, and examine all four distance metrics and many different scales, e.g.:
estimator_list = ['EMD', 'energy', 'KL', 'L2']
scale_list = [[1, 2, 4, 8, 16, 32], [1, 2, 4, 8, 16, 32], [1, 2, 4, 8, 16, 32], [1, 2, 4, 8, 16, 32]]
seq = sequencer.Sequencer(grid, objects_list_sampled, estimator_list, scale_list)
where objects_list_sampled is the subset of the data.
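A minimal way to draw such a random subset (a sketch; I assume here that objects_list is a 2-D NumPy array of shape (n_objects, n_samples)):

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility

n_objects = 100_000  # full dataset size
n_subset = 5_000     # size of the random subset

# Placeholder data standing in for the real objects_list
objects_list = rng.random((n_objects, 500))

# Sample row indices uniformly without replacement to avoid selection bias
subset_idx = rng.choice(n_objects, size=n_subset, replace=False)
objects_list_sampled = objects_list[subset_idx]

print(objects_list_sampled.shape)  # (5000, 500)
```

Sampling without replacement guarantees that no object appears twice in the subset.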
Use the resulting elongations to identify the best metrics and scales (see basic_sequencer_functionalities.ipynb notebook in the examples directory). Once you have identified the best metrics and scales for the subset of the data, apply the Sequencer to the full dataset using these best metrics and scales. I expect that the number of metrics and scales would be smaller than the default number, allowing you to execute the code without running into memory errors.
Please let me know if you still run into memory errors.
Dear Dalya, Thanks a lot for the thorough response. I will attempt that and let you know whether it worked!
Hi again, I have been experimenting a bit, and even using only one core, one estimator, and one scale (as you suggested) leads to an OutOfMemoryError. Is there any other approach I could still take?
Best, Peter
Dear Peter,
Do you know how much memory you have on this single core? Given the OutOfMemoryError, I can think of two things to try: (1) use more than one core, or (2) reduce the size of your array, just to check at what point you are able to run the code without memory issues; this will let you estimate how many cores / how much memory you need to apply the code with a single scale and a single distance metric.
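One way to probe point (2) is to attempt runs on increasing subset sizes until memory runs out. A rough sketch (the run_sequencer callable here is a hypothetical stand-in for whatever applies the Sequencer with a single metric and scale):

```python
import numpy as np

def find_max_feasible_size(objects_list, run_sequencer,
                           sizes=(1_000, 5_000, 10_000, 25_000, 50_000)):
    """Return the largest subset size that runs without a MemoryError.

    run_sequencer is a user-supplied callable (hypothetical) that applies
    the Sequencer to the given subset of objects.
    """
    rng = np.random.default_rng(0)
    max_ok = 0
    for n in sizes:
        idx = rng.choice(len(objects_list), size=n, replace=False)
        try:
            run_sequencer(objects_list[idx])
            max_ok = n  # this size fit in memory
        except MemoryError:
            break  # this size (and anything larger) does not fit
    return max_ok
```

Extrapolating from the largest feasible size gives a rough estimate of the memory needed for the full dataset.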
I would also like to note that for most datasets, a sequence can be detected with a smaller number of objects (e.g., 1,000 - 10,000). Once a sequence is detected, one can think about an optimized algorithm to sort the sources according to it, which is tailored exactly to the dataset in hand and is not as general as the sequencer. I would therefore advise you to start with a subset of your data and to check whether there is any interesting trend that comes up. If you are able to find an interesting trend, then you could think about a simpler, more memory-efficient, algorithm to sort the rest of the objects.
Dalya.
Using more than one core won't make more memory available (I've got about 800GB of memory, so actually plenty).
Thanks for your input, I might try with a subset!
On a related note, is there any implementation of the algorithm for large datasets described in Section A.3 of the paper? It says there is documentation on the Github, but I failed to find it.
@andrew-saydjari, I am sorry if we indicated somewhere in our paper or Github that we have an implementation of the faster computation described in section A.3. We, unfortunately, do not have something that can be shared (I have all the code, but it is not readable to anyone who isn't me…). Can you please let me know where it was written, so I will remove it?
In section A.3, the version of the paper downloaded off of ApJ reads "For more details regarding the implementation of this approximate and faster version of the code, see the companion Github repository: https://github.com/dalya/Sequencer/."
Hi!
I would like to apply the Sequencer to a dataset of about 100,000 objects, each containing 500 samples. When I start the Sequencer, I encounter an OutOfMemoryError (even though I am working on an HPC with almost 1 TB of memory available for my computation). Do you have any suggestions on how I could make the process more efficient (downsampling is not really an option)?
Thanks for your help!
Cheers, Peter