bagherilab / SWING

Sliding Window Inference for Network Generation

Memory requirements #23

Open joegage opened 4 years ago

joegage commented 4 years ago

Hi - I am interested in generating networks with SWING using transcriptome data from maize. I have ~40k genes and 12 timepoints. In initial testing I am running into memory constraints on a machine with 512 GB of memory during the create_windows() step. Parameters are set as follows, based mostly on the vignette. I am currently only testing SWING, so they may change in the future:

k_min = 1
k_max = 3
w = 9
method = 'RandomForest'

If I set k_max to 1, I can make it through two or three windows before running out of memory. With k_max = 3, I am unable to complete the first step ('nth_window': 3).

Do you have a feeling for how much memory is required for a dataset of this size? Have you run similarly large datasets with SWING before?

Thanks, Joe

justinfinkle commented 4 years ago

I love that you're using it for maize! If I recall correctly, the largest network we tested it on was an in silico network of 1K genes. Unfortunately, with a genome of that size you're definitely going to hit scalability problems. The search space grows roughly as 2^n, so at 40K genes the full network inference problem is likely intractable without a smarter algorithm.
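To give a rough sense of scale, here is a back-of-the-envelope sketch in Python. It is not a description of how SWING allocates memory internally; it only assumes that some step stores one 8-byte float per candidate directed edge in a dense gene-by-gene matrix. The function name `dense_edge_matrix_gb` is made up for this illustration.

```python
def dense_edge_matrix_gb(n_genes, bytes_per_value=8):
    """Size in GB (10^9 bytes) of one dense n x n matrix of edge scores.

    Assumption for illustration only: one float64 per gene pair.
    """
    return n_genes * n_genes * bytes_per_value / 1e9


for n in (1_000, 10_000, 40_000):
    candidate_edges = n * (n - 1)  # directed edges, no self-loops
    print(
        f"{n:>6} genes: {candidate_edges:>13,} candidate directed edges, "
        f"{dense_edge_matrix_gb(n):6.2f} GB per dense float64 matrix"
    )
```

Under that assumption, a single such matrix is about 12.8 GB at 40K genes versus about 0.8 GB at 10K genes, and any per-window or per-lag copies would multiply that figure.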

I'd suggest you do a time-series analysis to rank genes by how dynamically they change, and remove those that aren't changing over time. With 40K genes, you almost certainly have many that don't change significantly over time and can be removed.
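One simple way to sketch the ranking-and-filtering idea above is to score each gene by the variance of its expression across timepoints and keep the top N. This is only a crude stand-in, not the method from the paper mentioned in this thread, and it assumes the values are already normalized (for example, log-transformed). The function names and the toy data are hypothetical.

```python
import random
import statistics


def rank_genes_by_variability(expression):
    """Return gene names sorted by variance across timepoints, highest first.

    `expression` maps gene name -> list of expression values, one per timepoint.
    Variance is an assumed proxy for "how dynamically a gene changes".
    """
    scores = {gene: statistics.pvariance(values) for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


def keep_top_genes(expression, n_keep):
    """Keep only the n_keep most variable genes."""
    kept = set(rank_genes_by_variability(expression)[:n_keep])
    return {gene: values for gene, values in expression.items() if gene in kept}


# Toy data with 12 timepoints per gene (made up for illustration).
random.seed(0)
toy = {
    "flat_gene": [5.0] * 12,
    "noisy_gene": [5.0 + random.gauss(0, 0.05) for _ in range(12)],
    "dynamic_gene": [float(t) for t in range(12)],
}

filtered = keep_top_genes(toy, n_keep=2)
print(sorted(filtered))
```

For a real dataset, the same ranking could be done more efficiently with NumPy or pandas, and a test designed for time series would be a better criterion than raw variance, which does not distinguish a trend from noise.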

Check out our other paper for how we think about gene expression changing over time; it may give you some ideas on how to filter out genes.

joegage commented 4 years ago

Thanks for the suggestions, Justin! I've been able to get it to run with 10k genes, which should be in the right ballpark for a filtered set of genes that are actually changing over time.