Murali-group / Beeline

BEELINE: evaluation of algorithms for gene regulatory network inference
GNU General Public License v3.0

Improving SINGE compatibility #33

Open atuldeshpande opened 4 years ago

atuldeshpande commented 4 years ago

Hi All, Thanks for using SINGE (rebranded from SCINGE) in this paper. We read your paper recently, and some of the benchmarking figures are visually very appealing. We have been updating SINGE quite a lot in the past couple of months and wanted to update you on those changes. The recent versions also add guidance about parameter combinations and ensembling. This includes an updated default_hyperparameters.txt file, which ensures that a new user who runs SINGE on first instinct will replicate the hyperparameters used in our paper. For users who would like to generate their own hyperparameter files, we have added a few scripts to do so in the scripts folder.

Based on your workflow, as well as dynverse, we were inspired to support our own Docker image, which is nearing completion. We'd like to coordinate so that BEELINE users can have access to the latest version of SINGE as we continue our development.

By the way, one key attribute of SINGE is the aggregation stage applied to the generated ensemble. Do you think providing the aggregated scores in matrix form (in addition to the ranked lists) would make it easier for users to evaluate SINGE using BEELINE? (I realize that the last couple of points may also be pertinent to #15.)
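To make the matrix-form question concrete, here is a minimal sketch of the two representations. The file name and columns (Gene1, Gene2, Score) are illustrative assumptions, not SINGE's actual output format:

```python
# Convert a ranked edge list into a dense regulator-by-target score matrix.
import pandas as pd

# Ranked list: one row per scored regulator-target pair (assumed layout).
ranked = pd.read_csv("ranked_edges.csv")

# Matrix form: rows are regulators, columns are targets, entries are scores.
matrix = ranked.pivot(index="Gene1", columns="Gene2", values="Score").fillna(0.0)
matrix.to_csv("aggregated_scores.csv")
```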

tmmurali commented 4 years ago

Hi Atul, Nice to get your feedback. We will be glad to coordinate with you and update the version of SINGE. Please work with @adyprat on this. I have two questions/notes:

  1. Has the spelling of your method changed from SCINGE to SINGE? If so, we will need to update our paper.
  2. We have implemented a few ensemble approaches, including the approach you used. That code is in a branch and will soon be in the main branch. We do not find that these ensembles improve on the best-performing algorithm in a systematic way.

atuldeshpande commented 4 years ago

Thanks.

  1. Yes, the spelling has indeed changed to SINGE now. We have been propagating these spelling changes wherever we can.
  2. That's interesting. For methods like SINGE, which include ensembling in the default workflow, do you perform two levels of ensembling in these tests?

tmmurali commented 4 years ago

Thanks. We will update the name in our paper.

We did a parameter search to optimise the AUPRC or early precision, ran SINGE with these parameter values, and then used the results in the ensemble. So there was only one level of ensembling. @adyprat Please correct me if I am wrong.

atuldeshpande commented 4 years ago

We did a parameter search to optimise the AUPRC or early precision, ran SINGE with these parameter values, and then used the results in the ensemble.

So if I understand correctly, you choose the best-performing hyperparameter set for SINGE and then apply the ensemble approach? Our approach to SINGE ensembling is slightly different. Our belief is that even the best hyperparameter set only sees a partial, inaccurate network. The steps involving the hyperparameter ensemble and the SINGE_Aggregate function (called Modified Borda Aggregation in v0.1.0) aim to combine multiple partial, inaccurate networks to rank the more consistently seen edges higher than the less frequently seen ones. I would say that SINGE as a method involves both the multiple GLG tests and the Modified Borda Aggregation on the results of those tests.
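As an aside for readers, here is a minimal sketch of Borda-style rank aggregation. It is not the actual SINGE_Aggregate implementation, which modifies the Borda weighting, but it illustrates how edges seen consistently across the ensemble rise above edges seen in only a few hyperparameter settings:

```python
# Borda-style aggregation of several ranked edge lists (illustrative only).
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Each ranked list holds edges ordered best-first. An edge absent from
    a list contributes nothing, so consistently seen edges score higher."""
    scores = defaultdict(float)
    for edges in ranked_lists:
        n = len(edges)
        for rank, edge in enumerate(edges):
            scores[edge] += n - rank  # top-ranked edge gets weight n
    return sorted(scores, key=scores.get, reverse=True)

# Toy ensemble of two ranked lists over three possible edges.
consensus = borda_aggregate([
    [("g1", "g2"), ("g1", "g3"), ("g2", "g3")],
    [("g1", "g2"), ("g2", "g3")],
])
print(consensus)  # ("g1", "g2") ranks first; it tops both lists
```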

adyprat commented 4 years ago

Hi @atuldeshpande,

Thank you for getting back to us. I'm currently trying to integrate the latest version of SINGE into BEELINE. However, I ran into some issues getting the Docker image working. Could you please help me with that?

Thanks, -Aditya

agitter commented 4 years ago

@adyprat thanks for looking at the work-in-progress Docker image. We can help with troubleshooting in your new issue https://github.com/gitter-lab/SINGE/issues/41.

Thanks for your willingness to help propagate the SINGE name change as well. The name change generates confusion but was in response to a specific reviewer request.

adyprat commented 4 years ago

So if I understand correctly, you choose the best-performing hyperparameter set for SINGE and then apply the ensemble approach? Our approach to SINGE ensembling is slightly different. Our belief is that even the best hyperparameter set only sees a partial, inaccurate network. The steps involving the hyperparameter ensemble and the SINGE_Aggregate function (called Modified Borda Aggregation in v0.1.0) aim to combine multiple partial, inaccurate networks to rank the more consistently seen edges higher than the less frequently seen ones. I would say that SINGE as a method involves both the multiple GLG tests and the Modified Borda Aggregation on the results of those tests.

Hi @atuldeshpande, Sorry for the delay in getting back to you on this. You are correct. As Murali mentioned, we chose the hyperparameter set that achieves the highest median AUPRC for SINGE. We used this approach for every other algorithm that involves multiple parameters as well. Please note that we used v0.1.0 of SINGE since we wanted to be consistent with the version in the bioRxiv preprint, which we cite in our BEELINE paper.

To give you some more detail, we used two different values for lambda: {0, 0.01}; five different dT-numlags combinations: {(3, 5), (5, 9), (9, 5), (5, 15), (15, 5)}; and four kernel-width values: {0.5, 1, 2, 4}. We also varied both the prob-zero-removal and the prob-remove-sample parameters in {0, 0.2} and the number of replicates in {2, 6}. We chose these parameter values because our initial exploratory analysis showed that increasing lambda beyond 0.01 to the other values used in the SINGE paper (0.02, 0.05, 1) often resulted in decreased performance or very few or no output edges. We did not see any improvement from increasing the number of replicates to 10 either. These choices resulted in a total of 320 different hyperparameter sets.
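The grid is easy to reproduce; the variable names below are just stand-ins for the parameters listed above:

```python
# Enumerate the full hyperparameter grid described above:
# 2 * 5 * 4 * 2 * 2 * 2 = 320 combinations.
from itertools import product

lambdas = [0, 0.01]
dt_numlags = [(3, 5), (5, 9), (9, 5), (5, 15), (15, 5)]
kernel_widths = [0.5, 1, 2, 4]
prob_zero_removal = [0, 0.2]
prob_remove_sample = [0, 0.2]
replicates = [2, 6]

grid = list(product(lambdas, dt_numlags, kernel_widths,
                    prob_zero_removal, prob_remove_sample, replicates))
assert len(grid) == 320
```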

We then ran SINGE with each parameter set to find the one that achieved the highest median AUPRC, with the median computed for the Linear synthetic network across the 10 simulated datasets we created for each number of cells (100, 200, 500, 2000, and 5000). Thus, for each number of cells, we obtained one best parameter set. We used these parameter sets for the datasets with the corresponding number of cells for every other synthetic network.
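That selection rule can be sketched as follows; the `auprc` lookup table is a hypothetical stand-in for actually running SINGE and evaluating each output network:

```python
# Pick the hyperparameter set with the highest median AUPRC across the
# 10 simulated datasets for one number of cells. `auprc` maps a
# (param_set, dataset_index) pair to the AUPRC of the resulting network.
import statistics

def best_param_set(grid, auprc, n_datasets=10):
    def median_auprc(params):
        return statistics.median(auprc[(params, i)] for i in range(n_datasets))
    return max(grid, key=median_auprc)
```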

We repeated this process for the 10 datasets without dropouts obtained from each of the four curated models, obtaining one best parameter set per curated model. Thus, for every combination of synthetic network and number of cells, and for every curated model, we had one SINGE output network, which we combined in ensembles with the output networks from every other algorithm.

We do understand that your approach using the hyperparameter ensemble (Modified Borda Aggregation) is different. We used our approach for the following reasons:

[Attached figures download1 and download2: AUPRC box plots supporting the reasons above.]

That said, we did investigate whether an ensemble built on the outputs from each of the 12 GRN inference methods would result in a higher AUPRC. We did not find that these ensembles improve on the best-performing algorithm in a systematic way here either. Please note that the AUPRC results (Figure 3 of the BEELINE paper) have changed since our bioRxiv preprint, as we have modified our BoolODE simulations since then. I hope this reply clarifies why we chose the best-performing hyperparameter set as opposed to computing ensembles over hyperparameters.

cc @agitter, @tmmurali.

atuldeshpande commented 4 years ago

Hi @adyprat, thanks for the detailed reply. I think I understand most of the methodology. Am I right in understanding that the blue box plots in the figures above are for models/datasets independent of the ones used for the optimal hyperparameter search?

tmmurali commented 4 years ago

The blue box plots are from the same datasets that we used for the optimal hyperparameter search.

agitter commented 4 years ago

@adyprat @tmmurali we recently released version 0.4.0 of SINGE, which is the first version with a complete Docker image. I wanted to resume our discussion of whether we should try to more fully integrate SINGE with BEELINE.

The primary question is whether you are interested in using the Docker image we build. I envision that your Algorithms/SCINGE/Dockerfile could build from a specified version of our Docker image and then modify the directory structure or environment. You would no longer need the runSCINGE subdirectory.

If that makes sense to you, we could discuss what other parts, like scingeRunner.py, would need to be updated. I could work on a pull request.

tmmurali commented 4 years ago

@agitter sorry for the delay in replying.

@adyprat What do you think is the best way to go forward? Can we use their Docker image?

ktakers commented 4 years ago

@agitter

Please see https://github.com/Murali-group/Beeline/commit/bace690fa3ad21221cc434b91b1d1b7d09011063 for the integration of the SINGE Dockerfile and SINGE.sh executable into BEELINE.

Because the inputs to SINGE.sh for gene expression and pseudotime data are MATLAB mat files, BEELINE still needs the equivalent of the previous executable in runSINGE to convert a CSV count matrix to mat files before calling SINGE.sh. It might be more convenient if SINGE.sh were to alternatively support input for X, ptime, and gene_list in a different format such as CSV.
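For reference, the conversion step could look roughly like the sketch below. The CSV file names and the exact variable layout SINGE expects are assumptions here, and note that scipy cannot write the v7.3 mat format:

```python
# Convert BEELINE-style CSV inputs into a mat file holding the X, ptime,
# and gene_list variables mentioned above (assumed names and layout).
import numpy as np
import pandas as pd
from scipy.io import savemat

counts = pd.read_csv("ExpressionData.csv", index_col=0)  # genes x cells (assumed)
ptime = pd.read_csv("PseudoTime.csv", index_col=0)

savemat("data.mat", {
    "X": counts.values,
    "ptime": ptime.values.ravel(),
    # Object dtype so MATLAB sees a cell array of gene names.
    "gene_list": np.array(counts.index, dtype=object),
})
```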

agitter commented 4 years ago

@ktakers thank you for taking the time to update the SINGE version and BEELINE wrapper code. I can see it was not easy to rewrite the wrappers. I finally reviewed bace690fa3ad21221cc434b91b1d1b7d09011063 and ba008926a7bb81c0c6013f8582404b545d3fde09 and have a few questions and comments:

It might be more convenient if SINGE.sh were to alternatively support input for X, ptime, and gene_list in a different format such as CSV.

We agree. We have an open issue for that (https://github.com/gitter-lab/SINGE/issues/11) but no timeline for resolving it.

ktakers commented 4 years ago

@agitter, thank you for following up and reviewing the commits.

Algorithms/SINGE/Dockerfile is using a temporary version of our compiled MATLAB executables. I could make a small pull request to point this to the versioned release instead (now v0.4.1).

Thank you, I updated the Dockerfile to point to v0.4.1: https://github.com/Murali-group/Beeline/commit/487d25eb608bfd90cc59109bd47031a3b12d35a6 .

What issue did you encounter when using Octave to prepare the input mat files? I would like to support Octave and MATLAB for this step.

At this line https://github.com/gitter-lab/SINGE/blob/master/code/iLasso_for_SINGE.m#L56 , values of m.fullKp are written into a copy of the input mat file before any variable declaration for fullKp, and MATLAB should infer that fullKp is a struct array. This works as expected with version 7.3 input mat files written by MATLAB, which support partial writes of each value without loading fullKp into memory (https://www.mathworks.com/help/matlab/import_export/load-parts-of-variables-from-mat-files.html). As I understand it, the issue I ran into with a version 7 input mat file written by Octave (which cannot write version 7.3 mat files) is that MATLAB first loads the fullKp variable into memory; because that variable hasn't been declared yet, MATLAB initializes it but fails to infer that its type should be a struct array. The actual error is that the Kp2 struct cannot be assigned to a value in m.fullKp; the full error message is copied below:

" Warning: The file '/usr/local/SINGE/TempMat_5.mat' was saved in a format that does not support partial loading. Temporarily loading variable 'fullKp' into memory. To use partial loading efficiently, save MAT-files with the -v7.3 flag.

Error using iLasso_for_SINGE (line 56) Conversion to double from struct is not possible. "

The workaround I implemented in that commit was to initialize the fullKp variable as a struct array in the input mat file. This appears to work, but it is obviously not ideal to initialize a specific internal variable that could change in another SINGE version. I think SINGE could support a version 7 mat file written by Octave by either initializing fullKp as a struct array before that write, loading and saving the input file as a version 7.3 mat file instead of directly copying it to a temporary file when parsing the input, or otherwise writing that variable somewhere other than the input mat file.
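In Python terms, the workaround looks roughly like the sketch below. The real struct fields would have to mirror SINGE's internal Kp2 struct, so the field name and shapes here are placeholders:

```python
# Pre-initialize fullKp as a struct array in the input mat file so MATLAB
# infers the correct type when forced to load the whole variable. scipy
# saves structured NumPy arrays as MATLAB struct arrays; the field name
# and shape below are placeholders, not SINGE's actual layout.
import numpy as np
from scipy.io import savemat

num_genes = 100  # placeholder size
fullKp = np.zeros((num_genes, 1), dtype=[("placeholder_field", object)])
for i in range(num_genes):
    fullKp[i, 0]["placeholder_field"] = np.zeros((0, 0))  # empty entry

# Saved alongside X, ptime, and gene_list in the same input mat file.
savemat("data.mat", {"fullKp": fullKp})
```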

agitter commented 4 years ago

@ktakers thanks for the additional details about the mat file v7 issues. We opened https://github.com/gitter-lab/SINGE/issues/59 and have an idea on how to fix that so Octave can be used to generate the SINGE input files.

agitter commented 3 years ago

@ktakers thanks for reporting the issues with the mat file v7 formatted inputs and file separators in the output directory. The SINGE v0.5.0 release fixes those bugs.

We're also starting to implement support for running SINGE with plain text inputs instead of mat files, but that didn't make it into this release.

ktakers commented 3 years ago

@agitter Thank you for the updated version. I will plan to add the new 0.5.0 release to BEELINE, remove the workaround that I had implemented for the mat file v7 issue when running SINGE in BEELINE, and update this issue.