JCVenterInstitute / NSForest

A machine learning method for the discovery of the minimum marker gene combinations for cell type identification from single-cell RNA sequencing
MIT License

Potential significant optimization with combinations instead of permutations #3

Closed. cdarby closed this issue 3 years ago

cdarby commented 3 years ago

Based on my understanding of your algorithm and on the "results" and "topResults" CSV output files (which contain lines with the same f-measure value for different orderings of a given set of features), I think that at this line

els = [list(x) for x in itertools.permutations(binarylist2, i)]

in the permutor function, you could use the itertools.combinations function and still explore all sets of features required. This would provide significant speedup as there are far fewer combinations than permutations.
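To give a rough sense of the size difference, here is a minimal sketch that counts both kinds of candidates. The feature names, the list length of 6, and the maximum set size of 3 are made up for illustration and are not taken from NSForest; only `binarylist2` is borrowed from the line quoted above.

```python
import itertools

# Hypothetical stand-in for binarylist2; the real list holds candidate marker genes.
binarylist2 = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF"]
max_size = 3  # assumed upper bound on the number of markers per set


def count_candidates(features, max_size):
    """Count ordered (permutation) and unordered (combination) candidates."""
    n_perm = 0
    n_comb = 0
    for i in range(1, max_size + 1):
        n_perm += sum(1 for _ in itertools.permutations(features, i))
        n_comb += sum(1 for _ in itertools.combinations(features, i))
    return n_perm, n_comb


n_perm, n_comb = count_candidates(binarylist2, max_size)

# Every permutation is a reordering of exactly one combination, so the
# distinct feature sets explored are the same either way.
sets_from_perms = {
    frozenset(p)
    for i in range(1, max_size + 1)
    for p in itertools.permutations(binarylist2, i)
}
sets_from_combs = {
    frozenset(c)
    for i in range(1, max_size + 1)
    for c in itertools.combinations(binarylist2, i)
}

print(n_perm, n_comb)                      # 156 41
print(sets_from_perms == sets_from_combs)  # True
```

For sets of size i, the permutation count is i! times the combination count, so the gap widens quickly as the maximum set size grows.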

BAevermann commented 3 years ago

Thank you for the suggestion. There is considerable debate over whether permutations are needed or combinations are enough. The permutations take considerably longer and, in my experience, do not add much value (the different permutations seem to always give the same f-beta score). This is probably a deeper issue that will need to be addressed in future development. In NSForest v3, just released, I opted for combinations, as this version is meant for scanpy workflows, which are time-sensitive.
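One possible reason the orderings score identically, sketched below under an assumption: if a cell is called positive only when every marker in the set is "on", that rule is a logical AND, which does not depend on order. The toy expression values, the cluster labels, the `score` helper, and beta = 0.5 are all hypothetical and are not taken from the NSForest code.

```python
import itertools


def fbeta(y_true, y_pred, beta=0.5):
    """Plain-Python F-beta score; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy binarized expression: 1 means the gene is "on" in that cell (made-up data).
expr = {
    "geneA": [1, 1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [1, 0, 1, 1, 0, 1],
}
in_cluster = [1, 1, 1, 0, 0, 0]  # 1 means the cell belongs to the target cluster


def score(markers):
    """Assumed rule: a cell is positive only if all markers are on."""
    n_cells = len(in_cluster)
    y_pred = [all(expr[g][c] for g in markers) for c in range(n_cells)]
    return fbeta(in_cluster, y_pred)


# Score every ordering of the three markers.
scores = {perm: score(perm) for perm in itertools.permutations(expr, 3)}

print(len(scores))                # 6 orderings
print(len(set(scores.values())))  # 1 distinct score
```

If the real scoring rule were order-sensitive (for example, a sequential decision procedure), this argument would not apply and permutations could give different scores.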

Thanks again, and I apologize for the delay in response!

Brian.