Made two changes to skip setting the dtype to float when launching methods if the dtype is already np.float32.
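For context, a minimal sketch of the kind of guard this refers to (the ensure_float32 name and the counts argument are illustrative assumptions, not the actual launcher code):

```python
import numpy as np
import pandas as pd

def ensure_float32(counts: pd.DataFrame) -> pd.DataFrame:
    """Return counts as float32, skipping the cast if it is already float32."""
    if all(dtype == np.float32 for dtype in counts.dtypes):
        return counts  # already float32: no cast, no copy
    return counts.astype(np.float32)
```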
Refactored how build_cluster retrieves and calculates the cluster means so it uses numpy's mean method, which is a lot faster than applying it through pandas. A run on 200k cells now actually finishes, in about 5 hours.
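Roughly the idea, as a hedged sketch (the function name, arguments, and the cell_type column are assumptions, not the actual build_cluster signature): compute the per-cluster column means directly on the underlying numpy array instead of applying a pandas mean per group.

```python
import numpy as np
import pandas as pd

def cluster_means(counts: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Per-cluster mean expression, computed on the raw numpy array.

    Assumes counts is genes x cells and meta has one row per cell (in the same
    order as the counts columns) with a 'cell_type' column.
    """
    values = counts.values  # float32 matrix, no per-group pandas overhead
    means = {
        cluster: values[:, (meta['cell_type'] == cluster).values].mean(axis=1)
        for cluster in meta['cell_type'].unique()
    }
    return pd.DataFrame(means, index=counts.index)
```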
There's still a part that I think can be sped up, but I don't understand the code well enough to work out how the for-loop retrieves and updates the interaction database and the base_result object.
This applies particularly to mean_analysis and percent_analysis (both functions run during the Running Real Analysis step; mean_analysis also runs on every iteration of shuffled_analysis in the Running Statistical Analysis step).
Both functions contain this for-loop:
```python
for interaction_index, interaction in interactions.iterrows():
    for cluster_interaction in cluster_interactions:
        ...
        # ending in something like this
        result.at[index, column] = value
return result
```
The same goes for build_percent_result, which has the same starting statement.
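To make the question concrete, here is a toy reproduction of that pattern next to one possible faster shape. Everything below is an assumption for illustration: the stand-in tables, the column naming, and the placeholder computation are not the actual analysis logic.

```python
import numpy as np
import pandas as pd

# Toy stand-ins; the real tables come from the interaction database and the cluster means.
interactions = pd.DataFrame({'id': ['i1', 'i2', 'i3']}).set_index('id')
cluster_interactions = [('A', 'B'), ('B', 'A')]
columns = ['|'.join(ci) for ci in cluster_interactions]  # assumed column naming

# Pattern from the snippet above: one .at write per (interaction, cluster pair) cell.
result_slow = pd.DataFrame(index=interactions.index, columns=columns, dtype=np.float32)
for interaction_index, interaction in interactions.iterrows():
    for cluster_interaction in cluster_interactions:
        value = 0.0  # placeholder for the real per-cell computation
        result_slow.at[interaction_index, '|'.join(cluster_interaction)] = value

# One possible faster shape: fill a numpy buffer column by column, build the DataFrame once.
buffer = np.empty((len(interactions), len(cluster_interactions)), dtype=np.float32)
for j, cluster_interaction in enumerate(cluster_interactions):
    buffer[:, j] = 0.0  # placeholder: compute the whole column for this cluster pair at once
result_fast = pd.DataFrame(buffer, index=interactions.index, columns=columns)
```

My guess is that much of the time goes into the per-cell .at writes inside iterrows, but I may be missing a reason the result has to be built cell by cell.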