ST2 is slower than ST1 when fewer than ~2500 features are present in the table. Part of this comes from the unreasonably slow np.random.choice: fully 55% of the runtime is spent rescaling the joint probability vector and choosing a new source index, as revealed by this line profiler run (line 569).
Function: gibbs_sampler at line 436
Line # Hits Time Per Hit % Time Line Contents
==============================================================
# ...omitted...
481 """
482 # Basic bookkeeping information we will use throughout the function.
483 6 7 1.2 0.0 num_sources = cp.V
484 6 3 0.5 0.0 num_features = cp.tau
485 6 24 4.0 0.0 source_indices = np.arange(num_sources)
486 6 22 3.7 0.0 sink = sink.astype(np.int32)
487 6 43 7.2 0.0 sink_sum = sink.sum()
488
489 # Calculate the number of passes that need to be conducted.
490 6 6 1.0 0.0 total_draws = restarts * draws_per_restart
491 6 6 1.0 0.0 total_passes = burnin + (draws_per_restart - 1) * delay + 1
492
493 # Results containers.
494 6 25 4.2 0.0 final_envcounts = np.zeros((total_draws, num_sources), dtype=np.int32)
495 6 492 82.0 0.0 final_env_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
496 6 83 13.8 0.0 final_taxon_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
497
498 # Sequences from the sink will be randomly assigned a source environment
499 # and then reassigned based on an increasingly accurate set of
500 # probabilities. The order in which the sequences are selected for
501 # reassignment must be random to avoid a systematic bias where the
502 # sequences occurring later in the taxon_sequence bookkeeping vector
503 # receive more accurate reassignments by virtue of more updates to the
504 # probability model. 'order' will be shuffled each pass, but can be
505 # instantiated here to avoid unnecessary duplication.
506 6 202 33.7 0.0 order = np.arange(sink_sum, dtype=np.int32)
507
508 # Create a bookkeeping vector that keeps track of each sequence in the
509 # sink. Each one will be randomly assigned an environment, and then
510 # reassigned based on the increasingly accurate distribution. sink[i]
511 # copies of feature i will be placed in the `taxon_sequence` vector to allow each individual
512 # count to be removed and reassigned.
513 6 887 147.8 0.0 taxon_sequence = np.repeat(np.arange(num_features), sink).astype(np.int32)
514
515 # Update the conditional probability class now that we have the sink sum.
516 6 21 3.5 0.0 cp.set_n(sink_sum)
517 6 211 35.2 0.0 cp.precalculate()
518
519 # Several bookkeeping variables that are used within the for loops.
520 6 3 0.5 0.0 drawcount = 0
521 6 6 1.0 0.0 unknown_idx = num_sources - 1
522
523 36 47 1.3 0.0 for restart in range(restarts):
524 # Generate random source assignments for each sequence in the sink
525 # using a uniform distribution.
526 seq_env_assignments, envcounts = \
527 30 7735 257.8 0.0 generate_environment_assignments(sink_sum, num_sources)
528
529 # Initially, the count of each taxon in the 'unknown' source should be
530 # 0.
531 30 160 5.3 0.0 unknown_vector = np.zeros(num_features, dtype=np.int32)
532 30 20 0.7 0.0 unknown_sum = 0
533
534 # If a sequence's random environmental assignment is to the 'unknown'
535 # environment we alter the training data to include those sequences
536 # in the 'unknown' source.
537 797765 657500 0.8 0.1 for e, t in zip(seq_env_assignments, taxon_sequence):
538 797735 629758 0.8 0.1 if e == unknown_idx:
539 199601 514084 2.6 0.1 unknown_vector[t] += 1
540 199601 149597 0.7 0.0 unknown_sum += 1
541
542 660 762 1.2 0.0 for rep in range(1, total_passes + 1):
543 # Iterate through sequences in a random order so that no
544 # systematic bias is introduced based on position in the taxon
545 # vector (i.e. taxa appearing at the end of the vector getting
546 # better estimates of the probability).
547 630 507358 805.3 0.1 np.random.shuffle(order)
548
549 16753065 14481942 0.9 1.7 for seq_index in order:
550 16752435 15050110 0.9 1.8 e = seq_env_assignments[seq_index]
551 16752435 14324415 0.9 1.7 t = taxon_sequence[seq_index]
552
553 # Remove the ith sequence and update the probability
554 # associated with that environment.
555 16752435 20047543 1.2 2.4 envcounts[e] -= 1
556 16752435 13817773 0.8 1.6 if e == unknown_idx:
557 6568698 20939355 3.2 2.5 unknown_vector[t] -= 1
558 6568698 5282662 0.8 0.6 unknown_sum -= 1
559
560 # Calculate the new joint probability vector based on the
561 # removal of the ith sequence. Scale this probability vector
562 # for use by np.random.choice.
563 16752435 16877702 1.0 2.0 jp = cp.calculate_cp_slice(t, unknown_vector[t], unknown_sum,
564 16752435 162371578 9.7 19.4 envcounts)
565
566 # Reassign the sequence to a new source environment and
567 # update counts for each environment and the unknown source
568 # if necessary.
569 16752435 466547414 27.8 55.6 new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
570 #new_e_idx = np.random.multinomial(n=1, pvals=jp/jp.sum()).argmax()
571
572
573 16752435 19320519 1.2 2.3 seq_env_assignments[seq_index] = new_e_idx
574 16752435 23089569 1.4 2.8 envcounts[new_e_idx] += 1
575
576 16752435 14819348 0.9 1.8 if new_e_idx == unknown_idx:
577 6706687 23665932 3.5 2.8 unknown_vector[t] += 1
578 6706687 5595877 0.8 0.7 unknown_sum += 1
579
580 630 560 0.9 0.0 if rep > burnin and ((rep - (burnin + 1)) % delay) == 0:
581 # Update envcounts array with the assigned envs.
582 30 75 2.5 0.0 final_envcounts[drawcount] = envcounts
583
584 # Assign vectors necessary for feature table reconstruction.
585 30 2088 69.6 0.0 final_env_assignments[drawcount] = seq_env_assignments
586 30 1653 55.1 0.0 final_taxon_assignments[drawcount] = taxon_sequence
587
588 # We've made a draw, update this index so that the next
589 # iteration will be placed in the correct index of results.
590 30 28 0.9 0.0 drawcount += 1
591
592 6 5 0.8 0.0 return (final_envcounts, final_env_assignments, final_taxon_assignments)
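As a quick sanity check outside the sampler, the per-call cost of np.random.choice versus the commented-out multinomial alternative (line 570) can be measured in isolation. This is only a rough sketch: the number of sources below is made up, and `jp` is just a random stand-in for one joint probability slice.

import numpy as np
import timeit

num_sources = 10
source_indices = np.arange(num_sources)
jp = np.random.rand(num_sources)  # unnormalized stand-in for the joint probability slice

# np.random.choice validates and rescales `p` on every call.
choice_t = timeit.timeit(
    lambda: np.random.choice(source_indices, p=jp / jp.sum()), number=10000)

# Single multinomial trial; the winning source is the argmax of the counts.
multi_t = timeit.timeit(
    lambda: np.random.multinomial(n=1, pvals=jp / jp.sum()).argmax(),
    number=10000)

print('np.random.choice:      %.2f us/call' % (choice_t * 100))
print('np.random.multinomial: %.2f us/call' % (multi_t * 100))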
If I use the commented-out multinomial code (line 570) as suggested here, I reduce the runtime by 42% (487 s vs. 838 s). That's a substantial improvement from a single line of code. The only thing stopping this from going in is that np.random.choice makes a different number of calls to the PRNG than np.random.multinomial, so the seeded tests all fail.
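To make the PRNG point concrete, here is a toy illustration; the probability vector is made up, and the variable names simply mirror the listing above.

import numpy as np

source_indices = np.arange(4)
jp = np.array([3.0, 1.0, 0.5, 0.25])  # stand-in for one joint probability slice
p = jp / jp.sum()

np.random.seed(42)
via_choice = np.random.choice(source_indices, p=p)

np.random.seed(42)
via_multinomial = np.random.multinomial(n=1, pvals=p).argmax()

# Both draws come from the same distribution, but they consume the PRNG
# stream differently, so they need not agree under a fixed seed. This is
# exactly why the seeded tests fail after the swap.
print(via_choice, via_multinomial)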
The question is: if we are going to remove the seeded tests and recalculate them, is it worth also implementing a faster algorithm (e.g. the alias method that @wasade suggested)? There is a call to implement the alias method in numpy here, but no plan.
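For reference, a minimal sketch of what Vose's alias method could look like as a standalone helper is below. `AliasSampler` is a placeholder name; nothing like it exists in ST2 or numpy today.

import numpy as np

class AliasSampler(object):
    """O(n) table construction, O(1) draws from a fixed discrete distribution."""

    def __init__(self, weights):
        p = np.asarray(weights, dtype=float)
        p = p / p.sum()
        n = len(p)
        self.n = n
        self.prob = np.zeros(n)
        self.alias = np.zeros(n, dtype=np.int64)

        # Partition indices into those below/above the uniform weight 1/n.
        scaled = p * n
        small = [i for i in range(n) if scaled[i] < 1.0]
        large = [i for i in range(n) if scaled[i] >= 1.0]

        # Pair each light index with a heavy one that donates its excess mass.
        while small and large:
            s = small.pop()
            l = large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            scaled[l] -= (1.0 - scaled[s])
            if scaled[l] < 1.0:
                small.append(l)
            else:
                large.append(l)
        for i in small + large:
            self.prob[i] = 1.0

    def draw(self):
        # Pick a column uniformly, then accept it or fall through to its alias.
        i = np.random.randint(self.n)
        if np.random.random() < self.prob[i]:
            return i
        return self.alias[i]

# Usage in the inner loop would look roughly like:
#     sampler = AliasSampler(jp)
#     new_e_idx = sampler.draw()

Building the table is O(num_sources) per distribution and each draw is O(1), but since `jp` is recomputed for every sequence, whether this actually beats the one-line multinomial swap would need to be measured.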