caporaso-lab / sourcetracker2

42% runtime reduction by replacing np.random.choice #78

Closed wdwvt1 closed 7 years ago

wdwvt1 commented 7 years ago

ST2 is slower than ST1 when fewer than ~2500 features are present in the table. Part of this comes from the unreasonably slow np.random.choice: fully 55% of the runtime is spent rescaling the joint probability vector and choosing a new index, as the line profiler run below shows (line 569).

Function: gibbs_sampler at line 436

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
 # omitted
   481                                               """
   482                                               # Basic bookkeeping information we will use throughout the function.
   483         6            7      1.2      0.0      num_sources = cp.V
   484         6            3      0.5      0.0      num_features = cp.tau
   485         6           24      4.0      0.0      source_indices = np.arange(num_sources)
   486         6           22      3.7      0.0      sink = sink.astype(np.int32)
   487         6           43      7.2      0.0      sink_sum = sink.sum()
   488                                           
   489                                               # Calculate the number of passes that need to be conducted.
   490         6            6      1.0      0.0      total_draws = restarts * draws_per_restart
   491         6            6      1.0      0.0      total_passes = burnin + (draws_per_restart - 1) * delay + 1
   492                                           
   493                                               # Results containers.
   494         6           25      4.2      0.0      final_envcounts = np.zeros((total_draws, num_sources), dtype=np.int32)
   495         6          492     82.0      0.0      final_env_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
   496         6           83     13.8      0.0      final_taxon_assignments = np.zeros((total_draws, sink_sum), dtype=np.int32)
   497                                           
   498                                               # Sequences from the sink will be randomly assigned a source environment
   499                                               # and then reassigned based on an increasingly accurate set of
   500                                               # probabilities. The order in which the sequences are selected for
   501                                               # reassignment must be random to avoid a systematic bias where the
   502                                               # sequences occurring later in the taxon_sequence book-keeping vector
   503                                               # receive more accurate reassignments by virtue of more updates to the
   504                                               # probability model. 'order' will be shuffled each pass, but can be
   505                                               # instantiated here to avoid unnecessary duplication.
   506         6          202     33.7      0.0      order = np.arange(sink_sum, dtype=np.int32)
   507                                           
   508                                               # Create a bookkeeping vector that keeps track of each sequence in the
   509                                               # sink. Each one will be randomly assigned an environment, and then
   510                                               # reassigned based on the increasingly accurate distribution. For each
   511                                               # feature i, sink[i] copies of i will be placed in the `taxon_sequence`
   512                                               # vector to allow each individual count to be removed and reassigned.
   513         6          887    147.8      0.0      taxon_sequence = np.repeat(np.arange(num_features), sink).astype(np.int32)
   514                                           
   515                                               # Update the conditional probability class now that we have the sink sum.
   516         6           21      3.5      0.0      cp.set_n(sink_sum)
   517         6          211     35.2      0.0      cp.precalculate()
   518                                           
   519                                               # Several bookkeeping variables that are used within the for loops.
   520         6            3      0.5      0.0      drawcount = 0
   521         6            6      1.0      0.0      unknown_idx = num_sources - 1
   522                                           
   523        36           47      1.3      0.0      for restart in range(restarts):
   524                                                   # Generate random source assignments for each sequence in the sink
   525                                                   # using a uniform distribution.
   526                                                   seq_env_assignments, envcounts = \
   527        30         7735    257.8      0.0              generate_environment_assignments(sink_sum, num_sources)
   528                                           
   529                                                   # Initially, the count of each taxon in the 'unknown' source should be
   530                                                   # 0.
   531        30          160      5.3      0.0          unknown_vector = np.zeros(num_features, dtype=np.int32)
   532        30           20      0.7      0.0          unknown_sum = 0
   533                                           
   534                                                   # If a sequence's random environmental assignment is to the 'unknown'
   535                                                   # environment we alter the training data to include those sequences
   536                                                   # in the 'unknown' source.
   537    797765       657500      0.8      0.1          for e, t in zip(seq_env_assignments, taxon_sequence):
   538    797735       629758      0.8      0.1              if e == unknown_idx:
   539    199601       514084      2.6      0.1                  unknown_vector[t] += 1
   540    199601       149597      0.7      0.0                  unknown_sum += 1
   541                                           
   542       660          762      1.2      0.0          for rep in range(1, total_passes + 1):
   543                                                       # Iterate through sequences in a random order so that no
   544                                                       # systematic bias is introduced based on position in the taxon
   545                                                       # vector (i.e. taxa appearing at the end of the vector getting
   546                                                       # better estimates of the probability).
   547       630       507358    805.3      0.1              np.random.shuffle(order)
   548                                           
   549  16753065     14481942      0.9      1.7              for seq_index in order:
   550  16752435     15050110      0.9      1.8                  e = seq_env_assignments[seq_index]
   551  16752435     14324415      0.9      1.7                  t = taxon_sequence[seq_index]
   552                                           
   553                                                           # Remove the ith sequence and update the probability
   554                                                           # associated with that environment.
   555  16752435     20047543      1.2      2.4                  envcounts[e] -= 1
   556  16752435     13817773      0.8      1.6                  if e == unknown_idx:
   557   6568698     20939355      3.2      2.5                      unknown_vector[t] -= 1
   558   6568698      5282662      0.8      0.6                      unknown_sum -= 1
   559                                           
   560                                                           # Calculate the new joint probability vector based on the
   561                                                           # removal of the ith sequence. Scale this probability vector
   562                                                           # for use by np.random.choice.
   563  16752435     16877702      1.0      2.0                  jp = cp.calculate_cp_slice(t, unknown_vector[t], unknown_sum,
   564  16752435    162371578      9.7     19.4                                             envcounts)
   565                                           
   566                                                           # Reassign the sequence to a new source environment and
   567                                                           # update counts for each environment and the unknown source
   568                                                           # if necessary.
   569  16752435    466547414     27.8     55.6                  new_e_idx = np.random.choice(source_indices, p=jp / jp.sum())
   570                                                           #new_e_idx = np.random.multinomial(n=1, pvals=jp/jp.sum()).argmax()
   571                                           
   572                                           
   573  16752435     19320519      1.2      2.3                  seq_env_assignments[seq_index] = new_e_idx
   574  16752435     23089569      1.4      2.8                  envcounts[new_e_idx] += 1
   575                                           
   576  16752435     14819348      0.9      1.8                  if new_e_idx == unknown_idx:
   577   6706687     23665932      3.5      2.8                      unknown_vector[t] += 1
   578   6706687      5595877      0.8      0.7                      unknown_sum += 1
   579                                           
   580       630          560      0.9      0.0              if rep > burnin and ((rep - (burnin + 1)) % delay) == 0:
   581                                                           # Update envcounts array with the assigned envs.
   582        30           75      2.5      0.0                  final_envcounts[drawcount] = envcounts
   583                                           
   584                                                           # Assign vectors necessary for feature table reconstruction.
   585        30         2088     69.6      0.0                  final_env_assignments[drawcount] = seq_env_assignments
   586        30         1653     55.1      0.0                  final_taxon_assignments[drawcount] = taxon_sequence
   587                                           
   588                                                           # We've made a draw, update this index so that the next
   589                                                           # iteration will be placed in the correct index of results.
   590        30           28      0.9      0.0                  drawcount += 1
   591                                           
   592         6            5      0.8      0.0      return (final_envcounts, final_env_assignments, final_taxon_assignments)

If I use the commented-out multinomial code (line 570) as suggested here, I reduce the runtime by 42% (487 s vs. 838 s). That's a substantial improvement from a single line of code. The only thing stopping this from going in is that np.random.choice makes a different number of calls to the PRNG than np.random.multinomial, so the seeded tests all fail.
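
To make the trade-off concrete, here is a minimal, self-contained sketch (not from the ST2 codebase; `num_sources` and the probability vector are invented stand-ins for the values inside `gibbs_sampler`) that times both samplers and demonstrates why seeded tests break:

```python
import numpy as np
from timeit import timeit

# Hypothetical stand-ins for the values inside the Gibbs loop.
num_sources = 10
source_indices = np.arange(num_sources)
jp = np.random.rand(num_sources)
p = jp / jp.sum()

# Current approach: one categorical draw per iteration via np.random.choice.
t_choice = timeit(lambda: np.random.choice(source_indices, p=p), number=10_000)

# Proposed replacement: a single multinomial trial; argmax returns the index
# of the one category that received the count.
t_multi = timeit(lambda: np.random.multinomial(1, p).argmax(), number=10_000)
print(f"choice: {t_choice:.3f}s, multinomial+argmax: {t_multi:.3f}s")

# Why the seeded tests fail: the two routines consume different numbers of
# values from the global PRNG stream, so everything sampled after the draw
# diverges, even though the draws themselves are identically distributed.
np.random.seed(42)
np.random.choice(source_indices, p=p)
after_choice = np.random.random()

np.random.seed(42)
np.random.multinomial(1, p).argmax()
after_multinomial = np.random.random()
print(after_choice == after_multinomial)  # almost certainly False
```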

The question is: if we are going to remove the seeded tests and recalculate their expected values anyway, is it worth also implementing a faster algorithm (e.g., the alias method that @wasade suggested)? There is a call to implement the alias method in numpy here, but no concrete plan.
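
For reference, below is a minimal sketch of the alias method (Vose's variant); the `AliasSampler` class and its API are hypothetical, not an existing numpy or ST2 interface. Setup is O(K), and every subsequent draw is O(1), consuming exactly two PRNG values regardless of K:

```python
import numpy as np

class AliasSampler:
    """Vose's alias method: O(K) table setup, O(1) per draw."""

    def __init__(self, p):
        p = np.asarray(p, dtype=np.float64)
        self.k = len(p)
        scaled = p * self.k / p.sum()
        self.prob = np.ones(self.k)
        self.alias = np.zeros(self.k, dtype=np.intp)
        small = [i for i in range(self.k) if scaled[i] < 1.0]
        large = [i for i in range(self.k) if scaled[i] >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            self.prob[s] = scaled[s]
            self.alias[s] = l
            # Donate probability mass from the large bin to fill bin s.
            scaled[l] -= 1.0 - scaled[s]
            (small if scaled[l] < 1.0 else large).append(l)
        # Any leftover bins (numerical residue) keep prob == 1.0.

    def draw(self):
        # Exactly two PRNG values per draw, independent of K.
        i = np.random.randint(self.k)
        return i if np.random.random() < self.prob[i] else self.alias[i]

# Usage: build the table once per probability vector, then sample in O(1).
sampler = AliasSampler([0.1, 0.2, 0.3, 0.4])
print([sampler.draw() for _ in range(5)])
```

One caveat for the Gibbs loop: the joint probability vector changes on every iteration, so the alias table would have to be rebuilt each time. That rebuild is only O(num_sources), so it could still beat np.random.choice's per-call overhead, but the constant factors would need benchmarking.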