Status: Open. aih opened this issue 2 years ago
Currently, the ngram list is calculated each time a set of bills is compared: https://github.com/aih/bills/blob/a9b073a84c7c171e161fa0191a663b1662a56517/similarity.go#L205
Consider building the nGramMap once over the whole document data set, using it to vectorize the documents, and storing the vectorized documents. Comparison then reduces to counting over the pre-processed vectors, with no need to recompute the ngram list for each set of bills.
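A rough sketch of the idea, not the code currently in `similarity.go`. The function names (`nGrams`, `buildNGramMap`, `vectorize`, `sharedCount`), the whitespace tokenization, and the choice of "number of distinct shared n-grams" as the comparison count are all assumptions for illustration; the real similarity measure may differ.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// nGrams returns the word n-grams of text (lowercased, split on whitespace).
// Tokenization here is an assumption, not what the repository does.
func nGrams(text string, n int) []string {
	words := strings.Fields(strings.ToLower(text))
	if n <= 0 || len(words) < n {
		return nil
	}
	grams := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		grams = append(grams, strings.Join(words[i:i+n], " "))
	}
	return grams
}

// buildNGramMap assigns an integer index to every distinct n-gram in the
// whole corpus. Keys are sorted first so the indices are reproducible.
func buildNGramMap(docs []string, n int) map[string]int {
	seen := map[string]struct{}{}
	for _, doc := range docs {
		for _, g := range nGrams(doc, n) {
			seen[g] = struct{}{}
		}
	}
	keys := make([]string, 0, len(seen))
	for g := range seen {
		keys = append(keys, g)
	}
	sort.Strings(keys)
	nGramMap := make(map[string]int, len(keys))
	for i, g := range keys {
		nGramMap[g] = i
	}
	return nGramMap
}

// vectorize turns a document into a sparse vector: n-gram index -> count.
func vectorize(doc string, n int, nGramMap map[string]int) map[int]int {
	vec := map[int]int{}
	for _, g := range nGrams(doc, n) {
		if idx, ok := nGramMap[g]; ok {
			vec[idx]++
		}
	}
	return vec
}

// sharedCount counts the distinct n-grams present in both vectors.
// It only touches the pre-computed vectors, never the document text.
func sharedCount(a, b map[int]int) int {
	if len(a) > len(b) {
		a, b = b, a // iterate over the smaller vector
	}
	shared := 0
	for idx := range a {
		if _, ok := b[idx]; ok {
			shared++
		}
	}
	return shared
}

func main() {
	docs := []string{
		"the secretary shall submit a report",
		"the secretary shall submit a plan",
	}
	nGramMap := buildNGramMap(docs, 2) // done once for the whole data set
	v0 := vectorize(docs[0], 2, nGramMap)
	v1 := vectorize(docs[1], 2, nGramMap)
	fmt.Println("shared bigrams:", sharedCount(v0, v1))
}
```

A sparse `map[int]int` is used rather than a dense slice because most bills contain only a small fraction of the corpus-wide n-grams.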
Each vectorized document can be stored next to the document itself, in the same directory, so it only needs to be recomputed when the document or the nGramMap changes.
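One possible way to store a vector alongside its document, as a sketch only. The `.ngrams.json` sidecar suffix, the JSON encoding, and the helper names (`vectorPath`, `saveVector`, `loadVector`) are hypothetical; nothing in the repository defines them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// vectorPath returns the sidecar file path for a document.
// The ".ngrams.json" suffix is a made-up convention for this sketch.
func vectorPath(docPath string) string {
	return docPath + ".ngrams.json"
}

// saveVector writes a sparse vector (n-gram index -> count) next to the document.
func saveVector(docPath string, vec map[int]int) error {
	data, err := json.Marshal(vec)
	if err != nil {
		return err
	}
	return os.WriteFile(vectorPath(docPath), data, 0o644)
}

// loadVector reads back a vector previously written by saveVector.
func loadVector(docPath string) (map[int]int, error) {
	data, err := os.ReadFile(vectorPath(docPath))
	if err != nil {
		return nil, err
	}
	vec := map[int]int{}
	if err := json.Unmarshal(data, &vec); err != nil {
		return nil, err
	}
	return vec, nil
}

func main() {
	// Use a temporary directory so the example does not touch real bill data.
	dir, err := os.MkdirTemp("", "bills-example")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	docPath := filepath.Join(dir, "document.xml")
	if err := saveVector(docPath, map[int]int{0: 2, 3: 1}); err != nil {
		panic(err)
	}
	vec, err := loadVector(docPath)
	if err != nil {
		panic(err)
	}
	fmt.Println("loaded entries:", len(vec))
}
```

The stored indices are only meaningful relative to the nGramMap that produced them, so the map itself (or a version identifier for it) would need to be stored as well.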