Consider creating nGram maps (vocabulary) over an entire data set, not specific to each comparematrix calculation

aih / bills

A processor for bills in Go

Issue #9, open. aih opened 2 years ago

aih commented 2 years ago

Currently, the n-gram list is recomputed each time a set of bills is compared: https://github.com/aih/bills/blob/a9b073a84c7c171e161fa0191a663b1662a56517/similarity.go#L205

Consider building the nGramMap once over the whole document data set, using it to vectorize each document, and storing the vectorized documents. Comparison then becomes a matter of counting over the pre-processed vectors, with no n-gram extraction at compare time.
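A minimal sketch of what this could look like, not taken from the repository. The names `Vocabulary`, `BuildVocabulary`, `Vectorize`, and `SharedCount` are hypothetical, as are the word-level n-grams and the choice of a sparse `map[int]int` as the vector type. The existing code in `similarity.go` may tokenize and score differently.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Vocabulary maps each n-gram seen anywhere in the data set to a fixed index.
// (Hypothetical type, introduced for illustration.)
type Vocabulary map[string]int

// nGrams returns the word n-grams of text, lowercased and joined with spaces.
func nGrams(text string, n int) []string {
	words := strings.Fields(strings.ToLower(text))
	if n <= 0 || len(words) < n {
		return nil
	}
	grams := make([]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		grams = append(grams, strings.Join(words[i:i+n], " "))
	}
	return grams
}

// BuildVocabulary scans every document once. Indices are assigned in sorted
// order so the mapping is the same on every run.
func BuildVocabulary(docs []string, n int) Vocabulary {
	seen := map[string]struct{}{}
	for _, d := range docs {
		for _, g := range nGrams(d, n) {
			seen[g] = struct{}{}
		}
	}
	keys := make([]string, 0, len(seen))
	for g := range seen {
		keys = append(keys, g)
	}
	sort.Strings(keys)
	vocab := make(Vocabulary, len(keys))
	for i, g := range keys {
		vocab[g] = i
	}
	return vocab
}

// Vectorize returns a sparse count vector: vocabulary index -> occurrences.
func Vectorize(doc string, vocab Vocabulary, n int) map[int]int {
	vec := map[int]int{}
	for _, g := range nGrams(doc, n) {
		if idx, ok := vocab[g]; ok {
			vec[idx]++
		}
	}
	return vec
}

// SharedCount sums the overlapping counts of two vectors. This is one simple
// reading of "counting"; other similarity scores could be used instead.
func SharedCount(a, b map[int]int) int {
	if len(b) < len(a) {
		a, b = b, a // iterate over the smaller vector
	}
	total := 0
	for idx, ca := range a {
		if cb, ok := b[idx]; ok {
			if cb < ca {
				ca = cb
			}
			total += ca
		}
	}
	return total
}

func main() {
	docs := []string{
		"a bill to amend the tax code",
		"a bill to amend the energy code",
	}
	vocab := BuildVocabulary(docs, 2)
	a := Vectorize(docs[0], vocab, 2)
	b := Vectorize(docs[1], vocab, 2)
	fmt.Println("vocabulary size:", len(vocab), "shared bigrams:", SharedCount(a, b))
}
```

With this shape, the vocabulary and the vectors are computed once per data set, and each pairwise comparison only touches the two precomputed maps. One trade-off to keep in mind: adding documents with unseen n-grams means extending the vocabulary, so stored vectors need either a stable index assignment or a rebuild.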

aih commented 2 years ago

Each vectorized document can be stored next to the document itself, in the same directory.
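One way this might look, as a sketch only: the vector is written as JSON to a sibling file whose name is derived from the document's path. The `.ngrams.json` suffix, the `document.xml` file name, and the functions `VectorPath`, `SaveVector`, and `LoadVector` are all assumptions made for illustration, not names from the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// VectorPath derives the sibling file name for a document's vector, e.g.
// ".../document.xml" -> ".../document.ngrams.json" (hypothetical convention).
func VectorPath(docPath string) string {
	ext := filepath.Ext(docPath)
	return strings.TrimSuffix(docPath, ext) + ".ngrams.json"
}

// SaveVector writes the sparse vector next to the document as JSON.
// encoding/json encodes the integer map keys as strings.
func SaveVector(docPath string, vec map[int]int) error {
	data, err := json.Marshal(vec)
	if err != nil {
		return err
	}
	return os.WriteFile(VectorPath(docPath), data, 0o644)
}

// LoadVector reads a vector previously written by SaveVector.
func LoadVector(docPath string) (map[int]int, error) {
	data, err := os.ReadFile(VectorPath(docPath))
	if err != nil {
		return nil, err
	}
	vec := map[int]int{}
	if err := json.Unmarshal(data, &vec); err != nil {
		return nil, err
	}
	return vec, nil
}

func main() {
	// Use a temporary directory so the example does not touch real data.
	dir, err := os.MkdirTemp("", "bills")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	docPath := filepath.Join(dir, "document.xml")
	if err := SaveVector(docPath, map[int]int{0: 2, 5: 1}); err != nil {
		panic(err)
	}
	vec, err := LoadVector(docPath)
	if err != nil {
		panic(err)
	}
	fmt.Println(VectorPath(docPath), vec)
}
```

Because the stored indices only make sense relative to one vocabulary, it would probably be worth saving the vocabulary itself (or a version identifier for it) alongside the data set, so stale vectors can be detected after the vocabulary changes.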