edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Use sentence-transformers to compare versions #13

Open Mr0grog opened 1 year ago

Mr0grog commented 1 year ago

This won’t get worked on since this project is moving into maintenance mode, but I wanted to write this note out for anyone who might pick up this project in the future or use parts of it in related work.

We’ve talked a bit over time about ways that ML does and doesn’t work for this project (it intuitively makes sense! but the ways in which things are significant are often unique, and we have — in ML terms, at least — very few samples to train from). However, I recently learned about SBERT, the Python sentence-transformers library, and the general approach of calculating sentence embeddings for semantic comparison/search. Basically, you calculate an embedding (a vector encoding the statistically interesting parts of a string — essentially a “conceptual” map of the string’s content) for each string in a set, then use relatively simple math to compare the embeddings and determine how similar the strings are. This might be a much more useful way to compare changes than the core component of our current method, which is % of text changed.

For example, it can tell us these two sentences are basically the same, despite the rearrangement:

"You and I went down to the river yesterday"
"Yesterday, you and I went down to the river"
> 0.9472 similarity (on a -1 to 1 scale)
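
A minimal sketch of how a score like that can be computed (this just uses the same sentence-transformers calls as the code at the bottom of this comment; exact scores will vary a little by model version):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode both sentences into embedding vectors, then take their cosine similarity.
embeddings = model.encode([
    "You and I went down to the river yesterday",
    "Yesterday, you and I went down to the river",
], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # ~0.9472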

This is a different way to get around small wording changes and spelling/grammar corrections that have dogged us for a while (our current method handles this for a small number of these changes, but if a page has a lot of them, it falls down). For example, here’s an actual change that scores a 0.9383 similarity (1 means they are the same, 0 is no relationship, -1 is diametrically opposed):

[Screenshot of the compared change]

(View in Scanner)

This should probably be scored low — we’re mainly just seeing redundant words being removed, abbreviations expanded, and some house style changes. OTOH, it would be good to ask analysts about this! Sometimes patterns of style changes like this across a site have been interesting. 🤷

That said, there are some complexities here! This technique is built around sentences: it needs more context than single words/tokens to work well, but if you feed it too much text, you probably start losing information and the comparison becomes less and less meaningful. I suspect it wouldn’t perform well if you just boiled down the text content of a whole page body to a single embedding vector and compared that (but maybe worth trying!). Another approach might be to go paragraph-by-paragraph (block element by block element, or whatever; see the sketch below). Or possibly look at each change: grab the surrounding sentence/paragraph content for a change, calculate the embedding for that, compare change-by-change, and report the minimum value across all changes. Or maybe something else.
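
As a rough sketch of the paragraph-by-paragraph idea (naively assuming the two versions split into the same number of paragraphs in the same order, which a real implementation would have to handle more carefully):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def compare_by_paragraph(a, b):
    # Split each version on blank lines and embed all paragraphs in one batch.
    paragraphs_a = [p for p in a.split('\n\n') if p.strip()]
    paragraphs_b = [p for p in b.split('\n\n') if p.strip()]
    embeddings_a = model.encode(paragraphs_a, convert_to_tensor=True)
    embeddings_b = model.encode(paragraphs_b, convert_to_tensor=True)
    # Compare aligned pairs and report the *minimum* similarity, so one
    # heavily changed paragraph isn't averaged away by unchanged ones.
    scores = [util.cos_sim(ea, eb).item()
              for ea, eb in zip(embeddings_a, embeddings_b)]
    return min(scores)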

Another fun idea: you could combine the change-by-change idea above with the UI and tag each change on the page with information about how different it is. A little complex, because I think sentence-transformers is only available in Python, so you’d need a service to report the stats, or you’d need to find a JS implementation (or maybe do something with WASM) to run it in the browser.
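
A hypothetical sketch of the service side of that (Flask and the endpoint shape here are my own assumptions, not anything this project has):

from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer, util

app = Flask(__name__)
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.route('/similarity', methods=['POST'])
def similarity():
    # Expects JSON like {"a": "<old text>", "b": "<new text>"}.
    data = request.get_json()
    embeddings = model.encode([data['a'], data['b']], convert_to_tensor=True)
    return jsonify({'similarity': util.cos_sim(embeddings[0], embeddings[1]).item()})

The UI could then request a score for each change and badge or color-code it accordingly.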


Basic code for the above (based on https://www.sbert.net/docs/usage/semantic_textual_similarity.html):

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

def compare_texts(a, b):
    # Embed each text, then take the cosine similarity of the embeddings
    # (returned as a 1x1 tensor).
    e1 = model.encode([a], convert_to_tensor=True)
    e2 = model.encode([b], convert_to_tensor=True)
    return util.cos_sim(e1, e2)

compare_texts(
    ("Dicamba is a selective herbicide in the benzoic acid family of chemicals. "
     "It is already registered for use in agriculture on corn, wheat and other "
     "crops.\n\n"
     "Dicamba is also registered for non-agricultural uses in residential areas, "
     "and other sites such as golf courses, mainly to control broadleaf weeds "
     "such as dandelions, chickweed, clover and ground ivy.Only dicamba products "
     "registered for use on GE cotton and soybean can be applied “over the top” "
     "(to growing plants). It is a violation of FIFRA to use any other dicamba "
     "product that is not registered for use on GE crops “over the top” on crops."),

    ("Dicamba is a selective herbicide in the benzoic acid family of chemicals. "
     "It is registered for use in agriculture on corn, wheat and other crops.\n\n"
     "Dicamba is also registered for non-agricultural uses in residential areas "
     "and other sites, such as golf courses. At these types of sites, it is "
     "primarily used to control broadleaf weeds such as dandelions, chickweed, "
     "clover and ground ivy.\n\n"
     "Only dicamba products registered for use on genetically engineered cotton "
     "and soybean can be applied “over-the-top” (to growing plants). It is a "
     "violation of the Federal Insecticide, Fungicide, and Rodenticide Act "
     "(FIFRA) to use any dicamba product on crops that is not registered for "
     "over-the-top use on genetically engineered crops.")
)


Mr0grog commented 1 year ago

Follow-up note about breaking comparisons down into smaller pieces: the different models have a max_seq_length property, which is the maximum number of word-piece tokens they’ll evaluate before truncating the input. For the all-MiniLM-L6-v2 model in my example, that’s 256. That means it can handle large paragraphs, but anything after the first 256 tokens doesn’t actually factor into the embedding you get from model.encode().
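
A quick sketch of how you might check the limit and chunk long text before encoding (the word-based split is a naive stand-in, since the real limit is in word-piece tokens, not words):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.max_seq_length)  # 256 for this model

def chunks(text, max_words=200):
    # Stay comfortably under the 256-token limit; a word often expands into
    # more than one word-piece token, so leave some headroom.
    words = text.split()
    for i in range(0, len(words), max_words):
        yield ' '.join(words[i:i + max_words])

You could then embed each chunk separately and aggregate the scores (e.g. take the minimum, as in the paragraph-by-paragraph sketch above).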