fire / instructor-embedding

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Calculating embeddings with instructor-embedding and storing them in Datasette for querying #1

Open fire opened 1 year ago

fire commented 1 year ago

https://github.com/HKUNLP/instructor-embedding

https://simonwillison.net/2023/Jan/13/semantic-search-answers/

fire commented 1 year ago

Follow the semantic-search-answers tutorial:

  1. Install software
  2. Download some data to embed
  3. Load the model
  4. Find similar content
fire commented 1 year ago
scoop install micromamba
micromamba.exe shell
micromamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -c conda-forge
pip install InstructorEmbedding tqdm sentence_transformers
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)
fire commented 1 year ago

Instructor embeddings can generate embeddings for:

  1. Sentence similarity (see the sketch after this list)
  2. Information retrieval
  3. Clustering
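
For the sentence-similarity case, a minimal sketch assuming the instructor-large model from above and scikit-learn's cosine similarity (the paired sentences are made up for illustration):

from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR('hkunlp/instructor-large')
sentences_a = [['Represent the Science sentence: ', '3D ActionSLAM: wearable person tracking in multi-floor environments']]
sentences_b = [['Represent the Science sentence: ', 'Indoor localisation of people with wearable sensors across multiple floors']]
# Cosine similarity between the two instruction-prefixed embeddings.
print(cosine_similarity(model.encode(sentences_a), model.encode(sentences_b)))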
fire commented 1 year ago
micromamba.exe shell
pip install scikit-learn
from InstructorEmbedding import INSTRUCTOR
from sklearn.cluster import MiniBatchKMeans
model = INSTRUCTOR('hkunlp/instructor-xl')

sentences = [['Represent the Medicine sentence for clustering: ','Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ','Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ','Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ',"QCD corrections to Associated t-tbar-H production at the Tevatron"],
             ['Represent the Medicine sentence for clustering: ','A New Analysis of the R Measurements: Resonance Parameters of the Higher,  Vector States of Charmonium']]
embeddings = model.encode(sentences)
clustering_model = MiniBatchKMeans(n_clusters=2,
                         random_state=0,
                         batch_size=6,
                         max_iter=10,
                         n_init="auto")
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)

[1 0 1 0 0]

fire commented 1 year ago

Use customized embeddings for information retrieval

micromamba.exe shell
pip install scikit-learn
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

# Loaded in the earlier snippets; repeated here so this block runs on its own.
model = INSTRUCTOR('hkunlp/instructor-xl')
query  = [['Represent the Wikipedia question for retrieving supporting documents: ','where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ','Capitalism has been dominant in the Western world since the end of feudalism, but most feel[who?] that the term "mixed economies" more precisely describes most contemporary economies, due to their containing both private-owned and state-owned enterprises. In capitalism, prices determine the demand-supply scale. For example, higher demand for certain goods and services lead to higher prices and lower demand for certain goods lead to lower prices.'],
          ['Represent the Wikipedia document for retrieval: ',"The disparate impact theory is especially controversial under the Fair Housing Act because the Act regulates many activities relating to housing, insurance, and mortgage loans—and some scholars have argued that the theory's use under the Fair Housing Act, combined with extensions of the Community Reinvestment Act, contributed to rise of sub-prime lending and the crash of the U.S. housing market and ensuing global economic recession"],
          ['Represent the Wikipedia document for retrieval: ','Disparate impact in United States labor law refers to practices in employment, housing, and other areas that adversely affect one group of people of a protected characteristic more than another, even though rules applied by employers or landlords are formally neutral. Although the protected classes vary by statute, most federal civil rights laws protect based on race, color, religion, national origin, and sex as protected traits, and some laws include disability status and other traits as well.']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings,corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)

2
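
To see the full ranking rather than only the top document, sort the similarities (a small follow-up sketch; it assumes the similarities array from the snippet above):

ranking = np.argsort(similarities[0])[::-1]
for rank, doc_id in enumerate(ranking):
    print(rank, doc_id, float(similarities[0][doc_id]))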

fire commented 1 year ago

A common value for BERT & Co. is 512 word pieces, which corresponds to about 300-400 words (for English). Longer texts than this are truncated to the first x word pieces. https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length

instructor-embedding has a limit of 512 word pieces.
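
To check the limit on the loaded model, the underlying sentence-transformers attribute can be inspected (a small sketch; the startup log later in this thread prints the same value):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
print(model.max_seq_length)  # 512 word pieces; longer inputs are truncated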

fire commented 1 year ago

Follow tutorial:

scoop install micromamba
micromamba.exe shell
micromamba install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia -c conda-forge
pip install InstructorEmbedding tqdm httpx faiss-cpu sentence_transformers
# Run embedding
import httpx
import datetime
import json

def get_blogmarks():
    url = "https://datasette.simonwillison.net/simonwillisonblog/blog_blogmark.json?_size=max&_shape=objects"
    while url:
        data = httpx.get(url, timeout=10).json()
        yield from data["rows"]
        url = data.get("next_url")
        print(url)

blogmarks = list(get_blogmarks())

## For each one I need some text - I decided to concatenate the link_title and commentary fields together:

texts = []
for bm in blogmarks:
    texts.append(["Represent the Science document for retrieval: ", bm["link_title"] + ": " + bm["commentary"]])

## And I need the IDs too, to look things up later:

ids = [bm["id"] for bm in blogmarks]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

print(datetime.datetime.now().isoformat())
embeddings = model.encode(texts)
print(datetime.datetime.now().isoformat())

with open("embeddings.json", "w") as fp:
    json.dump(
        {
            "ids": ids,
            "embeddings": [list(map(float, e)) for e in embeddings]
        },
        fp,
    )
# run_search.py
import faiss
import json
import numpy as np

data = json.load(open("embeddings.json"))

ids = data["ids"]

index = faiss.IndexFlatL2(len(data["embeddings"][0]))
index.add(np.array(data["embeddings"]))

def find_similar_for_id(id, k=10):
    idx = ids.index(id)
    embedding = data["embeddings"][idx]
    _, I = index.search(np.array([embedding]), k)
    # Now find the content IDs for the results
    return [ids[ix] for ix in I[0]]

# Example using id=6832
print(find_similar_for_id(6832))

# Tutorial gives [6832, 5545, 6843, 6838, 5573, 6510, 6985, 6957, 5714, 6840] results

# We get [6832, 5714, 6985, 6838, 6840, 6510, 6843, 7019, 5545, 5972]
fire commented 1 year ago

Follow the tutorial to turn the above into a SQL query:

# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np

data = json.load(open("embeddings.json"))

ids = data["ids"]

## Run over all the entries.
k = len(ids)
index = faiss.IndexFlatL2(len(data["embeddings"][0]))
index.add(np.array(data["embeddings"]))

def find_distance_similar_for_id(id, k=10):
    idx = ids.index(id)
    embedding = data["embeddings"][idx]
    D, I = index.search(np.array([embedding]), k)
    # Now find the content IDs for the results
    return [{ "id": ids[ix], "distance": D} for ix in I[0]]

query  = [['Represent the Science question for retrieving supporting documents: ','What is sqlite?']]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

import datetime
print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())
for embedding in query_embeddings:
    D, I = index.search(np.array([embedding]), k)
    print([{ "id": ids[ix], "distance": D} for ix in I[0]])

def id_list_to_sql(ids):
    values = []
    for sort, id in enumerate(ids):
        values.append(f"({sort}, {id})")
    sql = """
    with results(sort, id) as (
    values
        {}
    )
    select
        results.sort,
        blog_blogmark.link_title,
        blog_blogmark.commentary
    from
        results
    join blog_blogmark on results.id = blog_blogmark.id
    """.format(", ".join(values))
    return sql

def find_similar_for_id(id, k=10):
    idx = ids.index(id)
    embedding = data["embeddings"][idx]
    _, I = index.search(np.array([embedding]), k)
    # Now find the content IDs for the results
    return [ids[ix] for ix in I[0]]

print(id_list_to_sql(find_similar_for_id(6832, k)))
fire commented 1 year ago

Write the SQL into https://datasette.simonwillison.net/simonwillisonblog:

with results(sort, id) as (
    values
        (0, 6832), (1, 5714), (2, 6985), (3, 6838), (4, 6840), (5, 6510), (6, 6843), (7, 7019), (8, 5545), (9, 5972)
    )
    select
        results.sort,
        blog_blogmark.link_title,
        blog_blogmark.commentary
    from
        results
    join blog_blogmark on results.id = blog_blogmark.id
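
The results.sort column carries the ranking but the query above never orders by it, so the rows can come back in any order; appending a final line (not part of the original query) keeps the similarity ranking:

    order by results.sort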

image

fire commented 1 year ago

jo reviewed this tutorial and mentioned that the article on LSA (a similar approach) has applications: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Commercial_applications

You may also get some mileage out of combining it with a popular lexical ranking function, BM25 (https://en.wikipedia.org/wiki/Okapi_BM25); a small scoring sketch follows the list below.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of...

  • Compare the documents in the low-dimensional space (data clustering, document classification).
  • Find similar documents across languages, after analyzing a base set of translated documents (cross-language information retrieval).
  • Find relations between terms (synonymy and polysemy).
  • Given a query of terms, translate it into the low-dimensional space, and find matching documents (information retrieval).
  • Find the best similarity between small groups of terms, in a semantic way (i.e. in a context of a knowledge corpus), as for example in multi choice questions MCQ answering model.[5]
  • Expand the feature space of machine learning / text mining systems [6]
  • Analyze word association in text corpus [7]
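
A rough BM25 scorer for comparing against the embedding ranking (a minimal sketch of the standard Okapi BM25 formula; the whitespace tokenisation and the k1/b defaults are simplistic assumptions, not tuned for this data):

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Whitespace tokenisation only; a real implementation would strip punctuation and stem.
    tokenised = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenised) / len(tokenised)
    n = len(tokenised)
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenised if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

print(bm25_scores("where is the food stored in a yam plant",
                  ["yams store food in an underground tuber",
                   "capitalism has been dominant in the western world"]))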
fire commented 1 year ago

image

There was a related thread with Tas and the "negative" similarity problem.

https://github.com/HKUNLP/instructor-embedding#training

mrmetaverse commented 1 year ago

I want to share this with the magick team. Can we make this public, temporarily?

fire commented 1 year ago

There was a request to return the distance alongside each ranked result.

# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np

data = json.load(open("embeddings.json"))

ids = data["ids"]

index = faiss.IndexFlatL2(len(data["embeddings"][0]))
index.add(np.array(data["embeddings"]))

def find_distance_similar_for_id(id, k=10):
    idx = ids.index(id)
    embedding = data["embeddings"][idx]
    D, I = index.search(np.array([embedding]), k)
    # Now find the content IDs for the results.
    # Note: D is the full 1 x k row of distances for this query, so every result
    # below carries the same array rather than its own scalar distance.
    return [{ "id": ids[ix], "distance": D} for ix in I[0]]

# Example using id=6832
print(find_distance_similar_for_id(6832))

def find_similar_for_id(id, k=10):
    idx = ids.index(id)
    embedding = data["embeddings"][idx]
    _, I = index.search(np.array([embedding]), k)
    # Now find the content IDs for the results
    return [ids[ix] for ix in I[0]]

# Example using id=6832
print(find_similar_for_id(6832))

query  = [['Represent the Science question for retrieving supporting documents: ','What is sqlite?']]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

import datetime
print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())
k = 10
for embedding in query_embeddings:
    D, I = index.search(np.array([embedding]), k)
    # As above, D is the full 1 x k distance row, so each dict repeats the same array.
    print([{ "id": ids[ix], "distance": D} for ix in I[0]])

def id_list_to_sql(ids):
    values = []
    for sort, id in enumerate(ids):
        values.append(f"({sort}, {id})")
    sql = """
    with results(sort, id) as (
    values
        {}
    )
    select
        results.sort,
        blog_blogmark.link_title,
        blog_blogmark.commentary
    from
        results
    join blog_blogmark on results.id = blog_blogmark.id
    """.format(", ".join(values))
    return sql

print(id_list_to_sql(find_similar_for_id(6832)))
> python search.py
[{'id': 6832, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 5714, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 6985, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 6838, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 6840, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 6510, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 6843, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 7019, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 5545, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}, {'id': 5972, 'distance': array([[0.        , 0.3499282 , 0.35002232, 0.37230122, 0.3832788 ,
        0.38505322, 0.4014591 , 0.40481126, 0.40543705, 0.41339704]],
      dtype=float32)}]
load INSTRUCTOR_Transformer
max_seq_length  512
2023-02-18T13:00:04.890328
2023-02-18T13:00:11.646496
[{'id': 6829, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6413, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 2339, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6840, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6510, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 5507, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6578, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6585, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 6927, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],
      dtype=float32)}, {'id': 5868, 'distance': array([[0.48008698, 0.49275458, 0.49381375, 0.49479583, 0.5044458 ,
        0.5054495 , 0.5088914 , 0.5125635 , 0.5150317 , 0.51733065]],

Write the SQL into https://datasette.simonwillison.net/simonwillisonblog:

-- query  = [['Represent the Science question for retrieving supporting documents: ','What is sqlite?']]
 with results(sort, id) as (
    values
        (0, 6832), (1, 5714), (2, 6985), (3, 6838), (4, 6840), (5, 6510), (6, 6843), (7, 7019), (8, 5545), (9, 5972)
    )
    select
        results.sort,
        blog_blogmark.link_title,
        blog_blogmark.commentary
    from
        results
    join blog_blogmark on results.id = blog_blogmark.id
fire commented 1 year ago

image

fire commented 1 year ago

We could plot these points as a graph where the pairwise distances act as edge weights; a force-directed layout would then pull nodes closer to or farther from their neighbours until the whole system settles into equilibrium.
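
One way to compute such an equilibrium layout is multidimensional scaling over the pairwise distances (a sketch, assuming the embeddings.json file produced above; MDS stands in for a force-directed layout here, and it gets slow beyond a few thousand points):

import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

data = json.load(open("embeddings.json"))
embeddings = np.array(data["embeddings"])

# Pairwise L2 distances act as the target "spring lengths" between nodes.
distances = pairwise_distances(embeddings, metric="euclidean")
layout = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(distances)

plt.scatter(layout[:, 0], layout[:, 1], s=4)
plt.savefig("layout.png")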

fire commented 1 year ago
# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np

data = json.load(open("embeddings.json"))

ids = data["ids"]

## Run over all the entries.
k = 10
index = faiss.IndexFlatL2(len(data["embeddings"][0]))
index.add(np.array(data["embeddings"]))

query  = [['Represent the Science question for retrieving supporting documents: ','What is sqlite?']]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

import datetime
print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())
for embedding in query_embeddings:
    distances, anns = index.search(np.array([embedding]), k)
    ranked = []
    # distances and anns both have shape (1, k); pair each neighbour with its own distance.
    # Note: IndexFlatL2 already returns squared L2 distances, so **2 squares them again.
    for row, column in zip(anns[0], distances[0]):
        ranked.append({"id": ids[row], "distance": column**2})
fire commented 1 year ago

Use a scatter plot. https://godotengine.org/asset-library/asset/643

fire commented 1 year ago

Problem

I have pairwise distances between roughly 6,000 ids (a 6,000 × 6,000 matrix). How do I plot this?
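
One cheap option is to skip the full 6,000 × 6,000 matrix and scatter-plot a 2D projection of the embeddings themselves (a sketch assuming the embeddings.json file from earlier; PCA is my own choice here, not something used elsewhere in the thread):

import json
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

data = json.load(open("embeddings.json"))
embeddings = np.array(data["embeddings"])

# Project the high-dimensional embeddings down to two components and plot them.
points = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], s=2)
plt.savefig("embedding_scatter.png")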

fire commented 1 year ago

image

https://github.com/hydrosquall/datasette-nteract-data-explorer

fire commented 1 year ago

Cluster the posts.

# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np
import httpx
import datetime

def get_blogmarks():
    url = "https://datasette.simonwillison.net/simonwillisonblog/blog_blogmark.json?_size=max&_shape=objects"
    while url:
        data = httpx.get(url, timeout=10).json()
        yield from data["rows"]
        url = data.get("next_url")
        print(url)

blogmarks = list(get_blogmarks())

## For each one I need some text - I decided to concatenate the link_title and commentary fields together:

texts = []
for bm in blogmarks:
    texts.append(["Represent the Science document for clustering: ", bm["link_title"] + ": " + bm["commentary"]])

## And I need the IDs too, to look things up later:

ids = [bm["id"] for bm in blogmarks]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

print(datetime.datetime.now().isoformat())
embeddings = model.encode(texts)
print(datetime.datetime.now().isoformat())

with open("clustering.json", "w") as fp:
    json.dump(
        {
            "ids": ids,
            "embeddings": [list(map(float, e)) for e in embeddings]
        },
        fp,
    )

data = json.load(open("clustering.json"))
ids = data["ids"]

ncentroids = 1024
niter = 20
verbose = True
embeddings = np.array(data["embeddings"])
d = embeddings.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(embeddings)

index = faiss.IndexFlatL2(d)
index.add(np.array(data["embeddings"]))  # faiss needs an array, not a plain list
_, I = index.search(kmeans.centroids, 15)
print([ids[ix] for ix in I[0]])  # list comprehension, not a generator, so the ids actually print
print(datetime.datetime.now().isoformat())
fire commented 1 year ago
# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np
import httpx
import datetime

def get_blogmarks():
    url = "https://datasette.simonwillison.net/simonwillisonblog/blog_entry.json?_size=max&_shape=objects"
    while url:
        data = httpx.get(url, timeout=10).json()
        yield from data["rows"]
        url = data.get("next_url")
        print(url)

blogmarks = list(get_blogmarks())

## For each one I need some text - concatenate the title and body fields together:

texts = []
for bm in blogmarks:
    texts.append(["Represent the Science document for clustering: ", bm["title"] + ": " + bm["body"]])

## And I need the IDs too, to look things up later:

ids = [bm["id"] for bm in blogmarks]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

print(datetime.datetime.now().isoformat())
embeddings = model.encode(texts)
print(datetime.datetime.now().isoformat())

with open("clustering.json", "w") as fp:
    json.dump(
        {
            "ids": ids,
            "embeddings": [list(map(float, e)) for e in embeddings]
        },
        fp,
    )

data = json.load(open("clustering.json"))
ids = data["ids"]

ncentroids = 1024
niter = 20
verbose = True
embeddings = np.array(data["embeddings"])
d = embeddings.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(embeddings)

query  = [['Represent the Science question for retrieving supporting documents: ', 'What is javascript?']]

from InstructorEmbedding import INSTRUCTOR
# Note: the entries above were encoded with instructor-large; encoding the query with
# instructor-xl mixes two models whose embedding spaces are not directly comparable.
model = INSTRUCTOR('hkunlp/instructor-xl')

print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())

def id_list_to_sql(ids):
    values = []
    for sort, id in enumerate(ids):
        values.append(f"({sort}, {id})")
    sql = """
    with results(sort, id) as (
    values
        {}
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id
    """.format(", ".join(values))
    return sql

for embedding in query_embeddings:
    distances, anns = kmeans.index.search(np.array([embedding]), 15)
    nearest_points = []
    ann = anns[0]
    for ann_i in range(len(ann)):
        row = ann[ann_i]
        nearest_points.append(ids[row])
    print("Find the nearest closest 15 points.")
    print(id_list_to_sql(nearest_points))

index = faiss.IndexFlatL2(d)
index.add(np.array(data["embeddings"]))
distances, anns = index.search(kmeans.centroids, 15)
ranked = []
ann = anns[0]
for ann_i in range(len(ann)):
    row = ann[ann_i]
    ranked.append(ids[row])

print("The articles that are the leading example.")
print(id_list_to_sql(ranked))
fire commented 1 year ago
-- Find the nearest 15 points.

    with results(sort, id) as (
    values
        (0, 147), (1, 453), (2, 107), (3, 924), (4, 723), (5, 454), (6, 403), (7, 441), (8, 908), (9, 382), (10, 687), (11, 807), (12, 450), (13, 964), (14, 684)
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id

-- The articles that are the leading example.

    with results(sort, id) as (
    values
        (0, 1580), (1, 7850), (2, 1499), (3, 8093), (4, 8116), (5, 1576), (6, 8084), (7, 7575), (8, 8076), (9, 1497), (10, 7563), (11, 8201), (12, 8115), (13, 7223), (14, 7842)
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id
fire commented 1 year ago
# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np
import httpx
import datetime

# def get_blogmarks():
#     url = "https://datasette.simonwillison.net/simonwillisonblog/blog_entry.json?_size=max&_shape=objects"
#     while url:
#         data = httpx.get(url, timeout=10).json()
#         yield from data["rows"]
#         url = data.get("next_url")
#         print(url)

# blogmarks = list(get_blogmarks())

# ## For each one I need some text - I decided to concatenate the link_title and commentary fields together:

# texts = []
# for bm in blogmarks:
#     texts.append(["Represent the Science document for clustering: ", bm["title"] + ": " + bm["body"]])

# # ## And I need the IDs too, to look things up later:

# ids = [bm["id"] for bm in blogmarks]

# from InstructorEmbedding import INSTRUCTOR
# model = INSTRUCTOR('hkunlp/instructor-large')

# print(datetime.datetime.now().isoformat())
# embeddings = model.encode(texts)
# print(datetime.datetime.now().isoformat())

# with open("clustering.json", "w") as fp:
#     json.dump(
#         {
#             "ids": ids,
#             "embeddings": [list(map(float, e)) for e in embeddings]
#         },
#         fp,
#     )

data = json.load(open("clustering.json"))
ids = data["ids"]

ncentroids = len(ids)
niter = 20
verbose = True
embeddings = np.array(data["embeddings"])
d = embeddings.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(embeddings)

query  = [['Represent the Science question for retrieving supporting documents: ', 'What is sqlite?']]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())

def id_list_to_sql(ids):
    values = []
    for sort, id in enumerate(ids):
        values.append(f"({sort}, {id})")
    sql = """
    with results(sort, id) as (
    values
        {}
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id
    """.format(", ".join(values))
    return sql

for embedding in query_embeddings:
    distances, anns = kmeans.index.search(np.array([embedding]), 15)
    nearest_points = []
    ann = anns[0]
    for ann_i in range(len(ann)):
        row = ann[ann_i]
        nearest_points.append(ids[row])
    print("Find the nearest closest 15 points.")
    print(id_list_to_sql(nearest_points))

index = faiss.IndexFlatL2(d)
index.add(np.array(data["embeddings"]))
distances, anns = index.search(kmeans.centroids, 15)
ranked = []
ann = anns[0]
for ann_i in range(len(ann)):
    row = ann[ann_i]
    ranked.append(ids[row])

print("These articles are the most distinct.")
print(id_list_to_sql(ranked))
fire commented 1 year ago
    with results(sort, id) as (
    values
        (0, 817), (1, 8098), (2, 7866), (3, 8187), (4, 7962), (5, 8089), (6, 7865), (7, 8091), (8, 7956), (9, 8139), (10, 8036), (11, 7974), (12, 7925), (13, 7970), (14, 8147)
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id

These articles are the most distinct.

    with results(sort, id) as (
    values
        (0, 1), (1, 1249), (2, 10), (3, 443), (4, 231), (5, 529), (6, 1328), (7, 5), (8, 1371), (9, 1275), (10, 699), (11, 1132), (12, 854), (13, 288), (14, 292)
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id

For the query: "What is sqlite?"

image

fire commented 1 year ago

What is Wikipedia?

image

# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np
import httpx
import datetime

# def get_blogmarks():
#     url = "https://datasette.simonwillison.net/simonwillisonblog/blog_entry.json?_size=max&_shape=objects"
#     while url:
#         data = httpx.get(url, timeout=10).json()
#         yield from data["rows"]
#         url = data.get("next_url")
#         print(url)

# blogmarks = list(get_blogmarks())

# ## For each one I need some text - I decided to concatenate the link_title and commentary fields together:

# texts = []
# for bm in blogmarks:
#     texts.append(["Represent the Science document for clustering: ", bm["title"] + ": " + bm["body"]])

# # ## And I need the IDs too, to look things up later:

# ids = [bm["id"] for bm in blogmarks]

# from InstructorEmbedding import INSTRUCTOR
# model = INSTRUCTOR('hkunlp/instructor-large')

# print(datetime.datetime.now().isoformat())
# embeddings = model.encode(texts)
# print(datetime.datetime.now().isoformat())

# with open("clustering.json", "w") as fp:
#     json.dump(
#         {
#             "ids": ids,
#             "embeddings": [list(map(float, e)) for e in embeddings]
#         },
#         fp,
#     )

data = json.load(open("clustering.json"))
ids = data["ids"]

ncentroids = len(ids)
niter = 20
verbose = True
embeddings = np.array(data["embeddings"])
d = embeddings.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
kmeans.train(embeddings)

query  = [['Represent the Science question for retrieving supporting documents: ', 'What is Wikipedia?']]

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-xl')

print(datetime.datetime.now().isoformat())
query_embeddings = model.encode(query)
print(datetime.datetime.now().isoformat())

def id_list_to_sql(ids):
    values = []
    for sort, id in enumerate(ids):
        values.append(f"({sort}, {id})")
    sql = """
    with results(sort, id) as (
    values
        {}
    )
    select
        results.sort,
        blog_entry.title,
        blog_entry.body
    from
        results
    join blog_entry on results.id = blog_entry.id
    """.format(", ".join(values))
    return sql

for embedding in query_embeddings:
    distances, anns = kmeans.index.search(np.array([embedding]), 15)
    nearest_points = []
    ann = anns[0]
    for ann_i in range(len(ann)):
        row = ann[ann_i]
        nearest_points.append(ids[row])
    print("Find the nearest closest 15 points.")
    print(id_list_to_sql(nearest_points))

index = faiss.IndexFlatL2(d)
index.add(np.array(data["embeddings"]))
distances, anns = index.search(kmeans.centroids, 15)
ranked = []
ann = anns[0]
for ann_i in range(len(ann)):
    row = ann[ann_i]
    ranked.append(ids[row])

print("These articles are the most distinct.")
print(id_list_to_sql(ranked))
fire commented 1 year ago

Represent the Manga document for clustering:

https://github.com/victor-soeiro/WebScraping-Projects/tree/main/01%20-%20anime-planet

fire commented 1 year ago

demo_farthest_L2.zip

Find the furthest points.

['Houseki-hime wa Mitsu ni Nurete: Aishite Jewel Star', 'Amiami Romance', 'Kojirase Hyakki Dominor', 'MiMiMies', 'Ikenai Himitsu x Momoiro Nikki', 'Christmas Carol: Manga de Dokuha', 'Star Buddy: Beauty & Little Beast', 'Miku Pure Voice', 'Oitekebori no Hitoribocchi', 'Super Mario Brothers: Ghost Koopa no Gyakushuu', 'Dokidoki Necromantic', 'THE iDOLM@STER: Cinderella Girls Theater', 'Dollhouse no Hitobito', 'Joan of Arc: France wo Sukutta Orleans no Otome', 'Kaettekita Doranko']

fire commented 1 year ago

furthest.zip

["title: Welcome to Demon School, Iruma-kun description: Suzuki Iruma has just been sold to a demon by his irresponsible parents! Surprisingly, the next thing he knows he's living with the demon and has been transferred into a school in the demon world. Thus begins the cowardly Iruma-kun's extraordinary school life among the otherworldly... year: 2017 tags: ['Adventure', 'Comedy', 'Shounen', 'Demons', 'Monster School', 'Person in a Strange World', 'School Life', 'Supernatural', 'Adapted to Anime']", "title: I Found a Husband When I Picked up the Male Lead description: Her family used all their money for her extravagance and luxuries and brought a crisis of bankruptcy. While trying to figure out how to pay off their debts, she found a leaflet from the duke looking for a lost child. The reward is so much money that you can play and eat even after you pay off your debts! Following the memories of reading this book, right away, Lizelle picked up the boy who was caught in a trash in a poor village. She took the lost Lapel and went to the duke. “This is the child the Duke is looking for.” Duke Chester said, looking at me with a doubtful glance. “I need confirmation, so you should stay with my child in this house for the time being.” The strange cohabitation of the three people started like that. However, Lapel keeps thinking of me as a mother and won’t let her leave. year: 2022 tags: ['Fantasy', 'Manhwa', 'Romance', 'Webtoons', 'Childcare', 'Full Color', 'Nobility', 'Based on a Web Novel']"]

fire commented 1 year ago

Still testing.

# Get embedding Json.
# run_search.py
import faiss
import json
import numpy as np
import httpx
import datetime
import csv

# https://stackoverflow.com/a/69613250/381724
with open("comics.csv", mode='r') as infile:
    reader = csv.DictReader(infile, skipinitialspace=True)
    comics = [r for r in reader]

## For each one I need some text - I decided to concatenate the title and description fields together:

texts = []
for c in comics:
    texts.append(["Represent the manga document for retrieval: ", "title: " + c["title"] + " description: " + c["description"] + " year: " + c["year"] + " tags: " + c["tags"]])

print(texts[0])

if False:
    # TODO:  
    # + " cover: " + c["cover"]
    # + " rating: " + c["rating"] 

    ## And I need the IDs too, to look things up later:

    ids = [c["title"] for c in comics]

    from InstructorEmbedding import INSTRUCTOR
    model = INSTRUCTOR('hkunlp/instructor-large')

    print(datetime.datetime.now().isoformat())
    embeddings = model.encode(texts)
    print(datetime.datetime.now().isoformat())

    with open("similarity.json", "w") as fp:
        json.dump(
            {
                "ids": ids,
                "embeddings": [list(map(float, e)) for e in embeddings]
            },
            fp,
        )

print(datetime.datetime.now().isoformat())
data = json.load(open("similarity.json"))
ids = data["ids"]
def augment_queries(xq): 
    extra_column = np.ones((len(xq), 1), dtype=xq.dtype)
    return np.hstack((xq, extra_column))
def augment_database(xb): 
    norms2 = (xb ** 2).sum(1)
    return np.hstack((-2 * xb, norms2[:, None]))
d = len(data["embeddings"][0])
kmeans = faiss.Kmeans(d + 1, 1024, niter=20, verbose=True)
kmeans.train(augment_database(np.array(data["embeddings"])))
query  = []
for title in ids:
    query.append(['Represent the manga question for retrieving supporting documents: ', title])
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
query_embeddings = model.encode(query)
_, anns = kmeans.index.search(augment_queries(query_embeddings), 2)
print([texts[ix] for ix in anns[0]])
print(datetime.datetime.now().isoformat())

kmeans = faiss.Kmeans(d, 1024, niter=20, verbose=True)
kmeans.train(np.array(data["embeddings"]))
query_embeddings = model.encode(query)
_, anns = kmeans.index.search(query_embeddings, 2)
print([texts[ix] for ix in anns[0]])
print(datetime.datetime.now().isoformat())
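
For reference, the query/database augmentation above looks like the standard trick for expressing squared L2 distance as an inner product (my reading of the code, not something stated in the thread): with q' = [q, 1] and x' = [-2x, ||x||^2], the dot product q'.x' = ||x||^2 - 2 q.x = ||q - x||^2 - ||q||^2, and since ||q||^2 is constant for a given query, ranking by q'.x' is equivalent to ranking by L2 distance (largest values give the farthest points). Note, though, that kmeans.index here is an L2 index over the augmented centroids, so the search above is not literally that inner-product ranking.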