PathwayCommons / semantic-search

A simple semantic search engine for scientific papers.
https://share.streamlit.io/pathwaycommons/semantic-search/semantic_search/demo.py
MIT License

Index embeddings with FAISS #45

Closed · JohnGiorgi closed this 3 years ago

JohnGiorgi commented 3 years ago

Overview

We should be using FAISS to index the embeddings by PMID. That way, when a PMID is included in a request, we can first check whether it already exists in the index rather than passing its text through the neural network again.

The basic process is outlined in the following colab notebook. The trick will be to think of a simple interface within semantic-search to create, load, and update indices.

Rough outline

My general feeling for how indexing will work...

  1. Semantic search will accept an existing FAISS index that has been saved to disk, or, if an index is not provided, create a new one.
  2. During every request, semantic search will check whether the provided IDs are already in the index. If not, embed their text and add them.
  3. Use FAISS to perform a cosine-similarity lookup between the query and the index, and return the top-k results (see the sketch after this list).
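
My rough idea of what that interface could look like (a minimal sketch; all names are hypothetical and not the repo's actual API). It uses an IndexIDMap so vectors are keyed by PMID, and L2-normalizes embeddings so that the inner-product search is equivalent to cosine similarity:

```python
import os

import faiss
import numpy as np

EMBEDDING_DIM = 768  # dimensionality of the sentence embeddings


def load_or_create_index(path: str) -> faiss.Index:
    """Load a FAISS index from disk if it exists, otherwise create a fresh one."""
    if os.path.exists(path):
        return faiss.read_index(path)
    # IndexFlatIP computes inner products; IndexIDMap lets us key vectors by PMID.
    return faiss.IndexIDMap(faiss.IndexFlatIP(EMBEDDING_DIM))


def add_embeddings(index: faiss.Index, pmids: np.ndarray, embeddings: np.ndarray) -> None:
    """Add (PMID, embedding) pairs to the index."""
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(embeddings)  # so inner product == cosine similarity
    index.add_with_ids(embeddings, pmids.astype("int64"))


def search(index: faiss.Index, query: np.ndarray, top_k: int):
    """Return the top_k (PMID, score) pairs most similar to the query vector."""
    query = np.ascontiguousarray(query.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(query)
    scores, pmids = index.search(query, top_k)
    return list(zip(pmids[0].tolist(), scores[0].tolist()))
```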

Based on today's conversation, immediate next steps:

JohnGiorgi commented 3 years ago

@Anwesh1 This is a rather big job. I would go through the semantic-search repo to get a feel for what it does (ask lots of questions), then the Colab above, then the FAISS documentation.

Anwesh1 commented 3 years ago

@JohnGiorgi Okay sounds good. I'll go through it, thanks.

Anwesh1 commented 3 years ago

Hey @JohnGiorgi, so I looked around the semantic-search program and made a basic Python script that does the POST request. I was playing around with it, and it seems to run with the data we get from pubmed-dl, although we might have to go back in there and put a comma after every line to adhere to the semantic-search format. I will try to get an understanding of the FAISS stuff as well.

JohnGiorgi commented 3 years ago

@Anwesh1 Do you have an example of what you mean by the comma after each line? semantic-search accepts input text up to MAX_LENGTH (a parameter you pass to the server on startup), so you shouldn't need to comma-separate any data (in fact, that might cause issues).

Anwesh1 commented 3 years ago

Umm, so essentially I was just testing it inside a Python script I set up. I json.dumps() the data below and send it in the POST request.

    {
        "query": {"uid": "9887103", "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development. The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms..."},
        "documents": [
            {"uid": "9887103", "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development. The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms..."},
            {"uid": "30049242", "text": "Transcriptional up-regulation of the TGF-β intracellular signaling transducer Mad of Drosophila larvae in response to parasitic nematode infection. The common fruit fly Drosophila melanogaster is an exceptional model for dissecting innate immunity..."},
            {"uid": "22936248", "text": "High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Many eukaryotic genes possess multiple alternative promoters with distinct expression specificities..."}
        ],
        "top_k": 3
    }

Here, as you can see, after every "text" entry's closing brace there is a comma. However, the data we get from pubmed-dl doesn't have a comma after each record's closing brace. So if we tweak that, I think we can pass it as a whole without too much interference? Unless we are planning on going a different way.
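
For reference, this is roughly the shape of the test script I mean (the host, port, and endpoint path are assumptions; adjust to however the server is started). With the requests library the dict can be passed via json= directly, so no manual json.dumps() or comma fiddling is needed:

```python
import requests

payload = {
    "query": {"uid": "9887103", "text": "The Drosophila activin receptor baboon signals through dSmad2..."},
    "documents": [
        {"uid": "30049242", "text": "Transcriptional up-regulation of the TGF-beta intracellular signaling transducer Mad..."},
        {"uid": "22936248", "text": "High-fidelity promoter profiling reveals widespread alternative promoter usage..."},
    ],
    "top_k": 3,
}

# requests serializes the dict to JSON and sets the Content-Type header for us.
response = requests.post("http://localhost:8000/", json=payload)
print(response.json())  # a list of {"uid": ..., "score": ...} objects
```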

jvwong commented 3 years ago

Here's how semantic search works: for each user/client-provided input uid, you must create a system to fetch the corresponding text. The request itself contains only UIDs:

{
    "query": "9887103",
    "documents": ["9887103", "30049242", "22936248"],
    "top_k": 3
}

The corresponding response:

[
  {
    "uid": "9887103",
    "score": 1.0
  },
  {
    "uid": "30049242",
    "score": 0.6427373886108398
  },
  {
    "uid": "22936248",
    "score": 0.4910270869731903
  }
]

Later, this text will be ranked and sent back to the user.

Anwesh1 commented 3 years ago

Hey @jvwong, right, I see. When you say that for each user/client-provided input uid I must create a system to fetch the text, what does that mean exactly? They pass me a query (uid, text) and then I return the pipeline of data? I'm not exactly sure what you mean by "create a system to fetch the text"; what text are we talking about?

jvwong commented 3 years ago

Let's talk about this

JohnGiorgi commented 3 years ago

@Anwesh1 The general idea is this:

Currently, during every request to semantic search, we need to run the provided text through a neural network (slow). Even if a subsequent request contains the same IDs, we will still run its text through the NN (wasted computation).

To solve this, the proposal is that during every request to semantic search, we will first check whether an ID already exists in a FAISS index (the index object in the Colab). Only if it does not do we embed its text and add the embedding to the index. Finally, we use the index to do the semantic-similarity lookup (the index.search call in the Colab notebook). A rough sketch of this flow is below.
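
A rough sketch of that per-request flow (not the repo's actual code; embed() is a hypothetical stand-in for the existing neural-network encoder, and PMIDs are assumed to be integers so they can serve as FAISS IDs):

```python
import faiss
import numpy as np


def handle_request(index: faiss.IndexIDMap, query_uid: int, query_text: str,
                   documents: list, embed, top_k: int):
    """`documents` is a list of (uid, text) pairs; embed(text) returns a 768-d vector."""
    # IDs already stored in the index.
    known = set(faiss.vector_to_array(index.id_map))

    # Embed and add only the documents whose PMIDs are not already indexed.
    new = [(uid, text) for uid, text in documents if uid not in known]
    if new:
        vectors = np.vstack([embed(text) for _, text in new]).astype("float32")
        ids = np.array([uid for uid, _ in new], dtype="int64")
        index.add_with_ids(vectors, ids)

    # Use the index to do the similarity lookup for the query.
    query = embed(query_text).reshape(1, -1).astype("float32")
    scores, ids = index.search(query, top_k)
    return [{"uid": int(i), "score": float(s)} for i, s in zip(ids[0], scores[0])]
```

(The query embedding could be cached the same way; re-embedding it here just keeps the sketch short.)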

The benefits are:

Does this make things a little clearer? Ask lots of questions

Anwesh1 commented 3 years ago

Hey @JohnGiorgi, I've been going through the Colab, sort of trying to simulate it on my machine. I'm not sure how the training ties in? Also how exactly do I pass in all the information we have of the PMID/text so that it can be indexed?

JohnGiorgi commented 3 years ago

> I'm not sure how the training ties in?

What do you mean by training?

> Also how exactly do I pass in all the information we have of the PMID/text so that it can be indexed?

We don't index text, we index vectors. I think that is a point of confusion.

Start simple. Maybe copy the notebook and then modify it so that, given a dictionary of PMID-vector key-value pairs, you can create a FAISS index that stores them (both the PMIDs and the vectors can be random for the time being). Then,

The notebook already does a lot of this.
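
A minimal version of that exercise, with random data just to show the mechanics:

```python
import faiss
import numpy as np

d = 768  # embedding dimension
rng = np.random.default_rng(0)

# Fake "PMID -> vector" pairs; in the real system these would come from the encoder.
data = {9887103: rng.random(d, dtype=np.float32),
        30049242: rng.random(d, dtype=np.float32),
        22936248: rng.random(d, dtype=np.float32)}

# IndexIDMap lets us add vectors under our own integer IDs (the PMIDs).
index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
index.add_with_ids(np.vstack(list(data.values())),
                   np.array(list(data.keys()), dtype=np.int64))
print(index.ntotal)  # 3
```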

If you aren't familiar with array manipulation in Python, start by learning NumPy. There are hundreds of great intro tutorials online. We will probably end up using PyTorch because of its GPU support, but it is more or less a drop-in replacement for NumPy, so don't worry about it at first.

Let me know if anything remains unclear.

Anwesh1 commented 3 years ago

Quick question: do I need to install CUDA?

JohnGiorgi commented 3 years ago

You shouldn't need to. Let's store everything in CPU memory for now as we are just scoping out the problem/solution. In any case, Colab comes with CUDA installed.

Anwesh1 commented 3 years ago

Oh, I see, I was trying to run it locally. I will run it on Colab then. It wasn't letting me install CUDA on WSL-Ubuntu for some reason.

Anwesh1 commented 3 years ago

While running the Colab notebook, I tried using the cpu_to_gpu function, but it threw an error (screenshot attached).

Then I took it out, because you had stated it was optional. However, index_flat.is_trained then failed, saying that index_flat is not defined, so I ran it with just index. The final answer I got was 100000, while in your notebook it was 200000; does this pose a problem by any chance?

JohnGiorgi commented 3 years ago

I can run the notebook end-to-end without error. If you are going to use cpu_to_gpu make sure you have enabled GPU usage (Runtime > Change Runtime Type). The index_flat thing was just a copy-paste error. Should be index. I updated the notebook.

Anwesh1 commented 3 years ago

Hey @JohnGiorgi, I was wondering how we make our data (all records) into vectors so that they can be indexed? I'm assuming each record will have a vector of its own that will be indexed and can be retrieved using the PMID?

Also for the dummy data code in the Colab notebook, I think that xb = np.random.random((nb, d)).astype('float32') makes a matrix of nb rows by d columns and fills it with random numbers between 0.0 and 1.0 of type float32, right?

However, I'm a little confused by the next line, xb[:, 0] += np.arange(nb) / 1000. What exactly is happening here? It looks like all the new computations are being stored in each column of xb? I'm not exactly certain though; hope you can clarify.

Also, what does the line query = embeddings[0,:].reshape(1, d) do exactly?

JohnGiorgi commented 3 years ago

> Hey @JohnGiorgi, I was wondering how we make our data (all records) into vectors so that they can be indexed? I'm assuming each record will have a vector of its own that will be indexed and can be retrieved using the PMID?

Correct.

> Also for the dummy data code in the Colab notebook, I think that xb = np.random.random((nb, d)).astype('float32') makes a matrix of nb rows by d columns and fills it with random numbers between 0.0 and 1.0 of type float32, right?

Yup. Always refer to the documentation when stuck.

> However, I'm a little confused by the next line, xb[:, 0] += np.arange(nb) / 1000. What exactly is happening here? It looks like all the new computations are being stored in each column of xb? I'm not exactly certain though; hope you can clarify.

Check the FAISS documentation; they say it's "just for fun". Concretely, it only modifies the first column (column 0): row i gets i/1000 added to its first component, so the values drift slightly with their row index.
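
A tiny demonstration, starting from zeros so only the affected column changes:

```python
import numpy as np

nb, d = 5, 4
xb = np.zeros((nb, d), dtype="float32")
xb[:, 0] += np.arange(nb) / 1000.  # only column 0 is modified
print(xb[:, 0])  # [0.    0.001 0.002 0.003 0.004]; the other columns are still all zeros
```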

> Also, what does the line query = embeddings[0,:].reshape(1, d) do exactly?

query = embeddings[0,:]
query.shape  # (768,)
query = query.reshape(1, d)
query.shape  # (1, 768)

The query needs to have a leading batch dimension. In fact, this is often true when working with NumPy arrays (to support broadcasting). This is the kind of thing you learn after playing with NumPy for a while, so I would encourage you to spend a few days getting acquainted!

Anwesh1 commented 3 years ago

@JohnGiorgi, where exactly are we converting our text to a vector? In the semantic search? Is it already doing that, with the score that we see now when we pass in a query and a set of documents?

JohnGiorgi commented 3 years ago

> where exactly are we converting our text to a vector?

https://github.com/PathwayCommons/semantic-search/blob/2143e1aa8f69e3a2657de018e993fce0a54d9f3e/semantic_search/main.py#L136

Each input is assigned a 768-dimensional vector.

I don't think you need to understand how the vectors are generated to solve the problem at hand. If you are just curious for your own sake, I would be happy to explain during our next meeting, or you can check out our paper/repo or just read up on semantic text similarity. The core ideas and intuitions aren't much different from word2vec, which is an extremely popular technique.
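
If you do want to poke at it, here is an illustrative sketch (not the repo's actual code or model; any 768-dimensional sentence-embedding model behaves similarly):

```python
from sentence_transformers import SentenceTransformer

# Example model only, not necessarily the one semantic-search uses.
model = SentenceTransformer("all-mpnet-base-v2")  # produces 768-dimensional vectors
vector = model.encode("The Drosophila activin receptor baboon signals through dSmad2...")
print(vector.shape)  # (768,)
```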

Anwesh1 commented 3 years ago

So I ran some tests so we can get a general understanding of how long this stuff takes and also whether we can store our data. Firstly, yes, we can store the data using write_index(); the saved file isn't human-readable because it's a binary encoding (see attached screenshot), but the program can read it back with read_index().
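
Roughly, the save/load round trip looks like this (the filename is just an example):

```python
import faiss
import numpy as np

index = faiss.IndexIDMap(faiss.IndexFlatL2(768))
index.add_with_ids(np.random.rand(3, 768).astype("float32"),
                   np.array([9887103, 30049242, 22936248], dtype=np.int64))

faiss.write_index(index, "pubmed.index")  # binary file, not human-readable
index = faiss.read_index("pubmed.index")  # restores the index, stored IDs included
print(index.ntotal)  # 3
```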

Some of the stats are below:

10,000 records

100,000 records

1,000,000 records

Updates:

JohnGiorgi commented 3 years ago

Cool, thanks for quantifying this. Those numbers don't look too bad. It looks like there is some clever compression going on, because 10^6 records take only 3x the memory of 10^5 records.

I think we are ready to start working on adding the index to semantic search. How do you feel about starting this on a new branch? We could use Thursday to talk details.

It may also be helpful for you if we did a pair-programming session? I have had a lot of success with this using Visual Studio Code's Live Share feature. Totally up to you, though.

Anwesh1 commented 3 years ago

Hey @JohnGiorgi, yeah, definitely something smart is going on under the hood. I couldn't run any queries against an index of 1M+ records because the RAM maxed out and Colab booted me; I doubt we need such massive numbers anyway.

Yeah, for sure, we can discuss the details more on Thursday. I can start with a new branch and we can take it from there. Pair programming does sound cool; we can definitely give it a shot when you have the time.