Closed JohnGiorgi closed 3 years ago
@Anwesh1 This is a rather big job. I would go through the semantic-search repo to get a feel for what it does (ask lots of questions), then the Colab above, then the FAISS documentation.
@JohnGiorgi Okay sounds good. I'll go through it, thanks.
Hey @JohnGiorgi, so I looked around the `semantic-search` program. I made a basic Python script that does the POST stuff. I was playing around with it, and it seems to run with the data we get from `pubmed-dl`; we might have to go back in there and put a comma after every line to adhere to the `semantic-search` format. I will try to get an understanding of the FAISS stuff as well.
@Anwesh1 Do you have an example of what you mean by the comma after each line? `semantic-search` accepts input text up to `MAX_LENGTH` (a parameter you pass to the server on startup); you shouldn't need to comma-separate any data (in fact, that might cause issues).
Umm, so essentially I was just testing it inside of a Python script I set up. I `json.dumps()` the data below into the POST request:
```json
{
  "query": {"uid": "9887103", "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development. The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms..."},
  "documents": [
    {"uid": "9887103", "text": "The Drosophila activin receptor baboon signals through dSmad2 and controls cell proliferation but not patterning during larval development. The TGF-beta superfamily of growth and differentiation factors, including TGF-beta, Activins and bone morphogenetic proteins (BMPs) play critical roles in regulating the development of many organisms..."},
    {"uid": "30049242", "text": "Transcriptional up-regulation of the TGF-β intracellular signaling transducer Mad of Drosophila larvae in response to parasitic nematode infection. The common fruit fly Drosophila melanogaster is an exceptional model for dissecting innate immunity..."},
    {"uid": "22936248", "text": "High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Many eukaryotic genes possess multiple alternative promoters with distinct expression specificities..."}
  ],
  "top_k": 3
}
```
Here, as you can see, after every `"text"` entry's closing brace there is a comma. However, the data we get from `pubmed-dl` doesn't have a comma after every record's closing brace. So if we tweak that, I think we can pass it as a whole without too much interference? Unless we are planning on going a different way.
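For reference, a minimal sketch of the kind of script described above. The URL and port are placeholders, not `semantic-search`'s actual endpoint, and the abstracts are truncated:

```python
# Sketch of a script that serializes the payload and POSTs it to the server.
# NOTE: the URL/port below are placeholders, not the real endpoint.
import json

payload = {
    "query": {"uid": "9887103", "text": "The Drosophila activin receptor ..."},
    "documents": [
        {"uid": "9887103", "text": "The Drosophila activin receptor ..."},
        {"uid": "30049242", "text": "Transcriptional up-regulation ..."},
        {"uid": "22936248", "text": "High-fidelity promoter profiling ..."},
    ],
    "top_k": 3,
}

body = json.dumps(payload)  # serialized JSON sent as the POST body

# With the server running, the request itself would be something like:
# import requests
# response = requests.post("http://localhost:8000/", data=body)
# print(response.json())
print(json.loads(body)["top_k"])  # 3
```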
Here's how semantic search works: you pass a `query` uid and a list of uids inside an enclosing `documents` field:

```json
{
  "query": "9887103",
  "documents": ["9887103", "30049242", "22936248"],
  "top_k": 3
}
```
The response is a list of uids ranked by similarity score:

```json
[
  {"uid": "9887103", "score": 1.0},
  {"uid": "30049242", "score": 0.6427373886108398},
  {"uid": "22936248", "score": 0.4910270869731903}
]
```
PREVIOUS WORK: fetching the `text`.
CURRENT GOAL: for each user/client-provided input uid, create a system to fetch the corresponding `text`. Later, this text will be ranked and sent back to the user.
Hey @jvwong, right, I see. When you say that for each user/client-provided input uid I must create a system to fetch the text, what does that mean exactly? They pass me a query (uid, text), then I return the pipeline of data? I'm not exactly sure what you mean by "I must create a system to fetch the text"; what `text` are we talking about?
Let's talk about this
@Anwesh1 The general idea is this:
Currently, during every request to semantic search, we need to run the provided text through a neural network (slow). Even if a subsequent request contains the same IDs, we will still run its text through the NN (wasted computation).
To solve this, the proposal is that during every request to semantic search, we will first check if an ID exists in a FAISS index (the `index` object in the Colab). Only if it does not do we embed it and then add its embedding to the index. Finally, we use the index to do the semantic similarity lookup (the `index.search` of the Colab notebook).
The benefits are:
Does this make things a little clearer? Ask lots of questions
Hey @JohnGiorgi, I've been going through the Colab, sort of trying to simulate it on my machine. I'm not sure how the training ties in? Also, how exactly do I pass in all the information we have of the PMID/text so that it can be indexed?
> I'm not sure how the training ties in?
What do you mean by training?
> Also how exactly do I pass in all the information we have of the PMID/text so that it can be indexed?
We don't index text, we index vectors. I think that is a point of confusion.
Start simple. Maybe copy the notebook and then modify it so that, given a dictionary of PMID/vector key-value pairs, you can create a FAISS index that stores them (both the PMIDs and the vectors can be random for the time being). Then build up from there.
The notebook already does a lot of this.
If you aren't familiar with array manipulation in Python, start by learning `numpy`. There are hundreds of great intro tutorials online. We will probably end up using `pytorch` because of its GPU support, but it is basically a drop-in replacement for numpy, so don't worry about that at first.
Let me know if anything remains unclear.
Quick question: do I need to download CUDA?
You shouldn't need to. Let's store everything in CPU memory for now as we are just scoping out the problem/solution. In any case, Colab comes with CUDA installed.
Oh, I see, I was trying to run it locally. I will run it on Colab then. It wasn't letting me download CUDA on WSL-Ubuntu for some reason.
While running the Colab notebook, I tried using the `cpu_to_gpu` function, but it threw an error. I then took it out, because you had stated that it was optional. However, `index_flat.is_trained` failed because `index_flat` is not defined, so I ran it with just `index`, but the final answer I got was 100000 while in your notebook it was 200000; does this pose a problem by any chance?
I can run the notebook end-to-end without error. If you are going to use `cpu_to_gpu`, make sure you have enabled GPU usage (Runtime > Change Runtime Type). The `index_flat` thing was just a copy-paste error; it should be `index`. I updated the notebook.
Hey @JohnGiorgi, I was wondering how we make our data (all records) into vectors so that they can be indexed? I'm assuming each record will have a vector of its own that will be indexed and can be retrieved using the PMID?
Also, for the dummy-data code in the Colab notebook, I think that `xb = np.random.random((nb, d)).astype('float32')` makes a matrix of `nb` rows by `d` columns and fills it with random numbers between 0.0 and 1.0 of type float32, right?
However, I'm a little confused by the next line, `xb[:, 0] += np.arange(nb) / 1000.`; what exactly is happening here? It looks like all the new computations are being stored in each column of `xb`? I'm not exactly certain though, hope you can clarify.
Also, the `query = embeddings[0,:].reshape(1, d)`: what does this do exactly?
> Hey @JohnGiorgi, I was wondering how we make our data (all records) into vectors so that they can be indexed? I'm assuming each record will have a vector of its own that will be indexed and can be retrieved using the PMID?
Correct.
> Also for the dummy data code in the Colab notebook, I think that `xb = np.random.random((nb, d)).astype('float32')` makes a matrix of `nb` rows by `d` columns and fills them with random numbers between 0.0 and 1.0 of type float32 right?
Yup. Always refer to the documentation when stuck.
> However, I'm a little confused by the next line `xb[:, 0] += np.arange(nb) / 1000.` what is exactly happening here? It looks like all the new computations are being stored in each column of `xb`? I'm not exactly certain though, hope you can clarify.
Check the FAISS documentation. They say it's "just for fun".
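Concretely, that line only touches the first column: row `i` gets `i / 1000` added to its first component, so vectors drift slightly with their position in the matrix. A tiny example:

```python
import numpy as np

nb, d = 4, 3
xb = np.zeros((nb, d), dtype=np.float32)
xb[:, 0] += np.arange(nb) / 1000.  # adds i/1000 to the FIRST column of row i
print(xb[:, 0])  # [0.    0.001 0.002 0.003]
```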
> also the `query = embeddings[0,:].reshape(1, d)`, what does this do exactly?
```python
query = embeddings[0, :]     # first row of the embeddings matrix
query.shape                  # (768,)
query = query.reshape(1, d)  # add a leading batch dimension
query.shape                  # (1, 768)
```
The query needs to have a leading dimension. In fact, this is true often when working with NumPy arrays (to support broadcasting). This is the kind of thing you learn after playing with NumPy for a while so I would encourage you to spend a few days getting acquainted!
@JohnGiorgi, where exactly are we converting our text to a vector? In the semantic search? Is it already doing that, with the score that we see now when we pass in a query and a set of documents?
> where exactly are we converting our text to a vector?
Each input is assigned a 768-dimensional vector.
I don't think you need to understand how vectors are generated to solve the problem at hand. If you are just curious for your own sake I would be happy to explain during our next meeting, or you can check out our paper/repo or just read up on semantic text similarity. The core ideas and intuitions aren't much different than word2vec which is an extremely popular technique.
So I ran some of the tests so we can get a general understanding of how long this stuff takes, and also whether we can store our data.
Firstly, yes, we can store the data using `write_index()`; the saved data is unreadable by us because it is encoded in some way, but it is readable by the program with `read_index()`.
Some of the stats are below:
- 10,000 records
- 100,000 records
- 1,000,000 records
Updates:
Cool, thanks for quantifying this. Those numbers don't look too bad. Looks like there is some clever compression going on because 10^6 records take only 3X the memory as 10^5 records.
I think we are ready to start working on adding the index to semantic search. How do you feel about starting this on a new branch? We could use Thursday to talk details.
It may also be helpful for you if we do a paired programming session? I have had a lot of success with this using Visual Studio Code's LiveShare feature. Totally up to you though.
Hey @JohnGiorgi, yeah, definitely something smart is going on under the hood. I couldn't run any queries against 1M+ indexed records because the RAM maxed out and Colab booted me; I doubt we need such massive numbers anyway.
Yeah for sure, we can discuss the details more on Thursday. I can start with a new branch and we can take it from there. The paired programming does sound cool, we can definitely give it a shot when you have the time.
Overview
We should be using FAISS to index the embeddings by their PMIDs. That way, when a PMID is included in a request, we can first check if it exists in the index rather than pass its text through the neural network again.
The basic process is outlined in the following Colab notebook. The trick will be to think of a simple interface within `semantic-search` to create, load, and update indices.
Rough outline
My general feeling for how indexing will work...
Based on today's conversation, immediate next steps:
Persist the index to disk (with `faiss.write_index()`).