epiverse-connect / epiverse-search


1 data point per document vs 1 data point per package #18

Open Bisaloo opened 1 month ago

Bisaloo commented 1 month ago

This again came up in a discussion with @avinashladdha:

Do we want to have 1 data point per document or 1 data point per package? How to make this happen?

From a user point of view, it probably makes more sense to have a single data point (single point on the map & single search answer) per package.

Currently, we have multiple documents per package, so if we want to have 1 point per package, how can we do this?

I don't know if it makes sense to concatenate all documents to have a single document per package, as we may end up averaging points with a large amount of variability and end up with an average that is not meaningful.

@avinashladdha mentioned we could have a post-process deduplication step where we keep only the best score / best matching document for each package. Are there any downsides to this approach? How could we apply something similar to the map?

avinashladdha commented 1 month ago

From a user point of view, it probably makes more sense to have a single data point (single point on the map & single search answer) per package.

Currently, we are returning the final response at the package level instead of the document/module level, so the user will get one data point with either approach.

a. For the 1-document-1-embedding approach: calculate the similarity score for each document in a package and keep the highest one. The top 3 scores (and thus the relevant package names) across all packages are returned as the final output.

b. For the 1-package-1-embedding approach: calculate the similarity score for each package (this requires aggregating embeddings across all its documents) and return the top 3 results (see the sketch below).
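For concreteness, here is a minimal sketch of both options, assuming plain cosine similarity; the names `query` and `docs_by_package` are illustrative, not our actual implementation:

```python
import numpy as np

def cosine_sim(query, embeddings):
    # Cosine similarity between a query vector (d,) and a matrix of vectors (n, d).
    return (embeddings @ query) / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))

def top3_per_document(query, docs_by_package):
    # Option a: score every document, keep the best document score per package.
    scores = {
        pkg: cosine_sim(query, embeddings).max()
        for pkg, embeddings in docs_by_package.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]

def top3_per_package(query, docs_by_package):
    # Option b: aggregate document embeddings into one vector per package
    # (here: a simple mean), then score the aggregates.
    scores = {
        pkg: float(cosine_sim(query, embeddings.mean(axis=0, keepdims=True)))
        for pkg, embeddings in docs_by_package.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
```

Option (a) is essentially the post-process deduplication described above: every document is scored, but only the best-matching one represents its package.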

avinashladdha commented 1 month ago

Based on what is more relevant for Epi package search, we can take either route.

From an embedding/visualisation point of view, I would note that concatenating all documents to get 1 embedding per package might not be the best approach: it will dilute individual document nuances, and if the combined text exceeds the model's size threshold we would need to break it into chunks, in which case we would have to aggregate the embeddings anyway.

Other approaches to aggregating embeddings for a single package could be the following:

  1. Averaging embeddings for a package across the constituent documents.
  2. Better still, average the word vectors and then subtract the first principal component, which reduces the dominance of common words and may retain more meaningful semantic information (sketched below).
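A rough sketch of option 2, which resembles the post-processing step of the SIF baseline (Arora et al., ICLR 2017). Whether the principal component is estimated per package or over the whole corpus is a design choice (the SIF recipe uses the full corpus); the function names here are hypothetical:

```python
import numpy as np

def first_principal_direction(all_embeddings):
    # First right singular vector of the centred corpus matrix,
    # i.e. the direction of greatest shared variance across all documents.
    centred = all_embeddings - all_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

def package_embedding(package_doc_embeddings, pc1):
    # Average the package's document embeddings, then subtract the
    # projection onto the corpus-wide first principal direction.
    mean_vec = package_doc_embeddings.mean(axis=0)
    return mean_vec - (mean_vec @ pc1) * pc1

# Usage sketch: estimate pc1 once over all documents, then aggregate per package.
# pc1 = first_principal_direction(all_doc_embeddings)
# vec = package_embedding(docs_by_package["somepackage"], pc1)
```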
Bisaloo commented 1 month ago

This was discussed today with the WHO Collaboratory team at our monthly stand-up and there was not a very strong push to either side.

There seemed to be a small preference for returning a specific tool (i.e., 1 data point per package), with the caveat that we should then also indicate which document led to the high score.

One option may also be to build both alternatives and see which one gathers more positive user feedback.

paulkorir commented 1 month ago

I also agree that we should have results drill down to the module level, though I'm not sure how many embeddings this implies. From my experiments (https://github.com/paulkorir/working-with-embeddings/blob/master/experiments.py) I would imagine that you will only have one embedding model. I could be wrong.

paulkorir commented 1 month ago

OK. I've been getting up to speed with the topic and it seems to me that applying an embedding model to a set of documents results in a set of vectors. There is only ever one embedding model at play. This embedding model can operate at the level of the word, the sentence, or the document. Therefore, the decision to be made is which level of embedding will be most useful. In my opinion, we should try them all and examine the results to select the best one.
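To illustrate the "one model, one vector per input unit" point, a minimal sketch using the sentence-transformers library (the model name is just an example, not a recommendation):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384-dim vectors

documents = ["First vignette text ...", "Second vignette text ..."]
sentences = ["A single sentence."]

# The same model maps each input unit to one fixed-size vector,
# whether the unit is a sentence or a whole document. Note that
# documents longer than the model's token limit get truncated.
doc_vectors = model.encode(documents)       # shape: (2, 384)
sentence_vectors = model.encode(sentences)  # shape: (1, 384)
```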

chartgerink commented 1 month ago

for what it's worth - the 2d map as we have talked about it until now has always been at the package level. If we do it at the document level, we may get a cluster for one package, or not. 👍

paulkorir commented 1 month ago

I hear you. I believe it may be most useful for the user to see the viz in terms of the final tool they need, not necessarily the package. In any case, it would be useful for the user to toggle between the package and module levels: at the package level they will know what to install, and at the module level they will know which function to run.

Bisaloo commented 1 month ago

I don't think we can identify specific functions with the initial infrastructure because the source data (= the documentation we feed to the language model) is not structured by function.

What can be done is what Dina proposed: we return the package name and a link to the source document that led us to return this result. From there, the user can read the document and see how they can perform their task, which will often be a combination of steps/functions.

In a future version, we can try to make a "best guess" at the function call(s) to perform the queried task, but I believe that's a distinct issue, likely one that will require using the generative feature of our language model. We can open a new issue to track this.

paulkorir commented 1 month ago

Can this be solved at the level of documentation extraction? It could be substantially easier to do this during data extraction than downstream during search.

Bisaloo commented 1 month ago

Can this be solved at the level of documentation extraction?

No, because a large portion of the source documents do not present the tool by function but by task or topic, and these tasks usually involve multiple functions.

See for example https://epiforecasts.io/EpiNow2/articles/estimate_infections_workflow.html or https://epiverse-trace.github.io/finalsize/articles/finalsize.html

paulkorir commented 1 month ago

I see. That makes sense. However, I thought that the reference documentation (e.g. https://epiverse-trace.github.io/finalsize/reference/dot-final_size.html) would also be included. These would be at the function level.

paulkorir commented 1 month ago

This one is even better and it is pertinent to a single function: https://epiverse-trace.github.io/finalsize/reference/final_size.html.

Bisaloo commented 1 month ago

I thought that the reference documentation (e.g. epiverse-trace.github.io/finalsize/reference/dot-final_size.html) would also be included.

Yes, both this and the other type of document I shared are included, but it is unclear which ones will usually lead to better results. This is why I propose we delay this specific feature until we have good results at the package level and can identify which documents (reference manual or articles/vignettes) produced those results.

Since it seems we are slightly deviating from the initial conversation, I have opened https://github.com/epiverse-connect/epiverse-search/issues/21.

In this issue, let's try to stick to the initial question: how do we go from multiple documents per package to 1 point per package? Should we concatenate documents before feeding them to the LM? Should we compute embeddings per document but only return the one with the highest score? etc.

Bisaloo commented 3 weeks ago

The current approach that Avinash is using to summarise the multiple documents into a single data point is to average the embeddings.

I had a quick look at the approach using a PCA to generate the map (to be refined in https://github.com/epiverse-connect/epiverse-map/issues/10) and the various documents for a given package (one colour per package in the plot below) are spread across the map. I'm therefore afraid we'd get averages that do not represent what we want.

*(screenshot: 2D PCA map of document embeddings, coloured by package)*
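For reference, a toy version of the code behind this kind of map; the random stand-in data and label names are placeholders for our actual document embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical inputs: one embedding per document, plus its package label.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(60, 384))    # stand-in for real embeddings
packages = [f"pkg{i % 6}" for i in range(60)]  # stand-in package labels

# Project all document embeddings onto the first two principal components.
coords = PCA(n_components=2).fit_transform(doc_embeddings)

# One colour per package; spread within a colour is the concern raised above.
for pkg in sorted(set(packages)):
    mask = np.array(packages) == pkg
    plt.scatter(coords[mask, 0], coords[mask, 1], label=pkg, s=12)
plt.legend(title="package")
plt.show()
```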

I wonder if we could have better results by:

avinashladdha commented 3 weeks ago

The primary concern with concatenating documents is that the resulting embeddings will be heavily influenced by the number of documents in each folder: folders with more documents will have a disproportionate impact on the overall representation. I am also unsure how to interpret two matrices when one (matrix A) has 5000 data points and the other (matrix B) has 1000, and we pad the smaller one with 4000 zeros to give them the same dimensions (required for computation in the next steps).

paulkorir commented 3 weeks ago

Noted. However, given that the search process tries to find the set of vectors that match most closely with the search vector, and provided that the resulting embeddings are non-random (which they should be, because of the encoded semantic content), the number of embeddings per folder should not be a problem. It would be useful to see whether the search results are distorted by a disproportionate number of embeddings.
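One way to run that check in isolation: simulate the null case where embeddings carry no semantic signal at all and see how often the larger package wins under per-document max scoring. A toy sketch, not the real pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_trials = 64, 2000

def unit(v):
    # Normalise vectors to unit length so dot product = cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two packages with purely random document embeddings:
# package A has 50 documents, package B has 5.
wins_a = 0
for _ in range(n_trials):
    query = unit(rng.normal(size=dim))
    docs_a = unit(rng.normal(size=(50, dim)))
    docs_b = unit(rng.normal(size=(5, dim)))
    # Per-document max scoring, as in option (a) above: more documents
    # means more draws, so a higher expected maximum.
    if (docs_a @ query).max() > (docs_b @ query).max():
        wins_a += 1

print(f"package with 10x more documents wins {wins_a / n_trials:.0%} of random queries")
```

With real embeddings, genuine semantic similarity may well dominate this count effect, which is the point made above; the simulation just isolates the worst case.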

Bisaloo commented 2 weeks ago

Folders with more documents will have a disproportionate impact on the overall representation.

I don't follow why this would be the case. As long as we have a single vector per package, all packages should have the same weight, no?