WEHI-ResearchComputing / rag

RAG toy example to build on. Copied from https://github.com/pixegami/rag-tutorial-v2

Figure out how to get the database to properly recognise author papers #5

Open edoyango opened 2 months ago

edoyango commented 2 months ago

Currently, the database doesn't seem to retrieve papers based on author(s), e.g.

python query_data.py "Do you know of any papers authored by Edward Yang?"
Response:  Yes, based on the provided context, it is known that Edward Yang has 
co-authored a research paper titled "Numerical investigation of the mechanism of granular 
flow impact on rigid control structures". The year of submission and acceptance are 
provided as well.
Sources: ['data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:16:1', 
'data/RCP-Projects.aspx.html:None:22', 
'data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:15:12', 
'data/1-s2.0-S0266352X20300379-main-1.pdf:20:9', 'data/s11440-021-01162-4.pdf:0:0']

Which is partially right, as it's one of the papers included in the dataset. Interestingly, data/1-s2.0-S0266352X20300379-main-1.pdf (my other paper included in the dataset) was ranked as more relevant by the search, but wasn't mentioned by the LLM, probably because the database returned a chunk from later in the paper.

Another example:

python query_data.py "Do you know of any papers authored by Michael Milton?"
Response:  No, there is no information in the provided context that indicates if Michael Milton 
has authored any papers or not. The term "Milton" refers to a high-performance computer 
(HPC) at WEHI, not an individual author.
Sources: ['data/RCP-Projects.aspx.html:None:22', 
'data/Milton-SLURM-2022-uplift.aspx.html:None:0', 
'data/What-is-Milton.aspx.html:None:0', 
'data/RCP-AnnualSummary.aspx.html:None:20', 
'data/RCP-AnnualReport.aspx.html:None:20']

Need to figure out how to get the database to return/recognise author information.

edoyango commented 2 months ago

Tried to prepend each chunk with the article's title and authors, but didn't help at all.
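For reference, a minimal sketch of the prepending step, assuming each chunk comes with optional `title` and `authors` values (the function name and parameters are hypothetical; the actual code in the repo may differ). The header wording is copied from the chunks shown in this thread, including the `UNKNOWN` fallback for unannotated documents:

```python
from typing import Optional, Sequence


def prepend_source_header(
    text: str,
    title: Optional[str] = None,
    authors: Optional[Sequence[str]] = None,
) -> str:
    """Prefix a chunk with a sentence naming its article and authors.

    Missing values fall back to "UNKNOWN", as in the chunks from
    documents that were not annotated.
    """
    title_part = title if title else "UNKNOWN"
    authors_part = ", ".join(authors) if authors else "UNKNOWN"
    header = (
        f"This chunk was taken from the article: {title_part}, "
        f"who's authors are: {authors_part}"
    )
    # The header is part of the embedded text, so it only helps retrieval
    # if the embedding model gives the author names enough weight.
    return f"{header}\n{text}"


annotated = prepend_source_header(
    "Contents lists available at ScienceDirect ...",
    title="Using biomechanics to investigate the effect of VR on eye vergence system",
    authors=["Julie Iskander", "Mohammed Hossny", "Saeid Nahavandi"],
)
print(annotated.splitlines()[0])
```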

Detailed example

A few chunks for example (all three chunks taken from the same paper authored by Julie Iskander):

```
This chunk was taken from the article: Using biomechanics to investigate the effect of VR on eye vergence system, who's authors are: Julie Iskander, Mohammed Hossny, Saeid Nahavandi Contents lists available at ScienceDirect Applied Ergonomics journal homepage: www.elsevier.com/locate/apergo Using biomechanics to investigate the e ffect of VR on eye vergence system Julie Iskander*, Mohammed Hossny, Saeid Nahavandi Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia ARTICLE INFO Keywords: Virtual realityEye vergence movement Eye tracking Biomechanical simulationExtraocular musclesABSTRACT Vergence-accommodation con flict (VAC) is the main contributor to visual fatigue during immersion in virtual environments. Many studies have investigated the eff ects of VAC using 3D displays and expensive complex apparatus and setup to create natural and con flicting viewing conditions. However, a limited number of studies
```

```
This chunk was taken from the article: Using biomechanics to investigate the effect of VR on eye vergence system, who's authors are: Julie Iskander, Mohammed Hossny, Saeid Nahavandi targeted virtual environments simulated using modern consumer-grade VR headsets. Our main objective, in this work, is to test how the modern VR headsets (VR simulated depth) could a ffect our vergence system, in addition to investigating the eff ect of the simulated depth on the eye-gaze performance. The virtual scenario used in- cluded a common virtual object (a cube) in a simple virtual environment with no constraints placed on the headand neck movement of the subjects. We used ocular biomechanics and eye tracking to compare between ver-gence angles in matching (ideal) and con flicting (real) viewing conditions. Real vergence angle during im- mersion was signi ficantly higher than ideal vergence angle and exhibited higher variability which leads to
```

```
This chunk was taken from the article: Using biomechanics to investigate the effect of VR on eye vergence system, who's authors are: Julie Iskander, Mohammed Hossny, Saeid Nahavandi incorrect depth cues that a ffects depth perception and also leads to visual fatigue for prolonged virtual ex- periences. Additionally, we found that as the simulated depth increases, the ability of users to manipulate virtual objects with their eyes decreases, thus, decreasing the possibilities of interaction through eye gaze. The bio- mechanics model used here can be further extended to study muscular activity of eye muscles during immersion.It presents an efficient and flexible assessment tool for virtual environments. 1. Introduction Virtual reality (VR) headsets have become more a ffordable and accessible to a broader population that includes young adults and children. In a few years, it turned from expensive devices that needed
```

But when querying:

```bash
python query_data.py "Can you give me the title of any papers authored by Julie Iskander please"
```

The corresponding prompt with the "relevant" chunks:

```
Human: Answer the question based only on the following context:

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN vol. 36, no. 3, pp. 321–328, 2003. [184] T. S. Buchanan, D. G. Lloyd, K. Manal, and T. F. Besier, ‘‘Estimation of muscle forces and joint moments using a forward-inverse dynamics model,’’ Med. Sci. Sports Exercise , vol. 37, no. 11, pp. 1911–1916, 2005. JULIE ISKANDER received the B.Sc. and M.Sc. degrees in electrical engineering from Alexandria University, Egypt, in 2004 and 2009, respec- tively. She is currently pursuing the Ph.D. degree with the Institute for Intelligent Systems Research and Innovation, Deakin University. She was with the Information Technology Institute as a Teach- ing Assistant, then a Software Development Department Head, and then as a Branch Man- ager. Her research interests include neuromuscular

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN ager. Her research interests include neuromuscular modeling and ocular motility biomechanics. In addition, she focuses on analyzing and differentiating different mental states from eye tracking. 19360 VOLUME 6, 2018

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN Figure 3: Fields of research in which respondents work. Multiple options could beselected. Topics included under ‘Other’ were: proteomics, venoms, mass spec omics data,epigenomics, conservation, metagenome, environment, ecology, molecular nutrition,computational biochemistry, transcriptomics, wastewater treatment, epigenetics,microbiology, food science 7

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN (P.M.L.), phylogenetic analysis (P.M.L., C.G., and M.M.), protein structuralmodeling (R.G., P.M.L., and C.G.), gas chromatography measurements(P.M.L., E.T-M., and L.J.), H 2uptake kinetic characterization (P.M.L.), archaeal survival assay (P.M.L.), biochemical characterization (J.P.L., P.M.L., R.G., and A.K.), shotgun proteomics (P.M.L., H.L., I.H., E.T., and R.B.S.), andecological theory (P.M.L., C.G., M.B.S., C.R.C. and H.A.P.). P.M.L., C.G., andR.G. analyzed data and wrote the manuscript with inputs from all authors. Competing interests The authors declare no competing interests. Additional information Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-024-47324-2 .

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN Vice-Chancellor (Defence Technologies), the Chair of Engineering, and the Director of the Institute for Intelligent Systems Research and Innovation with Deakin University. He has authored or co-authored over 600 papers in various international journals and conferences. His research interests include the modeling of complex systems, robotics, and haptics. He is a fellow of Engineers Australia (FIEAust) and the Institution of Engineering and Technology. He is the Co-Editor-in-Chief of IEEE S YSTEMS JOURNAL , an Associate Editor of IEEE/ASME T RANSACTIONS ON MECHATRONICS , an Associate Editor of IEEE T RANSACTIONS ON SYSTEMS , M AN AND CYBERNETICS : SYSTEMS , and an IEEE Access Editorial Board Member. VOLUME 6, 2018 19361

---

Answer the question based on the above context: Can you give me the title of any papers authored by Julie Iskander please
```

And Ollama (Mistral) answers:

```
Response: Unfortunately, the provided context does not contain information about the titles of papers authored by Julie Iskander.
Sources: ['data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:15:12', 'data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:15:13', 'data/1_Australian_bioinformatics_training_needs_survey_2021_22_Report.pdf:7:0', 'data/s41467-024-47324-2.pdf:16:4', 'data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:16:1']
```

When querying the database for papers authored by Julie Iskander, the Chroma DB similarity search failed to notice that "Julie Iskander" was in the prepended author list.

Changing the text around the title and authors didn't change anything either, e.g. formatting the header as a dict rather than a sentence didn't really help.

Alternatively, I could probably use metadata filtering instead: https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/#filtering-on-metadata
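A rough sketch of the idea behind metadata filtering, written in plain Python rather than against Chroma's API: restrict the candidates to chunks whose metadata lists the author, then rank only those by similarity. Everything here is an assumption for illustration (the `authors` and `embedding` fields, the function names, and the toy 2-D vectors standing in for real embeddings).

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_by_author(
    chunks: List[Dict], query_vec: Sequence[float], author: str, k: int = 5
) -> List[Dict]:
    """Keep only chunks whose metadata names the author, then rank by similarity."""
    wanted = author.casefold()
    candidates = [
        c for c in chunks if any(wanted == a.casefold() for a in c["authors"])
    ]
    candidates.sort(key=lambda c: cosine(c["embedding"], query_vec), reverse=True)
    return candidates[:k]


# Toy data: hypothetical chunk ids, with 2-D vectors in place of embeddings.
chunks = [
    {"id": "paper_a:0", "authors": ["Julie Iskander", "Mohammed Hossny"], "embedding": [1.0, 0.0]},
    {"id": "paper_a:1", "authors": ["Julie Iskander", "Mohammed Hossny"], "embedding": [0.6, 0.8]},
    {"id": "paper_b:0", "authors": [], "embedding": [1.0, 0.1]},  # not annotated
]

hits = search_by_author(chunks, [1.0, 0.0], "julie iskander", k=2)
print([c["id"] for c in hits])
```

The exact-match filter means the author name no longer has to win the similarity search, but it only works for documents that were annotated, and the author name has to be extracted from the question somehow before the search runs.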

edoyango commented 2 months ago

I wanted to try other models that might perform better with publication documents (e.g. specter2).

This led me to try using langchain_huggingface.embeddings.HuggingFaceEmbeddings instead of langchain_community.embeddings.ollama.OllamaEmbeddings, because Ollama has pretty limited model compatibility (ref), whereas HuggingFaceEmbeddings works with more models. These changes are on a separate branch (https://github.com/WEHI-ResearchComputing/rag/tree/ollama-to-hf).

Looking at the MTEB leaderboard, I tried Alibaba-NLP/gte-large-en-v1.5, which gave me better results.

Prompt:

```
Human: Answer the question based only on the following context:

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN musculoskeletal modeling and simulation framework for in silico investi- gations and exchange,’’ Procedia IUTAM , vol. 2, pp. 212–232, Jan. 2011. [182] D. G. Thelen and F. C. Anderson, ‘‘Using computed muscle control to generate forward dynamic simulations of human walking from experi- mental data,’’ J. Biomech. , vol. 39, no. 6, pp. 1107–1115, 2006. [183] D. G. Thelen, F. C. Anderson, and S. L. Delp, ‘‘Generating dynamic simulations of movement using computed muscle control,’’ J. Biomech. , vol. 36, no. 3, pp. 321–328, 2003. [184] T. S. Buchanan, D. G. Lloyd, K. Manal, and T. F. Besier, ‘‘Estimation of muscle forces and joint moments using a forward-inverse dynamics model,’’ Med. Sci. Sports Exercise , vol. 37, no. 11, pp. 1911–1916, 2005. JULIE ISKANDER received the B.Sc. and M.Sc. degrees in electrical engineering from Alexandria University, Egypt, in 2004 and 2009, respec-

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN University, Egypt, in 2004 and 2009, respec- tively. She is currently pursuing the Ph.D. degree with the Institute for Intelligent Systems Research and Innovation, Deakin University. She was with the Information Technology Institute as a Teach- ing Assistant, then a Software Development Department Head, and then as a Branch Man- ager. Her research interests include neuromuscular modeling and ocular motility biomechanics. In addition, she focuses on analyzing and differentiating different mental states from eye tracking. 19360 VOLUME 6, 2018

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN delivery method for reaching dispersed and distant trainees.PLOS Comput. Biol.17, e1008715 (2021).5.Unsworth, Kathrynet al.DReSA: Project team reflections.(2021)doi:10.5281/ZENODO.5712128.6.Beard, N.et al.TeSS: a platform for discoveringlife-science training opportunities.Bioinformatics36, 3290–3291 (2020).

---

This chunk was taken from the article: UNKNOWN, who's authors are: UNKNOWN Institute for Intelligent Systems Research and Innovation with Deakin University. He has authored or co-authored over 600 papers in various international journals and conferences. His research interests include the modeling of complex systems, robotics, and haptics. He is a fellow of Engineers Australia (FIEAust) and the Institution of Engineering and Technology. He is the Co-Editor-in-Chief of IEEE S YSTEMS JOURNAL , an Associate Editor of IEEE/ASME T RANSACTIONS ON MECHATRONICS , an Associate Editor of IEEE T RANSACTIONS ON SYSTEMS , M AN AND CYBERNETICS : SYSTEMS , and an IEEE Access Editorial Board Member. VOLUME 6, 2018 19361

---

This chunk was taken from the article: Using biomechanics to investigate the effect of VR on eye vergence system, who's authors are: Julie Iskander, Mohammed Hossny, Saeid Nahavandi eye and head tracking data. The paper is organised as follows. Section 2describes the experi- mental procedure, the participants and the apparatus used. The visualtask and the biomechanical simulation are also described here. Section 3presents the statistical analysis results of the di fferent tests performed. Andfinally a discussion is presented in Section 4and a conclusion in Section 5. 2. Methods In this section, we describe the design of the experiment and the processing and analysis done on the collected data. 2.1. Participants Twenty six subjects participated in the study with no physical or Fig. 2. (a) The VR scene used in the experiment with the eye gaze points overlain as green circles. The green circles were not shown to the user, to avoid distraction.

---

Answer the question based on the above context: can you give me papers authored by Julie Iskander?
```

Response:

```
Based on the provided context, the paper(s) authored by Julie Iskander are:

1. Using biomechanics to investigate the effect of VR on eye vergence system (co-authored with Mohammed Hossny and Saeid Nahavandi)

The paper can be found in Volume 6, 2018, but the specific page number is not provided in the context. You may need to search for the title or authors to find the exact location of this paper.
Sources: ['data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:15:10', 'data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:15:11', 'data/1_Australian_bioinformatics_training_needs_survey_2021_22_Report.pdf:16:1', 'data/A_Review_on_Ocular_Biomechanic_Models_for_Assessing_Visual_Fatigue_in_Virtual_Reality.pdf:16:1', 'data/1-s2.0-S0003687018302904-main.pdf:2:3']
```

Unlike with mxbai-embed-large-v1, a relevant chunk with the added author information was retrieved. But I had annotated two papers, so it was only half right (better than the previous models though).

Salesforce/SFR-Embedding-Mistral didn't do too well, despite being higher ranked and larger than Alibaba-NLP/gte-large-en-v1.5.

edoyango commented 2 months ago

Ok, I've now understood that to use HuggingFaceEmbeddings, the model has to have a sentence-transformers version available. If it doesn't, langchain will convert the model to a sentence-transformer, but the converted model still needs to be trained (i.e., it will produce nonsense as-is).