UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Unstable predictions from ms-marco-MiniLM-L-12-v2 #1128

Closed klimentij closed 3 years ago

klimentij commented 3 years ago

Thank you for this great library and pre-trained models! Just wanted to share our observations, because you might find it helpful if you decide to train the next version of MiniLM cross-encoder.

We had been using cross-encoder/ms-marco-electra-base in production for some time and recently moved to cross-encoder/ms-marco-MiniLM-L-12-v2. Unfortunately, we started seeing unstable behavior when queries are proper names, so we had to switch back to Electra. After an investigation, we found a number of query-document pairs (with totally irrelevant documents) where MiniLM tends to predict a score close to 1, while Electra's score is close to 0 (which is correct).

Here are some examples:

Query (any of the following): "Happy Money", "Pedego", "Syntiant"
Document: "FreeStyle Libre: Wireless Continuous Glucose Monitoring for Diabetics
FreeStyle Libre is offering glucose monitoring using just a phone. This can eliminate blood-test finger pricks a day when monitoring glucose. This device can free patients from the hassles of glucose monitoring by using just a small sensor automatically measures and continuously stores glucose readings day and night. Blood glucose is a critical metric for diabetes sufferers, and drawing blood is very inconvenient. The FreeStyle Libre Flash Glucose Monitoring System (Abbott) is a continuous glucose monitor. It clings to your arm with a tiny sensor that penetrates the skin. This sensor communicates with a digital reader that you wave over it. Sensors last for two weeks and need replacing. From the company literature: Daily diabetes monitoring hurts. If you or someone you love has diabetes, you're probably familiar with the tedious routine of glucose monitoring, the painful fingersticks to draw a drop of blood, and the bulky traditional glucose monitoring equipment requiring daily calibrations. These inconveniences can make it difficult to stick to a diabetes management plan, opening the door for complications to arise. What if you could take the pain and inconvenience out of glucose monitoring and experience a better way of managing the condition? For the 30.3 million Americans who have diabetes, the U.S. Food and Drug Administration's approval of the FreeStyle® Libre is that life-changing experience. The revolutionary system eliminates the hurdles of traditional glucose monitoring and requires no routine fingersticks or fingerstick calibrations. Across the globe, more than 400,000 people are using the FreeStyle Libre, and the system has been clinically proven to be accurate, stable and consistent. How does continuous glucose monitoring with the FreeStyle Libre System work? The FreeStyle Libre system measures glucose levels through a small sensor — the size of two stacked quarters — applied to the back of your upper arm. 
It provides real-time glucose readings for up to 10 days, both day and night. The sensor can also read glucose levels through clothes, making testing discreet and convenient. The FreeStyle Libre system provides three critical pieces of data with each scan: A real-time glucose result; An eight-hour historical trend; A directional trend arrow showing where glucose levels are headed. The touch-screen reader also holds up to 90 days of data, which allows people to track their glucose levels over time. How does the FreeStyle Libre System help improve treatment? The data generated by the FreeStyle Libre system is designed to provide actionable trends and patterns that help you make better decisions about your health, such as adjustments to your diet or how much insulin you need to take. For example, the reader's snapshots can reveal if a person is experiencing hypoglycemic trends (low glucose levels) patterns or hyperglycemic trends (high glucose levels), which can aid in choosing the right diabetes management. Studies show that FreeStyle Libre users who scan more frequently spend less time in hypoglycemia and experience improved average glucose levels. According to a study published in The Lancet, people using the FreeStyle Libre system spent 38 percent less time within hypoglycemia as compared with those who managed their glucose with a traditional self-monitoring glucose system." 
Query (any of the following): "Happy Money", "Pedego", "Syntiant"
Document: "GoodLands Project: Owning Land WIth Ecological Stewardship
The GoodLands Project seeks to mobilize the Catholic Church to use its land for good. They provide information and tools to help the Church use its property wisely to enhance its ministries and missions — to care for creation, to end homelessness, to welcome the stranger, to deliver programs and services to the right places and at the right times, and to support her own fiscal sustainability. To increase the Catholic Church’s understanding and ecological planning of its landholdings using geographic technologies and community involvement to demonstrate how these lands can be a means for positive global environmental and social change. The Church potentially controls the largest nongovernmental network of landholdings in the world. It is a steward of the Earth's lands. Our common purpose as land managers and owners is to ensure that our properties help us meet our programmatic and financial goals while maximizing a positive impact on the environment and our communities. Goodlands works with you to increase the value and positive impact of your land for the benefit of your community right now and for generations to come. Goodlands' work is grounded in science, driven by design, and inspired by Christian values of stewardship and charity. Caring for your home also means that you are caring for our common home. “Molly Burhans begin using geographic or geospatial information systems, GIS, to better plan wetland restoration on individual parcels of nearby land. GIS is a general term for a tool that allows users to store, analyze, and visualize layers of information over a map. Today the technology is used in a variety of ways - to predict earthquakes and the presence of oil reserves, model the impacts of global warming, and monitor the global spread of disease. Burhans founded GoodLands to use mapping technologies to do just that. 
Ultimately, she hopes that the data can be used to “create a greater sense of stewardship” in communities and promote positive change in planning for environmental and social change. To date, Goodlands has mapped 35,000 Church properties in the U.S. as part of their Catholic Geographic Information Systems Center (CGISC). To gather raw data, GoodLands partnered with parishioners, academic institutions, NGOs, and other social services. Today, GoodLands is announcing a partnership with ESRI, the global leader in GIS. With ESRI’s technology, GoodLands will be able to visualize and analyze the data to present a global landscape of holdings overlaid with the structure and population distribution of the one billion-plus Catholics globally.” [Citation: Link, Accessed: 9/04/2019]" 
Query (any of the following): "Happy Money", "Pedego", "Syntiant"
Document: "Discovery and development of single-subunit RNA polymerases for efficient RNA manufacturing
RNA plays key roles in cells and organisms including as a carrier of protein coding information and as a regulator of gene expressionRNA therapeutics and RNA based vaccines which exploit the natural functions of RNA in creating physiological responses designed to prevent or treat disease have received increasing attention in the past few years particularly for developing novel vaccines and for curing rare diseases caused by heritable genetic defectsThe expanded effort in RNA therapeutics and RNA vaccines has created a new demand for RNA molecules manufactured in large quantities to precise specificationsIn particular the need to create RNA molecules >kb in length and to incorporate modified nucleotides for more efficient delivery higher stability and better clinical efficacy has compounded this manufacturing problemAlthough RNAs have been produced enzymatic ally in vitro for several decades with the use of bacteriophage RNA polymerases the enzymes traditionally used to produce RNA for Randamp D purposes are not suited for the demand ing specifications that apply to RNA molecules intended for RNA therapeuticsA new class of enzymes highly optimized for synthesis of long RNAs with specific sequences and structures need to be created to meet this new demandIn this projectPrimordial Genetics aims to express purify and characterize known but so far untested single subunit RNA polymerases that can be used as starting reagents and genetic building blocks in the development of specialized RNA manufacturing enzymesWe will test different enzymes representing the natural diversity of bacteriophage RNA polymerases for their ability to meet the critical requirements for in vitro RNA synthesis including efficient high yield RNA synthesis incorporation of non natural nucleotides and high RNA qualityThe two best enzymes will be improved by mutagen es is based on structural modeling using the structural and functional information available for this class of enzymesThe proposed work 
is a feasibility study for isolating and developing novel enzymes suitable for RNA manufacturing and also for creating an enzyme development pipeline that can meet the varied needs for manufacturing a diversity of RNA sequences sizes and chemical structures represented in RNA vaccines and RNA therapeutic products under developmentThe enzymes discovered and improved in this work will be directly useful for RNA manufacturing applications and can be licensed or sold to companies developing RNA vaccines and therapeutics as well as companies building RNA manufacturing capabilities Project Narrative The principal aim of this project is to develop novel and improved RNA polymerases enzymes used for manufacturing RNARNA vaccines and RNA medicines have steadily gained attention and investment as potentially revolutionary ways of protecting against disease and treating rare diseases caused by genetic defectsOne of the challenges with RNA based medicines is the production of clinical quantities of intact high quality RNA that is modified by incorporation of non natural building blocks that serve to stabilize the RNA and increase its efficacy in the human bodyDevelopment of novel RNA polymerases will help accelerate the development and production of this novel and highly promising class of therapeutics and vaccinesThis project can impact the prevention and treatment of viral diseases such as HIV and Hepatitis Band genetic diseases such as cystic fibrosis" 

For all these examples, we get scores around 0.96-0.99 from cross-encoder/ms-marco-MiniLM-L-12-v2 and around 0.0001 from cross-encoder/ms-marco-electra-base.

nreimers commented 3 years ago

Hi, the v2 models were trained differently and no longer output scores between 0 and 1. Instead, they output the raw logits, which can take any value (but tend to be between -10 and 10).

So scores around 1 are really low for the v2 models.
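To make the difference between the two scales concrete, here is a minimal sketch of how a raw logit maps to a 0-1 score under a sigmoid. It uses only the Python standard library; the `sigmoid` helper is written for illustration and is not taken from sentence-transformers, and the sample logits are simply values mentioned in this thread.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to the (0, 1) range in a numerically stable way."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative inputs, exp(logit) is small, which avoids overflow.
    z = math.exp(logit)
    return z / (1.0 + z)


# Illustrative logits only: the model's typical range plus values from this thread.
for logit in (-10.0, -9.4, 0.0, 1.0, 8.0, 10.0):
    print(f"logit {logit:6.1f} -> sigmoid {sigmoid(logit):.6f}")
```

Read as a raw logit, a value of 1 corresponds to roughly 0.73 after a sigmoid, whereas a logit of 8 or more lands above 0.999 and a logit of -9.4 lands below 0.0001. Because the sigmoid is monotonic, applying it or not does not change the ranking of documents for a given query.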

klimentij commented 3 years ago

Right, but I'm using it with sentence-transformers like this:

from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

So, as far as I can see in your code, it applies a sigmoid to the output logits in predict.

I also have an ONNX version of this model without sigmoid, and it produces high values (around 8-10) on these pairs.

nreimers commented 3 years ago

For these models, the activation function is the identity, provided you use a recent version of sentence-transformers: https://github.com/UKPLab/sentence-transformers/blob/7451b0fca949721eacd37d0df6360096b6b0f222/sentence_transformers/cross_encoder/CrossEncoder.py#L66
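Since the behavior depends on the installed release, a quick check of the installed version can help. The snippet below is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the `installed_version` helper is a hypothetical name, and the snippet makes no assumption about which release changed the default activation.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "sentence-transformers") -> Optional[str]:
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package is not installed.
print(installed_version())
```

If the reported version is old, upgrading and re-running the same query-document pairs is the simplest way to confirm whether the scores are sigmoid outputs or raw logits.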

I can also recommend using the L6 version of MiniLM; it works better.

Here is a Colab: https://colab.research.google.com/drive/1IBWQ8oCCbeF4U-lv5Gea61JvYj2TulGd?usp=sharing

For the first document and query, it outputs a really low score of -9.4.

klimentij commented 3 years ago

Oh, I'm using version 1.0.2, that's why.

Thanks for suggesting L6, I'll give it a try!