curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Since build 1.0.38482, splitting a text into Spans is no longer deterministic. #100

Open jude-fisher-data opened 10 months ago

jude-fisher-data commented 10 months ago

Describe the bug

The Spans property of an IDocument produces the list of spans within a document. This should be deterministic: splitting the same IDocument any number of times should produce the same result, and creating an IDocument from identical text should always yield the same Spans collection. This works correctly up to and including NuGet package version 1.0.38431. From v1.0.38482 through the current version, identical inputs produce different results from run to run.

To Reproduce

Expected behavior

Sample Outputs (first few lines for identical text input, traced to the Visual Studio Debug window; the IDocument is created, then its Spans property is accessed)

FAULTY (Build: 1.0.38482)

RUN A:
09:05:41:328 What We Offer
09:05:41:328 Create more personal computing.
09:05:41:578 Reinvent productivity and business processes.
09:05:41:578 Build the intelligent cloud and intelligent edge platform.
09:05:41:578 To achieve our vision, our research and development efforts focus on three interconnected ambitions:
09:05:41:578 Founded in 1975, we develop and support software, services, devices, and
09:05:41:578 solutions that deliver new value for customers and help people and businesses realize their full potential.
09:05:41:578 We're committed to making the promise of AI real and doing it responsibly.
09:05:41:578 At Microsoft, we provide technology and resources to help our customers create a secure
09:05:41:578 Our work is guided by a core set of principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability.
09:05:41:578 , productive work environment.

RUN B:
09:05:42:082 What We Offer
09:05:42:082 Create more personal computing.
09:05:42:082 Build the intelligent cloud and intelligent edge platform.
09:05:42:082 Reinvent productivity and business processes.
09:05:42:082 To achieve our vision, our research and development efforts focus on three interconnected ambitions:
09:05:42:082 Founded in 1975, we develop and support software, services, devices, and solutions that deliver new value for customers and help people and businesses realize their full potential.
09:05:42:082 We offer an array of services, including cloud-based solutions that provide customers with software, services, platforms, and content, and we provide solution support and consulting services.
09:05:42:082 At Microsoft, we provide technology and resources to help our customers create a secure, productive work environment.

CORRECT (Build: 1.0.34831)

Text is identical with each run:
09:14:11:865 What We Offer
09:14:11:865 Create more personal computing.
09:14:12:109 Build the intelligent cloud and intelligent edge platform.
09:14:12:109 Reinvent productivity and business processes.
09:14:12:109 To achieve our vision, our research and development efforts focus on three interconnected ambitions:
09:14:12:109 Founded in 1975, we develop and support software, services, devices, and solutions that deliver new value for customers and help people and businesses realize their full potential.
09:14:12:109 At Microsoft, we provide technology and resources to help our customers create a secure, productive work environment.
09:14:12:109 Our family of products plays a key role in the ways the world works, learns, and connects.
09:14:12:109 We're committed to making the promise of AI real and doing it responsibly.
09:14:12:109 We offer an array of services, including cloud-based solutions that provide customers with software, services, platforms, and content, and we provide solution support and consulting services.