N-gram Algorithm - Githubissues

DevShivmohan commented 1 year ago

implement n-gram algorithm for search in MySQL

Create table with enabling fulltext search for two columns name and bio.
For more reference click here

CREATE TABLE characters (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  bio VARCHAR(255) NOT NULL,
  FULLTEXT (name,bio) WITH PARSER ngram
);

Insert some records.
Than fetch records by searching the query. syntax

select * from characters where match(col_1,col_2,col_n) against(<search_keyword>);

examples

select * from characters where match(name,bio) against('potter');

and this query also return records with greater matches in descending order

SELECT id, name, bio,
    MATCH(name,bio) 
    AGAINST('harry') as score
FROM characters 
WHERE 
    MATCH(name,bio) 
    AGAINST('harry')  ORDER BY score DESC;

In MySQL by default

the default token_size is set to 2 by MySQL, the following can be changed in /etc/mysql/my.cnf under the below mentioned property.

[mysqld]
innodb_ft_min_token_size = 2

In Java n-gram implementation

/**
     * 
     * @param tokenSize n-gram(where n=1,2,3,....n)
     * @param keyword any search keyword
     * @return
     */
    public Set<String> nGramTokens(int tokenSize,String keyword){
        Set<String> ngrams=new HashSet<>();
        for(int i=0;i<=keyword.length()-tokenSize;i++)
            ngrams.add(keyword.substring(i,i+tokenSize));
        return ngrams;
    }

For example purpose data set


CREATE TABLE characters (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  bio VARCHAR(255) NOT NULL,
  FULLTEXT (name,bio) WITH PARSER ngram
);

INSERT INTO characters VALUES
(1,'Regulus Arcturus Black','Brother of Sirius. Used to be a Death Eater but defected.'),
(2,'Sirius Black','Best friend of James Potter and godfather of Harry.'),
(3,'Lavender Brown','Killed by a werewolf. She was a gryffindor student who dated Ron. '),
(4,'Cho Chang','Ravenclaw student who dated Cedric Diggory and Harry Potter.'),
(5,'Vincent Crabbe Sr.','Father of Crabbe and death-eater who escaped Azkaban.'),
(6,'Vincent Crabbe','Slytherin student who was best friends with Goyle and followed Draco.'),
(7,'Bartemius \"Barty\" Crouch Sr.','Head of the department of Internation Magical Cooperation. Killed by his son.'),
(8,'Bartemius \"Barty\" Crouch Jr.','Death Eater who impersonated Alastor Moody.'),
(9,'Fleur Delacour','Participated in the Triwizard tournament and married Bill Weasley.'),
(10,'Cedric Diggory','Participated in the Triwizard tournament and got killed by Voldemort.'),
(11,'Alberforth Dumbledore','Albus'' brother and owner of Hog''s Head.'),
(12,'Albus Dumbledore','Headmaster of Hogwards killed by Snape.'),
(13,'Dudley Dursley','Muggle son of Vernon and Petunia and first-cousin of Harry.'),
(14,'Petunia Dursley','Harry''s aunt and sister of Lily.'),
(15,'Vernon Dursley','Harry''s muggle uncle.'),
(16,'Argus Filch','Squib caretake of Hogwards.'),
(17,'Seamus Finnigan','Harry''s friend and member of Dumbledore''s army.'),
(18,'Nicolas Flamel','Creator of the Philosopher''s Stone.'),
(19,'Cornelius Fudge','Minister of Magic that was forced to resign.'),
(20,'Goyle Sr.','Death Eater and father of Gregory Goyle.'),
(21,'Gregory Goyle','Best friend of Crabbe. Slytherin student and dies by falling into Fiendfyre.'),
(22,'Hermione Granger','One of Harry''s best friend and marries Ron Weasley.'),
(23,'Rubeus Hagrid','Half-giant who loves Harry. He was the keeper of Keys and Grounds at Hogwards.'),
(24,'Igor Karkaroff','Highmaster of Durmstrang and reformed death-eater.'),
(25,'Viktor Krum','Participant in the Triwizard tournament. Dated Hermione.'),
(26,'Bellatrix Lestrange','Death Eater who was killed by Molly Weasley.'),
(27,'Alice Longbottom','Mother of Neville who was tortured by Bellatrix.'),
(28,'Frank Longbottom','Father of Neville who was tortured by Bellatrix.'),
(29,'Neville Longbottom','Gryffindor student who was a member of Dumbledore''s army.'),
(30,'Luna Lovegood','Ravenclaw student who was a member of Dumbledore''s army.'),
(31,'Xenophilius Lovegood','Father of Luna and editor of The Quibbler.'),
(32,'Remus Lupin','Friend of James Potter and werewolf. He married Nymphadora.'),
(33,'Draco Malfoy','Slytherin student who had many arguments with Harry.'),
(34,'Lucius Malfoy','Father of Draco and influential Death-Eater.'),
(35,'Narcissa Malfoy','Mother of Draco and sister of Bellatrix.'),
(36,'Olympe Maxime','Half-giantess and headmistress of Beauxbatons.'),
(37,'Minerva McGonagall','Professor of Transfiguration and head of Gryffindor.'),
(38,'Alastor \"Mad-Eye\" Moody','Retired auror and member of the order of the Phoenix. Killed by Voldemort.'),
(39,'Peter Pettigrew','Betrays James and Lily Potter. Follower of Voldemort.'),
(40,'Harry Potter','The boy who lived. Main character of the series.'),
(41,'James Potter','Father of Harry. Murdered by Voldemort.'),
(42,'Lily Potter','Mother of Harry. Murdered by Voldemort.'),
(43,'Quirinus Quirrell','Possessed by Voldemort. Defence against the Dark Arts professor.'),
(44,'Tom Riddle Sr.','Muggle father of Voldemort who was killed by him.'),
(45,'Mary Riddle','Muggle mother of Voldemort who was killed by him.'),
(46,'Lord Voldemort','The antagonist of the series who murdered many.'),
(47,'Rita Skeeter','Reporter at the Daily Prophet.'),
(48,'Severus Snape','Head of the Slytherin house and saved Harry in many occasions.'),
(49,'Nymphadora Tonks','Married Remus Lupin and was killed by Bellatrix.'),
(50,'Dolores Janes Umbridge','Senior undersecretary to the Ministry of Magic. Eventually sent to Azkaban.'),
(51,'Arthur Weasley','Father of the Weasleys and member of the Order of the Phoenix.'),
(52,'Bill Weasley','Oldest son of Arthur and Molly. Husband of Fleur. '),
(53,'Charlie Weasley','Second son of Arthur and Molly. Works with dragons in Romania.'),
(54,'Fred Weasley','Identical twin with George and co-owner of Weasleys'' Wizard Wheezes'),
(55,'George Weasley','Identical twin with Fred and co-owner of Weasleys'' Wizard Wheezes'),
(56,'Ginny Weasley','Marries Harry Potter and only daughter of Molly and Arthur.'),
(57,'Molly Weasley','Wife of Arthur and mother of the Weasleys. Kills Bellatrix.'),
(58,'Percy Weasley','Third son of Arthur and Molly. He is a Gryffindor prefect.'),
(59,'Ron Weasley','Harry''s best friend. Marries Hermione.'),
(60,'Dobby','House elf and friend of Harry. He is killed by Bellatrix.'),
(61,'Fluffy','Three-headed dog belonging to Rubeus Hagrid.'),
(62,'Hedwig','Harry''s owl.'),
(63,'Moaning Myrtle','Ghost at Hogwards.'),
(64,'Aragog','Acromantula belonging to Rubeus Hagrid.'),
(65,'Grawp','Giant-half brother of Hagrid.');

Enabling n-gram(tri_gram) in PostgreSQL

Enable tri-gram extension by cmd CREATE EXTENSION pg_trgm;.
Applying tri-gram syntax select <column_1>,<column_2>,<column_n> from <table_name> order by SIMILARITY(<column_1> || ' ' || <column_2> || ' ' || <column_n>,<search_string>) desc;
Example select * from test order by SIMILARITY(name || ' ' || bio,'HARRY') desc;

DevShivmohan commented 1 year ago

Example of n-gram implementation as a student search via his name


import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.util.*;

@Data
@AllArgsConstructor
@NoArgsConstructor
class Student{
    private int id;
    private String name;
    private String address;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
class FilterStudents{
    private Student student;
    private int priority;
}

public class NGramSearch{

    private List<Student> students=new ArrayList<>();
    public static void main(String[] args) {
        var rand=new NGramSearch();
        rand.filterStudents("Shvimohn").forEach(System.out::println);
    }

    /**
     *
     * @param tokenSize n-gram(where n=1,2,3,....n)
     * @param keyword any search keyword
     * @return
     */
    public Set<String> nGramTokens(int tokenSize,String keyword){
        Set<String> ngrams=new HashSet<>();
        for(int i=0;i<=keyword.length()-tokenSize;i++)
            ngrams.add(keyword.substring(i,i+tokenSize));
        return ngrams;
    }

    public List<FilterStudents> filterStudents(String address){
        var tokens=nGramTokens(2,address);
        List<FilterStudents> filterStudents=new ArrayList<>();
        students.parallelStream().forEach((student -> {
            var filteredStudent=new FilterStudents(student,0);
            tokens.parallelStream().forEach((token->{
                if(student.getName().toLowerCase().contains(token.toLowerCase()))
                    filteredStudent.setPriority(filteredStudent.getPriority()+1);
            }));
            filterStudents.add(filteredStudent);
        }));
        filterStudents.sort(Comparator.comparing(FilterStudents::getPriority).reversed());
        return filterStudents;
    }

    public NGramSearch(){
        students.add(new Student(1,"Shivmohan","Siddharth nagar"));
        students.add(new Student(2,"Vipul","Varansi"));
        students.add(new Student(3,"Ram kumar","Sinch in india"));
        students.add(new Student(4,"Deepak","Lalpur naugarh"));
        students.add(new Student(5,"Raman","Buddh nagar india"));
        students.add(new Student(6,"Shivmohan","Kushi nagar"));
        students.add(new Student(7,"d","Nakha jungle"));
        students.add(new Student(8,"fr","Kushi nagar"));
        students.add(new Student(9,"se","Dift nagar"));
        students.add(new Student(10,"sddsd","fatehpur"));
    }
}

DevShivmohan commented 1 year ago

Hapi fhir postgres query

-- select * from hfj_spidx_string;

select hss.sp_id,hss.res_id,hss.sp_name,hss.sp_value_exact,hst.sp_name,hst.sp_value from hfj_spidx_string as hss
join hfj_spidx_token as hst on hss.res_id=hst.res_id
where hss.sp_name in ('given','name') 
and hst.sp_name in ('gender')
and hss.res_type='Patient'
order by SIMILARITY(hss.sp_value_exact || '' || hss.res_id || '' || hst.sp_value,'jam' || '' || 056 || '' || 'male') desc;

Sub query

select hrv.res_id,hrv.res_text_vc from hfj_res_ver as hrv join hfj_spidx_string as hss on hrv.res_id=hss.res_id
join hfj_spidx_token as hst on hss.res_id=hst.res_id
where hss.sp_name in ('given','name','address-postalcode','address-state','address','address-city') 
and hst.sp_name in ('gender')
and hss.res_type='Patient'
order by SIMILARITY(hss.sp_value_exact || '' || hss.res_id || '' || hst.sp_value,
                    'jam' || '' || 056 || '' || 'male' || '' || 'Siddharth nagar') desc;

Final query

select hss.sp_name,hss.sp_value_exact,hrv.res_id,hrv.res_text_vc from hfj_res_ver as hrv join
hfj_spidx_string as hss on hss.res_id=hrv.res_id join
hfj_spidx_token as hst on hst.res_id=hrv.res_id join
hfj_spidx_date as hsd on hsd.res_id=hrv.res_id
where hss.sp_name in ('given','name','address-postalcode','address','address-city') -- replace with blank quote
and hst.sp_name='gender' and hst.sp_value='male'
and hss.res_type='Patient'
and hsd.sp_value_low between '1900-02-01' and '2000-05-04' and hsd.sp_name='birthdate'
order by SIMILARITY(hss.sp_value_exact,'shiv') desc; -- concat with all values in like shiv

DevShivmohan / Learning-everything

N-gram Algorithm #30

implement n-gram algorithm for search in MySQL

In Java n-gram implementation

Enabling n-gram(tri_gram) in PostgreSQL

Example of n-gram implementation as a student search via his name

Hapi fhir postgres query

Sub query

Final query