cfiltnlp / pyiwn

A Python based API to access Indian language WordNets.
http://www.cfilt.iitb.ac.in/
Creative Commons Attribution Share Alike 4.0 International

Searcher usage #1

Closed alvations closed 4 years ago

alvations commented 6 years ago

Nice work! Just a question on pyiwn.util.Searcher?

What does the object do? I might have missed it but I can't find the use in the code base.

riteshpanjwani commented 6 years ago

Thanks! Yes, it is currently not used in the code base. I wanted to do a binary search in the file instead of the line-by-line linear search that is used now. That would definitely reduce the time spent searching for synsets.

Can you suggest an alternative for quick searching in files?
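The binary search idea mentioned above could look roughly like the following. This is only a minimal sketch, not the code in pyiwn.util.Searcher: it assumes the synset IDs are integers in the first tab-separated column, and the function names and sample rows are hypothetical.

```python
import bisect


def build_index(lines):
    """Parse tab-separated lines into parallel lists sorted by synset ID."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the first column is an integer synset ID.
        synset_id = int(line.split("\t", 1)[0])
        records.append((synset_id, line))
    records.sort(key=lambda record: record[0])
    ids = [record[0] for record in records]
    rows = [record[1] for record in records]
    return ids, rows


def find_synset(ids, rows, synset_id):
    """Binary-search the sorted IDs; return the matching line or None."""
    pos = bisect.bisect_left(ids, synset_id)
    if pos < len(ids) and ids[pos] == synset_id:
        return rows[pos]
    return None


# Hypothetical sample rows: synset_id, lemma names, gloss:example, pos.
sample = [
    "30\tlemma_c\tgloss c:example c\tNOUN",
    "10\tlemma_a,lemma_a2\tgloss a:example a\tNOUN",
    "20\tlemma_b\tgloss b:example b\tVERB",
]
ids, rows = build_index(sample)
match = find_synset(ids, rows, 20)
missing = find_synset(ids, rows, 15)
```

Each lookup then needs about log2(n) comparisons instead of a scan over all n lines, at the cost of sorting once up front.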

alvations commented 6 years ago

Hashing it and mapping it into memory would improve the search. But I think reading the tab-separated files as DataFrames with pandas, sframe or dask might be easier to handle. Reading the files can be really fast with the C-based read_csv() functions from DataFrame libraries, and querying would be somewhat vectorized, so search is faster too.
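For comparison, the hashing option could be sketched with a plain in-memory dict (not an mmap-backed structure). The function name and sample rows are hypothetical, and the four-column layout (synset ID, comma-separated lemma names, gloss:example, POS) is an assumption about the data files.

```python
from collections import defaultdict


def build_lemma_index(lines):
    """Map each lemma to the list of synsets it appears in."""
    index = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            # Skip malformed rows instead of failing on them.
            continue
        synset_id, lemma_names_str, gloss_example, pos = fields
        for lemma in lemma_names_str.split(","):
            index[lemma].append((synset_id, pos, gloss_example))
    return index


# Hypothetical sample rows, including one malformed line.
sample = [
    "10\tlemma_a,lemma_a2\tgloss a:example a\tNOUN",
    "20\tlemma_b,lemma_a\tgloss b:example b\tVERB",
    "this line has no tabs",
]
lemma_index = build_lemma_index(sample)
hits = lemma_index.get("lemma_a", [])
```

Lookups by lemma are then average-case constant time, but the whole index lives in memory and has to be rebuilt (or pickled) for each session.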

alvations commented 6 years ago

Here's an example of reading it into a dataframe; afterwards users can index and query the dataframe as needed:

import pandas as pd
import os
import re
from pathlib import Path

USER_HOME = str(Path.home())
SYNSET_DIR = USER_HOME + '/pyiwn_data/synsets/'

gloss_example_regex = re.compile(r"(.*)\:(.*)[\"]?")

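def split_gloss_example(ge):
    # Missing values arrive from read_csv as NaN (a float); treat them as empty.
    if ge is None or isinstance(ge, float) or ge == 'nan':
        return None, None
    result = gloss_example_regex.search(ge)
    if result:
        return result.group(1), result.group(2)
    # No ':' separator found: keep the whole string as the gloss.
    return ge, None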

dfs = {}

for filename in sorted(os.listdir(SYNSET_DIR)):
    if filename.startswith('all.'):
        print(filename)
        langname = filename[4:]
        header_row = 'synset_id, lemma_names_str, gloss_example, pos'.split(', ')
        df_lang = pd.read_csv(SYNSET_DIR + filename, 
                              sep='\t', 
                              on_bad_lines='skip',  # pandas >= 1.3; error_bad_lines was removed in pandas 2.0
                              header=None, 
                              names=header_row)

        df_lang['lemma_names'] = df_lang['lemma_names_str'].astype(str).apply(lambda x: x.split(','))
        df_lang['head_word'] = df_lang['lemma_names'].apply(lambda x: x[0])
        df_lang['language'] = langname

        df_lang['ge'] = df_lang['gloss_example'].apply(split_gloss_example)
        df_lang[['gloss', 'example']] = df_lang['ge'].apply(pd.Series) # Kinda slow process...
        df_lang['idx'] = df_lang['head_word'].astype(str) + '.' + df_lang['pos'].astype(str) + '.' + df_lang['synset_id'].astype(str)
        dfs[langname] = df_lang

df_iwn = pd.concat(dfs.values())
df_iwn_clean = df_iwn[['language', 'synset_id', 'pos', 'head_word', 'lemma_names', 'gloss', 'example']]
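Once the combined dataframe exists, indexing and querying it might look like the sketch below. The small frame here is a hypothetical stand-in with the same column names as df_iwn_clean; the language names and lemmas are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_iwn_clean (same column names, made-up rows).
df = pd.DataFrame({
    "language": ["hindi", "hindi", "marathi"],
    "synset_id": [10, 20, 10],
    "pos": ["NOUN", "VERB", "NOUN"],
    "head_word": ["lemma_a", "lemma_b", "lemma_c"],
    "lemma_names": [["lemma_a", "lemma_a2"], ["lemma_b"], ["lemma_c"]],
})

# Vectorized filter: all noun synsets for one language.
hindi_nouns = df[(df["language"] == "hindi") & (df["pos"] == "NOUN")]

# Index by (language, synset_id) so a single synset can be looked up directly.
by_id = df.set_index(["language", "synset_id"]).sort_index()
row = by_id.loc[("marathi", 10)]

# Find synsets that list a given lemma anywhere, not only as the head word.
has_lemma = df[df["lemma_names"].apply(lambda names: "lemma_a2" in names)]
```

The first two queries use vectorized comparisons and an index lookup. The last one still runs a Python function per row, so for frequent lemma lookups a prebuilt index is likely the better fit.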

riteshpanjwani commented 6 years ago

@alvations This is great! Thanks a lot for the implementation. I will add it to the code.