Thanks! Yes, currently it is not used in the code base. I wanted to do a binary search in the file instead of the line-by-line linear search (currently the case). That would definitely reduce the time spent searching the synsets.
Can you suggest an alternative for quick searching in files?
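One alternative to a per-lookup binary search over raw bytes (which is fiddly to get right because seeks land mid-line) is a one-time linear scan that records each line's byte offset and sort key, after which every lookup is an O(log n) `bisect` plus a single `seek`. This is only a hypothetical sketch, assuming the synset files are tab-separated and sorted by their first field; `build_line_index` and `lookup` are made-up helper names, not part of pyiwn:

```python
import bisect
import os
import tempfile

def build_line_index(path):
    """One-time linear scan: record each line's start offset and its sort key
    (the first tab-separated field). Lookups afterwards are O(log n)."""
    offsets, keys = [], []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            keys.append(line.split(b'\t', 1)[0].decode('utf8'))
            pos += len(line)
    return offsets, keys

def lookup(path, offsets, keys, key):
    """Binary-search the precomputed keys, then seek straight to the line.
    Assumes the file is sorted by its first field; returns None on a miss."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    with open(path, 'rb') as f:
        f.seek(offsets[i])
        return f.readline().decode('utf8').rstrip('\n')

# Demo on a tiny sorted TSV file (hypothetical data):
with tempfile.NamedTemporaryFile('wb', delete=False, suffix='.tsv') as tmp:
    tmp.write(b'apple\t1\nbanana\t2\ncherry\t3\n')

offsets, keys = build_line_index(tmp.name)
hit = lookup(tmp.name, offsets, keys, 'banana')   # 'banana\t2'
miss = lookup(tmp.name, offsets, keys, 'durian')  # None
os.remove(tmp.name)
```

The index costs one full pass and keeps only small keys and integer offsets in memory, not the whole file, so it amortizes well when many lookups follow.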
Hashing the file and mapping it into memory would improve the search, but I think reading the tab-separated files as DataFrames with pandas, sframe, or dask might be easier to handle. Reading the file can be really fast with the C-based read_csv() functions from these DataFrame libraries, and querying is vectorized, so search is faster too.
Here's an example of reading the files into a dataframe; afterwards, users can index and query the dataframe for any later usage:
import os
import re
import pandas as pd
from pathlib import Path

USER_HOME = str(Path.home())
SYNSET_DIR = USER_HOME + '/pyiwn_data/synsets/'

# Everything before the last ':' is the gloss, the rest is the example.
gloss_example_regex = re.compile(r'(.*)\:(.*)[\"]?')

def split_gloss_example(ge):
    """Split a 'gloss:example' cell; returns ('', '') for empty/NaN cells."""
    if ge is None or isinstance(ge, float) or ge == 'nan':
        return '', ''
    result = gloss_example_regex.search(ge)
    if result:
        return result.group(1), result.group(2)
    return ge, ''  # no ':' separator; treat the whole cell as the gloss

dfs = {}
for filename in sorted(os.listdir(SYNSET_DIR)):
    if filename.startswith('all.'):
        print(filename)
        langname = filename[4:]
        header_row = 'synset_id, lemma_names_str, gloss_example, pos'.split(', ')
        df_lang = pd.read_csv(SYNSET_DIR + filename,
                              sep='\t',
                              error_bad_lines=False,  # on_bad_lines='skip' in pandas>=1.3
                              header=None,
                              names=header_row)
        df_lang['lemma_names'] = df_lang['lemma_names_str'].astype(str).apply(lambda x: x.split(','))
        df_lang['head_word'] = df_lang['lemma_names'].apply(lambda x: x[0])
        df_lang['language'] = langname
        df_lang['ge'] = df_lang['gloss_example'].apply(split_gloss_example)
        df_lang[['gloss', 'example']] = df_lang['ge'].apply(pd.Series)  # Kinda slow process...
        df_lang['idx'] = df_lang['head_word'].astype(str) + '.' + df_lang['pos'].astype(str) + '.' + df_lang['synset_id'].astype(str)
        dfs[langname] = df_lang

df_iwn = pd.concat(dfs.values())
df_iwn_clean = df_iwn[['language', 'synset_id', 'pos', 'head_word', 'lemma_names', 'gloss', 'example']]
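Once the concatenated frame is built, the composite `idx` column can serve as a unique index, so `.loc` lookups become hash-based instead of scanning every row. A small sketch with toy stand-in rows (hypothetical data; the real frame comes from the loop above):

```python
import pandas as pd

# Toy stand-in for the concatenated df_iwn built above (hypothetical rows).
df_iwn = pd.DataFrame({
    'language': ['hindi', 'hindi', 'marathi'],
    'synset_id': [1, 2, 1],
    'pos': ['noun', 'verb', 'noun'],
    'head_word': ['w1', 'w2', 'w3'],
})

# Rebuild the same composite key as the `idx` column and make it the index.
df_iwn['idx'] = (df_iwn['head_word'] + '.' + df_iwn['pos']
                 + '.' + df_iwn['synset_id'].astype(str))
df_indexed = df_iwn.set_index('idx')

# Hash-based lookup by key, e.g. 'w1.noun.1' -> the hindi row.
row = df_indexed.loc['w1.noun.1']

# Vectorized filtering also avoids Python-level line-by-line loops.
nouns = df_iwn[df_iwn['pos'] == 'noun']
```

Setting the index once up front means repeated synset lookups pay only the hash-table cost, which is the same speed-up the binary-search idea was after.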
@alvations This is great! Thanks a lot for the implementation. I will add it to the code.
Nice work! Just a question on pyiwn.util.Searcher: what does the object do? I might have missed it, but I can't find its use in the code base.