fujimotos / polyleven

Fast Levenshtein Distance Library for Python 3
https://ceptord.net
MIT License

Question: How to apply the function for 100k datapoints? #11

Closed by JDRanpariya 1 year ago

JDRanpariya commented 1 year ago

Is there a way to pass a pandas DataFrame or a Python list?

fujimotos commented 1 year ago

It depends on what you want to do. Here are some code examples:

import random

import pandas as pd
from polyleven import levenshtein

def getdata():
    # 100k random 10-character DNA strings
    return ["".join(random.choices("AGCT", k=10)) for _ in range(100000)]

# Python list: distance from each item to a fixed query string
dataset = getdata()
distances = [levenshtein(item, "ATACAAACTC") for item in dataset]

# Pandas DataFrame: row-wise distance between columns "a" and "b"
df = pd.DataFrame({"a": getdata(), "b": getdata()})
df["distance"] = df.apply(lambda x: levenshtein(x.a, x.b), axis=1)

100k entries is really not much. It should take less than a second to process on a consumer CPU.
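
If the row-wise `df.apply(..., axis=1)` call turns out to be the slow part, one possible variation is to iterate over the two columns directly with `zip`. This is only a sketch: the helper name `pairwise_distances` is made up for illustration, and the pure-Python fallback exists solely so the snippet runs when polyleven is not installed (it is much slower than the library).

```python
import random
import time

try:
    from polyleven import levenshtein
except ImportError:
    # Fallback for illustration only: a plain dynamic-programming
    # Levenshtein distance, far slower than polyleven.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]


def pairwise_distances(left, right):
    """Hypothetical helper: distance between left[i] and right[i] for each i."""
    return [levenshtein(a, b) for a, b in zip(left, right)]


def getdata(n):
    # n random 10-character DNA strings
    return ["".join(random.choices("AGCT", k=10)) for _ in range(n)]


left, right = getdata(1000), getdata(1000)

start = time.perf_counter()
distances = pairwise_distances(left, right)
elapsed = time.perf_counter() - start
print(f"{len(distances)} pairs in {elapsed:.4f}s")
```

With a DataFrame, the same helper could be used as `df["distance"] = pairwise_distances(df["a"], df["b"])`, which avoids building a Series object for every row. Whether that is noticeably faster depends on the data and the pandas version, so it is worth timing both variants on your own dataset.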