activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Apache License 2.0
8.16k stars, 625 forks

[BUG] Querying a string column is very slow #2972

Open moonvalley-matt opened 3 weeks ago

moonvalley-matt commented 3 weeks ago

Severity

P1 - Urgent, but non-breaking

Current Behavior

I have a dataset of ~1M rows with a metadata column of np.str_ values. Loading this column takes about 4 seconds per 1,000 records, whereas an integer column loads all 1,000,000 records in a matter of seconds.

Steps to Reproduce

Create a dataset of 1,000,000 rows whose metadata columns are a mixture of strings and integers, then load a string column and an integer column and compare the load times.
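To make the comparison concrete, the timings could be normalized to seconds per 1,000 records, the unit used in this report. The following is only a minimal sketch using the standard library: `seconds_per_thousand` and `projected_seconds` are hypothetical helper names, and the two lambdas are stand-ins that should be replaced with the actual column reads from the dataset (those calls are intentionally not shown).

```python
import time
from typing import Callable, Sequence


def seconds_per_thousand(load: Callable[[], Sequence], repeats: int = 3) -> float:
    """Time `load()` and return the best-of-`repeats` cost in seconds per 1,000 records."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        records = load()
        elapsed = time.perf_counter() - start
        if len(records) == 0:
            raise ValueError("load() returned no records")
        best = min(best, elapsed / len(records) * 1000)
    return best


def projected_seconds(rate_per_thousand: float, total_rows: int) -> float:
    """Extrapolate a per-1,000-record rate to a column of `total_rows` rows."""
    return rate_per_thousand * total_rows / 1000


# Stand-in data (assumption): swap these lambdas for the real string and
# integer column reads from the dataset under test.
n = 10_000
string_column = [f"caption-{i}" for i in range(n)]
integer_column = list(range(n))

string_rate = seconds_per_thousand(lambda: list(string_column))
integer_rate = seconds_per_thousand(lambda: list(integer_column))
print(f"strings:  {string_rate:.6f} s / 1,000 records")
print(f"integers: {integer_rate:.6f} s / 1,000 records")

# The rate reported in this issue, projected to the full 1,000,000 rows.
print(projected_seconds(4.0, 1_000_000))  # 4000.0 seconds, roughly 67 minutes
```

Taking the best of several runs reduces noise from caching and other processes; reporting both rates in the same unit makes the gap between string and integer columns directly comparable.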

Expected/Desired Behavior

Strings should load approximately as fast as integers. If that is not possible, are there other recommendations? I am trying to understand the nature of the problem.

Python Version

No response

OS

No response

IDE

No response

Packages

No response

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

davidbuniat commented 2 weeks ago

Thanks @moonvalley-matt. I believe this has also been reported by our users. If you change the column to htype="text", loading should be much faster.

@levonohanyan is looking into making the performance uniformly fast across all string types.

levonohanyan commented 2 weeks ago

Hi @moonvalley-matt,

The issue does not seem to be generally reproducible and appears to depend on the specific version of deeplake, Python, or NumPy. Could you please share the versions you used? A script that reproduces the problem would be even better.
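One possible way to collect the requested details is a short script like the one below. It is only a sketch that relies on the Python standard library; `package_version` and `environment_report` are hypothetical helper names, and the package list simply mirrors the ones mentioned in this comment.

```python
import platform
import sys
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of `name`, or a marker if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report(packages=("deeplake", "numpy")) -> str:
    """Build a short, paste-ready summary of the interpreter, OS, and packages."""
    lines = [
        f"python: {sys.version.split()[0]}",
        f"os: {platform.platform()}",
    ]
    lines += [f"{name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


print(environment_report())
```

The output can be pasted directly into the Python Version, OS, and Packages fields of the issue.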

Regards, Levon