ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0
2.05k stars, 66 forks

[Query] Reading a large delimited text file of size 75 GB #65

Closed: crbsram closed this 6 months ago

crbsram commented 8 months ago

Hi Ash, sorry for the spam if this query has already been answered. Here is my use case. I have a 75 GB file with 500 million lines. For every line, I need to scan all 500 million lines to run some checks and prepare a result. Sorting is not required in our case. We have modern hardware with 64 cores, 512 GB of RAM, and an NVMe SSD with 10 GB/s throughput. I would like to leverage the features of modern hardware rather than take the traditional approach. In fact, I have loaded the data into a Postgres DB, but iterating over the 500 million records takes a huge amount of time, even with 100 parallel connections. After watching the talk on YouTube, I want to explore whether StringZilla is a fit for our use case. Please let us know any pointers. Also, if this is possible with StringZilla, is there a sample reference implementation to start with?

Thanks C.R.Bala

ashvardanian commented 8 months ago

Hi, @crbsram, thanks for reaching out! Everything depends on the complexity of your checks, but I don't see a reason why this wouldn't work with StringZilla.

How long are your rows on average? Does your CPU support AVX-512?

crbsram commented 8 months ago

Hi Ash, every line is at most 3,000 characters, and for every line I have to scan all 500 million lines. The CPU is an Intel Xeon Gold 6330. Every line is delimited with a special character. For every line, we have to iterate over all 500 million lines as quickly as possible. Hence the question: will StringZilla help us achieve this by leveraging the capabilities of modern hardware?
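To make the setup concrete, here is a minimal sketch of streaming such a file without loading all 75 GB into RAM, using only the Python standard library rather than StringZilla itself. The function name `iter_lines`, the newline record separator, and the `|` field delimiter are assumptions for illustration; the thread does not say which special character is used.

```python
import mmap


def iter_lines(path, delimiter=b"|"):
    """Yield each line of the file as a list of byte fields.

    The file is memory-mapped, so the OS pages data in on demand
    instead of reading the whole file up front.
    Note: mmap cannot map an empty file and raises ValueError for one.
    """
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        size = len(mm)
        while start < size:
            # Find the end of the current line (or the end of the file).
            end = mm.find(b"\n", start)
            if end == -1:
                end = size
            # Slicing copies only this one line (at most ~3,000 bytes here).
            yield mm[start:end].split(delimiter)
            start = end + 1
```

Each worker process could map the same file and handle its own byte range, which is one way to spread the scan across 64 cores.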

Thanks C.R.Bala

ashvardanian commented 8 months ago

Yes, @crbsram! This sounds like a good case for StringZilla. I have to warn you that even with StringZilla, 5e8 x 5e8 pairwise string operations won't be very fast. I recommend setting up a smaller benchmark first. Let me know if this works out.
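A back-of-the-envelope estimate shows why a smaller benchmark is worth running first: 500 million lines compared against each other is about 2.5e17 pairs. The sketch below turns a measured throughput into wall-clock days. The function name `pairwise_days` and the rate of 1,000,000 pairs per second per core are placeholders, not measurements; substitute whatever the small benchmark reports.

```python
def pairwise_days(n_lines, pairs_per_second_per_core, cores):
    """Estimate wall-clock days for an all-pairs comparison of n_lines lines."""
    pairs = n_lines * n_lines
    seconds = pairs / (pairs_per_second_per_core * cores)
    return seconds / 86_400  # seconds per day


# Hypothetical throughput; replace with a number measured on real data.
ASSUMED_RATE = 1_000_000

full = pairwise_days(500_000_000, ASSUMED_RATE, cores=64)
small = pairwise_days(1_000_000, ASSUMED_RATE, cores=64)
print(f"500M lines: {full:,.0f} days, 1M lines: {small * 24:.1f} hours")
```

Because the cost grows quadratically, going from 1 million to 500 million lines multiplies the work by 250,000, not by 500.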

What is the larger goal you are trying to achieve? Is that search?

crbsram commented 8 months ago

Hi @ashvardanian, let me set up a smaller benchmark with 1 million and 10 million lines and share the results. The larger goal is to find the similarity between two strings using the trigram method, hence we have no choice other than iterating over all the lines in the file.
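For reference, one common reading of "the trigram method" is Jaccard similarity over sets of character trigrams. The sketch below is plain Python and an assumption about the variant in use. It does not pad strings the way some database implementations do, and the names `trigrams` and `trigram_similarity` are illustrative.

```python
def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the trigram sets of a and b, in [0.0, 1.0]."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        # Both strings are shorter than 3 characters: no trigrams to compare.
        return 0.0
    return len(ta & tb) / len(union)
```

For example, `trigram_similarity("abcd", "bcde")` compares `{"abc", "bcd"}` with `{"bcd", "cde"}`, giving one shared trigram out of three distinct ones.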

Thanks C.R.Bala

ashvardanian commented 7 months ago

Hi @crbsram! How is it going? Any preliminary results?