attractivechaos / klib

A standalone and lightweight C library
http://attractivechaos.github.io/klib/
MIT License

khash.h seems unable to hold more than 40M distinct string keys #185

Closed by wulj2 1 month ago

wulj2 commented 1 month ago

Thank you for your excellent library. I recently used khash.h and kstring.h to write a small program that prints the lines unique to one of two files. It works in most cases without any problems. However, today I ran it on two large files, each with more than 40M lines and about 2.5 GB in size, and it crashed with a segmentation fault. I tried debugging with gdb but learned nothing useful. Is there a limit on the maximum number of keys khash can hold?

The source of my small program, axb.c, is below. khash.h, kstring.h, and kstring.c are the latest versions from the repo. I compile with gcc -std=c11 -O3 -o axb axb.c kstring.c

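/* strndup is POSIX, not ISO C: under gcc -std=c11, glibc's <string.h> does not
   declare it unless a feature-test macro such as this one is defined first.
   An undeclared function may be treated as returning int, which can truncate
   64-bit pointers. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "khash.h"
#include "kstring.h"

KHASH_SET_INIT_STR(s4s)

int main(int argc, char** argv){
    if(argc < 3){
        fprintf(stderr, "get line unique in file a\n");
        fprintf(stderr, "usage: %s <a> <b>\n", argv[0]);
        return 0;
    }
    kh_s4s_t* sb = kh_init_s4s();
    kstring_t line = {0, 0, 0};
    int hret = 0;
    // load every distinct line of b into the set
    FILE* fr = fopen(argv[2], "r");
    if(!fr){
        perror(argv[2]);
        return 1;
    }
    while(line.l = 0, kgetline(&line, (kgets_func*)fgets, fr) >= 0){
        khiter_t itr = kh_get_s4s(sb, line.s);
        if(itr == kh_end(sb)){
            // the set stores only the pointer, so each key needs its own copy
            kh_put_s4s(sb, strndup(line.s, line.l), &hret);
        }
    }
    fclose(fr);
    // do diff: print each line of a that is not in b
    fr = fopen(argv[1], "r");
    if(!fr){
        perror(argv[1]);
        return 1;
    }
    while(line.l = 0, kgetline(&line, (kgets_func*)fgets, fr) >= 0){
        if(kh_get_s4s(sb, line.s) == kh_end(sb)){
            fprintf(stderr, "%s\n", line.s);
        }
    }
    fclose(fr);
    // release the line buffer, the copied keys, and the table itself
    if(line.s) free(line.s);
    for(khiter_t itr = kh_begin(sb); itr != kh_end(sb); ++itr){
        if(kh_exist(sb, itr)){
            free((char*)kh_key(sb, itr));
            kh_del_s4s(sb, itr);
        }
    }
    kh_destroy_s4s(sb);
    return 0;
}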
wulj2 commented 1 month ago

The input files that trigger the bug are too large to attach, so I compressed them and uploaded them to Google Drive: input1 and input2.

wulj2 commented 1 month ago

Sorry, I think Google's CityHash can handle these files. It solves my problem.