freeseek / score

Tools to work with GWAS-VCF summary statistics files
MIT License
102 stars 8 forks

Too many open files in temp, could not read temp #15

Open NiharikaCUSB opened 3 months ago

NiharikaCUSB commented 3 months ago

I am running bcftools +liftover to convert a VCF file generated with hg19 to hg38. I got these errors/warnings:

Merging 1466 temporary files
[E::hts_open_format] Failed to open file "TempDk63j4/01022.bcf" : Too many open files
Could not read TempDk63j4/01022.bcf: Too many open files
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr1:1654143
[E::faidx_adjust_position] The sequence "X" was not found
Unable to fetch sequence at X:200820--1

I used the hg19 and hg38 references and the chain file from UCSC. I cannot paste any snippet of the VCF, as that might raise a copyright issue.

freeseek commented 3 months ago

It looks like you have used GRCh37 rather than hg19. At least for --src-fasta-ref you need to use the correct reference genome. See here for how to install the reference genome for GRCh37. It also seems that you have a limit of 1024 on the number of files you can open. If you run the command ulimit -n, what result do you get? One solution would be to allow bcftools sort to create larger file shards. This will require more memory, but it will reduce the number of shards, which in your case is 1466. You can use the option --max-mem 2G to achieve that.
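The relationship between the open-file limit and the number of shards can be sketched in C. This is only an illustration under a simplified assumption (one temporary file per --max-mem worth of buffered records), not code from BCFtools; the helper names open_file_soft_limit, estimate_shards, and shards_fit are hypothetical. Only getrlimit with RLIMIT_NOFILE is a real POSIX call, and it reports the same soft limit as ulimit -n.

```c
#include <stdint.h>
#include <sys/resource.h>

/* Soft limit on open file descriptors, i.e. the value `ulimit -n` reports.
   Returns 0 if the limit cannot be queried. */
static uint64_t open_file_soft_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;
    return (uint64_t)rl.rlim_cur;
}

/* Simplified model (an assumption, not the BCFtools implementation):
   one temporary file is written for every max_mem_bytes of records. */
static uint64_t estimate_shards(uint64_t data_bytes, uint64_t max_mem_bytes)
{
    if (max_mem_bytes == 0) return 0;
    return (data_bytes + max_mem_bytes - 1) / max_mem_bytes; /* ceiling division */
}

/* Returns 1 if all shards could be open at once while keeping `reserved`
   descriptors free for stdin, stdout, stderr and the output file. */
static int shards_fit(uint64_t shards, uint64_t limit, uint64_t reserved)
{
    return limit > reserved && shards <= limit - reserved;
}
```

Under this model, shards_fit(1466, 1024, 4) is false, which is consistent with the merge failing at file 01022.bcf in the log above. Raising --max-mem lowers the estimate_shards result, and raising ulimit -n raises the limit; either change can make the merge fit.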

NiharikaCUSB commented 1 month ago

Thank you for your reply. I followed your suggestions: I changed the reference genome, raised the limit on open files to 4096 (ulimit -n now returns 4096), and used --max-mem 450G.

I am still running into the same problem:

Warning: source contig chrM has length 16569 in the VCF and length 16571 in the chain file
INFO/AC is handled by AC rule
INFO/AF is handled by AF rule
INFO/MLEAC is handled by AGR rule
INFO/MLEAF is handled by AGR rule
FORMAT/PL is handled by AGR rule
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr13:98440840
Lines total/swapped/reference added/rejected: 56480/5/1/0
Merging 56480 temporary files
[E::hts_open_format] Failed to open file "Tempd88MRv/04094.bcf" : Too many open files
Could not read Tempd88MRv/04094.bcf: Too many open files
Cleaning

Note: my input VCF is 200 GB. However, I have split it by chromosome and obtained 23 smaller VCFs.

freeseek commented 1 month ago

The lines:

Lines total/swapped/reference added/rejected: 56480/5/1/0
Merging 56480 temporary files

indicate that you created as many temporary files as there are variants, so some other problem must be at play. It does not seem that the option --max-mem 450G is actually taking effect. Can you share the exact command line you used? BCFtools/sort parses the maximum amount of memory using the following function:

size_t parse_mem_string(const char *str)
{
    char *tmp;
    double mem = strtod(str, &tmp);
    if ( tmp==str ) error("Could not parse the memory string: \"%s\"\n", str);
    if ( !strcasecmp("k",tmp) ) mem *= 1000;
    else if ( !strcasecmp("m",tmp) ) mem *= 1000*1000;
    else if ( !strcasecmp("g",tmp) ) mem *= 1000*1000*1000;
    return mem;
}

I am wondering whether size_t might be a 32-bit integer rather than a 64-bit integer on your system. Who compiled BCFtools in your installation? Can you check whether the BCFtools binary is a 32-bit or a 64-bit binary?
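To make the concern concrete, here is a minimal C sketch that parses a memory string into a fixed 64-bit value and checks whether it would fit in a 32-bit integer. It is not the BCFtools code; the helpers mem_string_to_bytes, fits_in_width, and size_t_bits are hypothetical names introduced for illustration, and the suffix handling only mirrors the decimal multipliers in the function quoted above.

```c
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <strings.h> /* strcasecmp (POSIX) */

/* Hypothetical helper: parse strings such as "450G" into a 64-bit byte
   count using decimal suffixes. Returns 0 if no number can be parsed. */
static uint64_t mem_string_to_bytes(const char *str)
{
    char *end;
    double mem = strtod(str, &end);
    if (end == str || mem < 0) return 0;
    if (!strcasecmp("k", end)) mem *= 1e3;
    else if (!strcasecmp("m", end)) mem *= 1e6;
    else if (!strcasecmp("g", end)) mem *= 1e9;
    if (mem >= 18446744073709551615.0) return UINT64_MAX; /* clamp */
    return (uint64_t)mem;
}

/* Returns 1 if `bytes` is representable in an unsigned integer that is
   `bits` wide. */
static int fits_in_width(uint64_t bytes, unsigned bits)
{
    if (bits >= 64) return 1;
    return bytes <= ((UINT64_C(1) << bits) - 1);
}

/* Width of size_t, in bits, for the build that compiles this file. */
static unsigned size_t_bits(void)
{
    return (unsigned)(sizeof(size_t) * CHAR_BIT);
}
```

With these helpers, fits_in_width(mem_string_to_bytes("450G"), 32) is false while the same check for "2G" is true, so a 32-bit size_t could not hold the 450G setting. Calling size_t_bits() in a program built with the same toolchain as the BCFtools binary shows which case applies.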

Also note that the issue of too many open files was reported and fully resolved in BCFtools 1.21.