NiharikaCUSB opened this issue 3 months ago
It looks like you have used GRCh37 rather than hg19. At least for `--src-fasta-ref` you need to use the correct reference genome; see here for how to install the reference genome for GRCh37. It also seems like you have a 1024 limit on the number of files you can open. If you run the command `ulimit -n`, what result do you get? One solution would be to allow `bcftools sort` to create fewer, larger file shards. This will require more memory, but it will reduce the number of shards, which in your case is 1466. You can use the option `--max-mem 2G` to that end.
Thank you for your reply. I followed your suggestions: I changed the reference genome, set the limit of open files to 4096 (`ulimit -n` now returns 4096), and used `--max-mem 450G`. Still, I am running into the same problem:
```
Warning: source contig chrM has length 16569 in the VCF and length 16571 in the chain file
INFO/AC is handled by AC rule
INFO/AF is handled by AF rule
INFO/MLEAC is handled by AGR rule
INFO/MLEAF is handled by AGR rule
FORMAT/PL is handled by AGR rule
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr13:98440840
Lines total/swapped/reference added/rejected: 56480/5/1/0
Merging 56480 temporary files
[E::hts_open_format] Failed to open file "Tempd88MRv/04094.bcf" : Too many open files
Could not read Tempd88MRv/04094.bcf: Too many open files
Cleaning
```
Note: my input VCF is 200 GB; however, I've split it by chromosome into 23 smaller VCFs.
The lines:

```
Lines total/swapped/reference added/rejected: 56480/5/1/0
Merging 56480 temporary files
```

indicate that you created as many temporary files as the number of variants, so some other problem must be at play. It does not seem that the option `--max-mem 450G` is really taking effect. Can you share the exact command line you used? BCFtools/sort encodes the maximum amount of memory using the following function:
```c
size_t parse_mem_string(const char *str)
{
    char *tmp;
    double mem = strtod(str, &tmp);
    if ( tmp==str ) error("Could not parse the memory string: \"%s\"\n", str);
    if ( !strcasecmp("k",tmp) ) mem *= 1000;
    else if ( !strcasecmp("m",tmp) ) mem *= 1000*1000;
    else if ( !strcasecmp("g",tmp) ) mem *= 1000*1000*1000;
    return mem;
}
```
I am wondering whether it is possible that `size_t` is a 32-bit integer rather than a 64-bit integer on your system. Who compiled BCFtools in your installation? Can you check whether the BCFtools binary is a 32-bit or a 64-bit binary?
Also notice that the issue of too many open files was reported and fully resolved in BCFtools 1.21.
I am running `bcftools +liftover` to convert a VCF file generated with hg19 to hg38. I got these errors/warnings:

```
Merging 1466 temporary files
[E::hts_open_format] Failed to open file "TempDk63j4/01022.bcf" : Too many open files
Could not read TempDk63j4/01022.bcf: Too many open files
[W::vcf_parse_format_fill5] Extreme FORMAT/PL value encountered and set to missing at chr1:1654143
[E::faidx_adjust_position] The sequence "X" was not found
Unable to fetch sequence at X:200820--1
```
I've used hg19, hg38, and the chain file from UCSC. I cannot paste any snippet of the VCF as it might raise a copyright issue.