bedops / bedops

🔬 BEDOPS: high-performance genomic feature operations
https://bedops.readthedocs.io/

vcf2bed failed with "Error: Could not find newline in intermediate buffer; check input [11821 | 29288 | 41109]" #230

Closed by yunhailuo 5 years ago

yunhailuo commented 5 years ago

I tried to convert the dbSNP VCF to BED: ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.25.gz

I gunzipped it and ran the following command: bin/vcf2bed --do-not-split --do-not-sort < GCF_000001405.25 > GCF_000001405.25.bed &

After running normally (output looked fine) for a while, I got:

Error: Could not find newline in intermediate buffer; check input [11821 | 29288 | 41109]
       Please check that your input contains Unix newlines (cat -A) or increase TOKENS_MAX_LENGTH in BEDOPS.Constants.hpp and recompile BEDOPS.
convert2bed
  version:  2.4.36
  author:   Alex Reynolds

  Usage:

  $ convert2bed --input=fmt [--output=fmt] [options] < input > output

  Convert BAM, GFF, GTF, GVF, PSL, RepeatMasker (OUT), SAM, VCF
  and WIG genomic formats to BED or BEDOPS Starch (compressed BED)

  Input can be a regular file or standard input piped in using the
  hyphen character ('-'):

  $ some_upstream_process ... | convert2bed --input=fmt - > output

  Input (required):

  --input=[bam|gff|gtf|gvf|psl|rmsk|sam|vcf|wig] (-i <fmt>)
      Genomic format of input file (required)

  Output:

  --output=[bed|starch] (-o <fmt>)
      Format of output file, either BED or BEDOPS Starch (optional, default is BED)

  Other processing options:

  --do-not-sort (-d)
      Do not sort BED output with sort-bed (not compatible with --output=starch)
  --max-mem=<value> (-m <val>)
      Sets aside <value> memory for sorting BED output. For example, <value> can
      be 8G, 8000M or 8000000000 to specify 8 GB of memory (default is 2G)
  --sort-tmpdir=<dir> (-r <dir>)
      Optionally sets [dir] as temporary directory for sort data, when used in
      conjunction with --max-mem=[value], instead of the host's operating system
      default temporary directory
  --starch-bzip2 (-z)
      Used with --output=starch, the compressed output explicitly applies the bzip2
      algorithm to compress intermediate data (default is bzip2)
  --starch-gzip (-g)
      Used with --output=starch, the compressed output applies gzip compression on
      intermediate data
  --starch-note="xyz..." (-e "xyz...")
      Used with --output=starch, this adds a note to the Starch archive metadata
  --help | --help[-bam|-gff|-gtf|-gvf|-psl|-rmsk|-sam|-vcf|-wig] (-h | -h <fmt>)
      Show general help message (or detailed help for a specified input format)
  --version (-w)
      Show application version

[1]+  Exit 22                 bin/vcf2bed --do-not-split --do-not-sort < GCF_000001405.25 > GCF_000001405.25.bed

I tried to check line lengths as mentioned in #208 with awk '{print length($0);}' GCF_000001405.25 | sort -nr | head -3 and got:

43619
36867
31952

It doesn't seem like any line goes beyond 5 MB. Free memory at the start is about 30 GB, and I used the default 2G --max-mem.
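For reference, a variant of the same check that also reports which line is the longest, plus a count of carriage returns (non-zero would mean Windows line endings, which cat -A shows as "^M$"). A sketch, using the same filename as above:

```shell
# Print the line number and byte length of the longest line.
# GCF_000001405.25 stands in for the actual input; substitute as needed.
awk '{ if (length($0) > max) { max = length($0); nr = NR } }
     END { print nr, max }' GCF_000001405.25

# Count CR characters; any non-zero count means non-Unix newlines.
tr -dc '\r' < GCF_000001405.25 | wc -c
```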

Any suggestions on what I'm missing, @alexpreynolds ?

alexpreynolds commented 5 years ago

Can you try the "megarow" build of BEDOPS? Please see step 4 here: https://bedops.readthedocs.io/en/latest/content/installation.html#via-source-code

The "megarow" build should allow a longer line. Or you can adjust constants as described in the error message, but the custom build may help you directly.

yunhailuo commented 5 years ago

@alexpreynolds Thank you for your quick reply. Unfortunately, I got the same error (though I'm not sure if it's from the same row). Do you have any other suggestions? Is there a way to figure out which row is problematic?

alexpreynolds commented 5 years ago

I'm away until Tuesday but will try out the FTP link as soon as I'm back. I'm sure we can figure this out.

yunhailuo commented 5 years ago

Please take your time. Thank you in advance.

alexpreynolds commented 5 years ago

I tested the megarow build of convert2bed (which vcf2bed calls) and it was able to convert the FTPed VCF file without errors:

$ git clone https://github.com/bedops/bedops.git
$ cd bedops
$ make megarow
...
$ ./applications/bed/conversion/bin/convert2bed-megarow --input=vcf --do-not-split --do-not-sort < ../GCF_000001405.25.vcf > ../GCF_000001405.25.bed 2> ../GCF_000001405.25.bed.log
$ ls -al ../GCF_000001405.25.bed*
-rw-r--r--  1 areynolds stamlab 100657233343 Nov 12 15:43 GCF_000001405.25.bed
-rw-r--r-- 1 areynolds stamlab 0 Nov 12 15:21 ../GCF_000001405.25.bed.log

Perhaps you are still running the so-called "typical" binaries, which are compiled with line-length limits that would impact conversion of this specific VCF file.

Can you please list the steps you took to build megarow binaries? Or can you please describe how you are installing BEDOPS, as well as what platform/kernel you are using? Thanks for your patience.

yunhailuo commented 5 years ago

Thank you so much for trying it out, @alexpreynolds

Based on my shell history, I had run:

$ git clone https://github.com/bedops/bedops.git
$ cd bedops/
$ make all
$ make install_all
$ export PATH="/home/ubuntu/dbsnp/bedops/bin:$PATH"
$ bin/vcf2bed-megarow --do-not-split --do-not-sort < /home/ubuntu/dbsnp/GCF_000001405.25 > GCF_000001405.25.bed

I tried these on Ubuntu:

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.6 LTS
Release:    14.04
Codename:   trusty

I'll purge everything, try your steps and let you know.

yunhailuo commented 5 years ago

It worked with the following:

$ git clone https://github.com/bedops/bedops.git
...
$ cd bedops/
$ make megarow
...
$ make install_megarow
...
$ export PATH="/home/ubuntu/dbsnp/bedops/bin:$PATH"
$ bin/convert2bed-megarow --input=vcf --do-not-split --do-not-sort < ../GCF_000001405.25 > ../GCF_000001405.25.bed 2> ../GCF_000001405.25.bed.log
$ cat ../GCF_000001405.25.bed.log 
-bash: bin/convert2bed-megarow: No such file or directory
$ bin/convert2bed --input=vcf --do-not-split --do-not-sort < ../GCF_000001405.25 > ../GCF_000001405.25.bed 2> ../GCF_000001405.25.bed.log

Thank you very much for all the help!

ashamehta commented 3 years ago

Hello - I'm experiencing a similar issue:

 ./vcf2bed-megarow --keep-header < input.vcf

Error: Could not find newline in intermediate buffer; check input [39704 | 1405 | 41109]
       Please check that your input contains Unix newlines (cat -A) or increase TOKENS_MAX_LENGTH in BEDOPS.Constants.hpp and recompile BEDOPS. 

vcf2bed-megarow was installed via make all, and I can't reinstall with make megarow because I'm unable to install the required static libraries on the computing cluster I'm using. Is there an alternative solution?

alexpreynolds commented 3 years ago

You could perhaps try the precompiled binaries in the Releases page:

https://github.com/bedops/bedops/releases

If you're on Linux, you can use the instructions here to extract and put items in a useful directory:

https://bedops.readthedocs.io/en/latest/content/installation.html#linux

Once installed, you should then be able to use the switch-BEDOPS-binary-type helper script to switch between typical and megarow (and float128, though probably not useful here):

$ switch-BEDOPS-binary-type --help
Switch the BEDOPS binary build to typical, megarow, or float128
Usage: switch-BEDOPS-binary-type [ --help ] [ --typical | --megarow | --float128 ] [ <binary-directory> (optional) ]

ashamehta commented 3 years ago

That worked, thank you!

jingydz commented 1 year ago

/usr/local/bin/vcf2bed-megarow --input=vcf --do-not-split --do-not-sort --max-mem 30G <$(zcat 5000.genotype.vcffilter.vcf.gz.gz) >test.output.sort.bed

-bash: xrealloc: cannot allocate 18446744071562067968 bytes (1331200 bytes allocated)

Has anyone encountered this problem?

alexpreynolds commented 1 year ago

Looking at this:

zcat 5000.genotype.vcffilter.vcf.gz.gz

There are two gz extensions. Is it possible this is (for whatever reason) doubly-compressed and needs a second decompression, e.g.:

... <(gunzip -c 5000.genotype.vcffilter.vcf.gz.gz | gunzip -c) ...

?
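One quick way to check for double compression is to see whether a single decompression pass still begins with the gzip magic bytes 1f 8b. The filenames below are illustrative, with a small doubly-compressed file built on the spot:

```shell
# Build a doubly-compressed example to illustrate
printf 'chr1\t100\trs1\tA\tG\n' > example.vcf
gzip -c example.vcf | gzip -c > example.vcf.gz.gz

# After one gunzip pass the stream still starts with 1f 8b,
# i.e. it is still gzip data and needs a second pass
gunzip -c example.vcf.gz.gz | od -An -tx1 -N2

# Two passes recover the original text
gunzip -c example.vcf.gz.gz | gunzip -c
```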