biocore-ntnu / epic2

Ultraperformant reimplementation of SICER
https://doi.org/10.1093/bioinformatics/btz232
MIT License

Error handling -1 positions in BEDPE files #67

Open bu-bgregor opened 1 year ago

bu-bgregor commented 1 year ago

epic2 fails when reading a BEDPE file with a -1 value for a start or end position:

Traceback (most recent call last):
  File "/share/pkg.7/epic2/0.0.52/install/epic2_env/bin/epic2", line 257, in <module>
    _main(args)
  File "/share/pkg.7/epic2/0.0.52/install/epic2_env/lib/python3.10/site-packages/epic2/main.py", line 35, in _main
    effective_genome_length, chromsizes = egl_and_chromsizes(args)
  File "epic2/src/genome_info.pyx", line 320, in epic2.src.genome_info.egl_and_chromsizes
  File "epic2/src/genome_info.pyx", line 136, in epic2.src.genome_info.find_readlength
OverflowError: can't convert negative value to uint32_t

uint32_t types are used to read the start/end positions in genome_info.pyx and read_files.cpp, and an unsigned type cannot represent a negative value. The BEDPE format (https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format) allows the -1 value to indicate unknown positions:

start1 - The zero-based starting position of the first end of the feature on chrom1. The first base in a chromosome is numbered 0. As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature. This column is required. Use -1 for unknown.
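To illustrate why -1 cannot be stored in an unsigned 32-bit field, here is a minimal Python sketch. It is not epic2's code: `fits_uint32` and `wrapped_uint32` are hypothetical helper names. A checked conversion rejects the value, as in the traceback above, while an unchecked C-style conversion would silently wrap it around to a very large position.

```python
import ctypes

UINT32_MAX = 2**32 - 1


def fits_uint32(value: int) -> bool:
    """Return True if value can be stored in an unsigned 32-bit integer."""
    return 0 <= value <= UINT32_MAX


def wrapped_uint32(value: int) -> int:
    """Return what an unchecked conversion to uint32 would store.

    ctypes performs no overflow checking, so it mimics C-style wraparound.
    """
    return ctypes.c_uint32(value).value


print(fits_uint32(-1))     # False
print(wrapped_uint32(-1))  # 4294967295
```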

Any suggestion on how to handle this?

endrebak commented 1 year ago

How should I handle unknown positions? I do not see a way.

It would be better to remove unknown positions first:

grep -vw -e '-1' bad.bedpe > good.bedpe
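A text-pattern filter looks at the whole line rather than at specific columns. A stricter alternative is to check only the four coordinate columns (start1, end1, start2, end2). The following is a rough sketch of that idea, not part of epic2. The function names are made up for illustration, and it assumes a tab-separated BEDPE file with the standard column order.

```python
import io

# 0-based indices of start1, end1, start2, end2 in a BEDPE record
POSITION_COLUMNS = (1, 2, 4, 5)


def has_unknown_position(line: str) -> bool:
    """Return True if any coordinate column of a BEDPE line is -1."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 6:
        # Not a full BEDPE record (e.g. a header or comment): leave it alone.
        return False
    return any(fields[i] == "-1" for i in POSITION_COLUMNS)


def filter_bedpe(src, dst) -> int:
    """Copy lines without unknown positions from src to dst.

    Returns the number of dropped records.
    """
    dropped = 0
    for line in src:
        if has_unknown_position(line):
            dropped += 1
        else:
            dst.write(line)
    return dropped


# Small in-memory demonstration; the second record has unknown positions.
sample = (
    "chr1\t100\t200\tchr1\t300\t400\tread-1\t30\t+\t-\n"
    "chr1\t-1\t-1\tchr1\t300\t400\tread2\t30\t.\t-\n"
)
out = io.StringIO()
print(filter_bedpe(io.StringIO(sample), out))  # 1
print(out.getvalue(), end="")                  # only the first record remains
```

For real files, the same function can be called with `open("bad.bedpe")` as `src` and `open("good.bedpe", "w")` as `dst`. Note that the record named `read-1` is kept, because only the coordinate columns are inspected.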
