BIMSBbioinfo / janggu

Deep learning infrastructure for genomics
GNU General Public License v3.0
flank-option close to chromosome boundaries leads to invalid genomic intervals, which can't be parsed correctly by utils._str_to_iv #8

Closed remomomo closed 5 years ago

remomomo commented 5 years ago

I've run into an issue when trying to load FASTA sequences from a reference genome.

My ROI BED file contains a region with the following coordinates:

    chr17   0   600 .   1   +   22,23,24,25,114,121,122

I try to load the sequences from an hg19 reference:

    seq_train = Bioseq.create_from_refgenome('seq_train', refgenome=args.ref, roi=args.roi_train, flank=200, store_whole_genome=False)

This gives me an error:

      File "/home/rmonti/miniconda3/envs/janggu/lib/python3.6/site-packages/janggu/utils.py", line 308, in _str_to_iv
        raise Exception('Unable to parse {} into genomic interval:\n{}'.format(givstr, e))
    Exception: Unable to parse chr17:-200-800 into genomic interval:
    invalid literal for int() with base 10: ''

The original error message contained only the last line (invalid literal...); I added the exception handling to see what was going on.
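To illustrate where the message likely comes from, here is a minimal sketch. It is not janggu's actual `utils._str_to_iv`; the helpers `flanked_interval_string` and `naive_parse` are hypothetical and only assume that the interval string is split on `:` and `-`. Applying `flank=200` to a region starting at 0 gives a start of -200, and the leading minus sign is then mistaken for the start/end separator.

```python
# Illustrative reconstruction only: NOT janggu's actual utils._str_to_iv.
# Both helper names below are hypothetical.

def flanked_interval_string(chrom, start, end, flank):
    """Extend an interval by `flank` on both sides without clipping."""
    return '{}:{}-{}'.format(chrom, start - flank, end + flank)


def naive_parse(givstr):
    """Parse 'chrom:start-end' by splitting on ':' and then '-'."""
    chrom, coords = givstr.split(':')
    parts = coords.split('-')
    # For '-200-800', parts == ['', '200', '800'], so int('') fails.
    return chrom, int(parts[0]), int(parts[1])


givstr = flanked_interval_string('chr17', 0, 600, flank=200)
print(givstr)  # chr17:-200-800

try:
    naive_parse(givstr)
except ValueError as err:
    # invalid literal for int() with base 10: ''
    print('parse failed:', err)
```

Under this assumption, any region closer to the chromosome start than the flank size would hit the same failure.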

Arguably, requesting sequence beyond the chromosome boundaries should raise an error, but I would prefer a warning, with the out-of-range positions zero-padded instead.

wkopp commented 5 years ago

Thanks @remomomo for pointing that out. I've fixed this issue. Now, subintervals that stretch beyond the chromosome start or end will be zero-padded automatically.
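For readers unfamiliar with the idea, zero-padding of out-of-range positions can be sketched roughly as follows. This is a pure-Python illustration, not janggu's implementation: the function `one_hot_with_padding`, the `ACGT` channel order, and the toy chromosome are all assumptions made for the example.

```python
# Hypothetical sketch of zero-padding; not taken from the janggu code base.

def one_hot_with_padding(chrom_seq, start, end, alphabet='ACGT'):
    """One-hot encode positions start..end-1 of chrom_seq.

    Positions outside the chromosome (negative or beyond its length)
    and unknown bases such as 'N' are encoded as all-zero rows.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for pos in range(start, end):
        row = [0] * len(alphabet)
        if 0 <= pos < len(chrom_seq):
            base = chrom_seq[pos].upper()
            if base in index:
                row[index[base]] = 1
        encoded.append(row)
    return encoded


# Toy chromosome of length 4, requested with a flank of 2 on both sides.
rows = one_hot_with_padding('ACGT', start=-2, end=6)
for row in rows:
    print(row)
```

The first two and last two rows are all zeros, while the four rows in between encode A, C, G and T, so the output always has `end - start` rows regardless of the chromosome boundaries.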