czbiohub-sf / yogo

The "you only glance once" object detection model
BSD 3-Clause "New" or "Revised" License

OverflowError when drawing predictions when dataset coordinates have too many decimal points #150

Open zbarry opened 5 months ago

zbarry commented 5 months ago

Hi! Thanks for putting together this awesome tool. I wanted to flag this for you, as I encountered it when creating a new dataset:

When your coordinates have too many decimal places, you get this error when drawing the yogo predictions:

Traceback (most recent call last):
  File "/opt/conda/bin/yogo", line 8, in <module>
    sys.exit(main())
  File "/home/zachary/yogo/yogo/__main__.py", line 14, in main
    do_training(args)
  File "/home/zachary/yogo/yogo/train.py", line 654, in do_training
    mp.spawn(
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
OverflowError: Python int too large to convert to C long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/zachary/yogo/yogo/train.py", line 84, in train_from_ddp
    trainer.train()
  File "/home/zachary/yogo/yogo/train.py", line 342, in train
    self._validate()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/zachary/yogo/yogo/train.py", line 410, in _validate
    draw_yogo_prediction(
  File "/home/zachary/yogo/yogo/utils/utils.py", line 253, in draw_yogo_prediction
    draw.text((r[0], r[1]), label, (0, 0, 0, 255), font_size=16)
  File "/opt/conda/lib/python3.10/site-packages/PIL/ImageDraw.py", line 556, in text
    draw_text(ink)
  File "/opt/conda/lib/python3.10/site-packages/PIL/ImageDraw.py", line 540, in draw_text
    self.draw.draw_bitmap(coord, mask, ink)
SystemError: <method 'draw_bitmap' of 'ImagingDraw' objects> returned a result with an exception set

In this case, my labels looked like: 0 0.05694444291293621 0.04537036921828985 0.037962961941957474 0.037962960079312325

When I truncated them to just 5 decimal places, this issue went away.
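For anyone who needs the same workaround, here is a minimal sketch of shortening a label line. It assumes each line is a class index followed by four normalized box values, as in the example above. The helper name `shorten_label_line` is hypothetical and not part of YOGO, and it rounds rather than strictly truncating.

```python
def shorten_label_line(line: str, ndigits: int = 5) -> str:
    """Format the four box values of a label line with `ndigits` decimals.

    Assumes the layout "<class> <v1> <v2> <v3> <v4>" (an assumption based on
    the example label in this issue). The class field is left untouched.
    """
    class_id, *coords = line.split()
    if len(coords) != 4:
        raise ValueError(f"expected 4 box values, got {len(coords)}")
    # Fixed-point formatting rounds to the nearest value at `ndigits` places.
    shortened = [f"{float(c):.{ndigits}f}" for c in coords]
    return " ".join([class_id, *shortened])


example = (
    "0 0.05694444291293621 0.04537036921828985 "
    "0.037962961941957474 0.037962960079312325"
)
print(shorten_label_line(example))  # prints: 0 0.05694 0.04537 0.03796 0.03796
```

This only sidesteps the symptom; it does not explain why the longer values trigger the error.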

Axel-Jacobsen commented 5 months ago

Hi! Thanks for giving YOGO a go! I only have a phone right now, so it'll be difficult for me to debug. Maybe this can be fixed with a cast before that draw command? But something tells me that something else is up.
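A rough sketch of what such a cast could look like, assuming the values passed to the draw call are meant to be pixel coordinates. The helper `safe_xy` is hypothetical and not part of YOGO or Pillow, and whether a cast addresses the real cause is not established here.

```python
import math


def safe_xy(x, y, width: int, height: int) -> tuple[int, int]:
    """Cast a coordinate pair to plain Python ints clamped to the image.

    Hypothetical helper: `x` and `y` are assumed to be pixel-space numbers
    (floats or anything `float()` accepts); `width` and `height` are the
    image size in pixels.
    """
    x, y = float(x), float(y)
    if not (math.isfinite(x) and math.isfinite(y)):
        raise ValueError(f"non-finite coordinate: ({x}, {y})")
    # Round, then clamp so the result always lies inside the image.
    xi = min(max(int(round(x)), 0), width - 1)
    yi = min(max(int(round(y)), 0), height - 1)
    return xi, yi


print(safe_xy(29.155, 23.23, 512, 512))   # an ordinary in-bounds point
print(safe_xy(-5.0, 1e30, 512, 512))      # out-of-range values get clamped
```

The clamped pair could then be passed as the position argument of the text-drawing call in place of `(r[0], r[1])`.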

I'll be able to properly address this in about a week, or maybe @i-jey can take a look?

Thanks

i-jey commented 5 months ago

Hi @zbarry, starting to dig into this now. Out of curiosity, what are the dimensions of the images you're using?

zbarry commented 5 months ago

My images are 512x512 - bounding boxes are originally based on pixel locations of cells before conversion to the normalized coordinates. The precision of the resulting floating point values is definitely overkill for the original image size.
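For context, the conversion is roughly like the sketch below. It assumes corner-format pixel boxes and a normalized center/width/height output; the function name and the exact box layout are illustrative assumptions, not the actual conversion script.

```python
def pixel_box_to_normalized(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to normalized (xc, yc, w, h).

    Illustrative only: the corner-format input and center-format output
    are assumptions about the label layout.
    """
    xc = (x_min + x_max) / 2 / img_w   # box center, as a fraction of width
    yc = (y_min + y_max) / 2 / img_h   # box center, as a fraction of height
    w = (x_max - x_min) / img_w        # box width, as a fraction of width
    h = (y_max - y_min) / img_h        # box height, as a fraction of height
    return xc, yc, w, h


# A 128x128-pixel box in a 512x512 image.
print(pixel_box_to_normalized(128, 64, 256, 192, 512, 512))
```

Writing these floats out directly is what produces the long decimal strings in the label files.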

paul-lebel commented 5 months ago

@zbarry, can you tell us more about the system you're running? OverflowError should only relate to integer size and whether you're on a 32-bit or 64-bit OS. I don't know enough about how torch does things under the hood, but if it is converting your floats to integer math, it makes sense that you'd get an overflow if you specify too many sigfigs. On our end we usually have 11, which is also vastly overkill.
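To illustrate the integer-size point in isolation, here is a small sketch using only the standard library. It shows how a Python int that exceeds the platform's C long raises OverflowError; it is generic Python behavior, not a reproduction of this bug.

```python
import array

# Width of a signed C long on this platform, as seen by the array module
# (typically 64 bits on 64-bit Linux, 32 bits on Windows).
c_long_bits = array.array("l").itemsize * 8
too_big = 2 ** (c_long_bits - 1)  # one past the largest signed C long


def fits_in_c_long(value: int) -> bool:
    """Return True if `value` can be stored in a signed C long."""
    try:
        array.array("l", [value])
        return True
    except OverflowError:
        return False


print(c_long_bits)
print(fits_in_c_long(too_big - 1))  # largest representable value
print(fits_in_c_long(too_big))      # one past it overflows
```

Python ints themselves are unbounded, so the error only appears at the boundary where a value is handed to C code expecting a fixed-width integer.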

Axel-Jacobsen commented 5 months ago

Also looking at this. I have a hunch it is a result of the bit-ness of the OS or something else. Note that we want to fix this such that the precision doesn't matter; many tools output normalized coords just by writing float(coord). It would be bad to have the user reduce the number of digits.

paul-lebel commented 5 months ago

@Axel-Jacobsen I definitely agree, it was more of a diagnostic question.

zbarry commented 5 months ago

It's a Google Cloud deep learning VM on 64-bit architecture running 64-bit Python in a pretty standard environment. I'm not sure there's anything particularly weird about the general env itself; it may instead be a bug with the particular package versions I'm using. I've attached the output of a conda env export for those package versions (renamed the .yml extension so GitHub would accept it).