agentmorris / MegaDetector

MegaDetector is an AI model that helps conservation folks spend less time doing boring things with camera trap images.

NMS time limit exceeded #74

Closed: agentmorris closed this issue 1 year ago

agentmorris commented 1 year ago

Hi! I am running MD on a set of images on my local GPU. I get the following warning very often (it seems to be correlated with when I am multitasking on the GPU).

WARNING: NMS time limit 0.330s exceeded

From my understanding, this is thrown by YOLOv5 when it's trying to merge bounding boxes in detections (though I'm not sure).

Should I increase the time limit? Are there other general good practices I am missing?


Issue cloned from Microsoft/CameraTraps, original issue posted by VLucet on Oct 11, 2022.

agentmorris commented 1 year ago

Thanks for reaching out... you're right that this is an issue that will impact results, and no, we don't see this often.

So on the one hand, it would be preferable if YOLOv5 just warned you there but didn't actually break; personally I would be comfortable either manually raising that time limit or commenting out the break, and we could do this (as a one-off patch to the YOLOv5 source) as part of our MD setup process.

But on the other hand... what do you mean by "multitasking on the GPU"? I would highly advise against running multiple jobs on the same GPU... this is playing with fire a little, and likely won't make things faster than just running jobs sequentially. So I would start by avoiding concurrent use of the GPU by multiple processes, and see if that resolves the issue. NMS actually happens on the CPU, so it's still possible for things to get hung up due to multiple unrelated processes running, but this is a good start.
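If it's helpful, here's one way to see exactly what's running on the GPU. This is just a sketch using the pynvml package, which is not part of the MD setup; running nvidia-smi from a terminal shows the same information.

```python
# Sketch: list the processes currently using GPU 0, to confirm that nothing
# else is competing with the MegaDetector run. Assumes the pynvml package;
# not part of the MD setup.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

compute = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)    # CUDA jobs, e.g. PyTorch
graphics = pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle)  # display/compositor, etc.

for label, procs in (('compute', compute), ('graphics', graphics)):
    for p in procs:
        mem_mb = (p.usedGpuMemory or 0) / 1e6
        print(f'{label}: pid {p.pid}, ~{mem_mb:.0f} MB of GPU memory')

pynvml.nvmlShutdown()
```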

I'll leave this issue open for a while to consider whether it's worth a manual patch to change this limit or eliminate the break statement as part of our setup process.


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

I see, thanks for the fast answer. I am new to computing on the GPU... by multitasking I mean that I am using the GPU for my other tasks (driving my monitors via HDMI, for example). I am on Pop!_OS, and it looks like I need to figure out how to isolate the GPU for just running torch and drive my monitor via the Intel graphics.


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

Oh, sorry, I misunderstood. That stuff should generally be fine; NVIDIA GPUs and their drivers are pretty good at sharing resources with the windowing system during normal operation.

But it could be the case that things behave differently in Pop!_OS, or that the windowing system doesn't play quite as nicely with the driver; it likely hasn't been tested as extensively under GPU workload scenarios as run-of-the-mill Gnome-on-Ubuntu. So in this case, as you suggested, it may be worth trying to use a separate video device for your system graphics (or ssh'ing into your machine and running entirely headless), at least to see whether that resolves the issue.

Let us know what you learn!


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

Ok, so the good people at System76 have actually thought about this type of thing: I can easily toggle the OS into compute graphics mode, making the GPU available for computing but leaving the rendering to the integrated graphics chip.

When I run MD that way, I don't get the warning from YOLOv5.

The only drawback it seems is that I can't use my external monitor, which on my ThinkPad seems to require HDMI from the GPU (but maybe there's a workaround?).

So anyway: it seems that it was the concurrent use of the GPU by the rendering engine that was the problem here.


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

Good data! Thanks for reporting back. I don't love that the timeout causes different results, but I'm also hesitant to get into the business of patching code as part of the setup (or forking YOLOv5 just for this), so if a few weeks go by and no one else encounters this issue, I'll close this issue without action; if anyone else chimes in to say that they're also encountering this, I'll add an optional step to the setup instructions.


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

So, I am still getting this warning while the GPU is in compute mode (only the Python process is running on it). It happens much less frequently now (so far, a few times in 100,000 images). But I still wonder whether there's a way to figure out what is happening.


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

In general, how busy is your CPU? Is the machine overall heavily loaded with other (non-GPU) tasks?

And actually, in your case, even if we manage to attribute this to heavy CPU load, I would recommend that you hard-code a change to either raise that time limit (to basically infinity) or remove the break statement. Even a few times in 100,000 images will be really annoying for you, since it means that you will get slightly different results on the same data in separate model runs, which will drive you bonkers if you do any debugging of the results.
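For concreteness, a minimal sketch of the "raise the time limit" option; this check lives in non_max_suppression() in utils/general.py of your YOLOv5 clone, though the exact lines and the default value vary by commit:

```python
# Sketch only: inside non_max_suppression() in utils/general.py of your
# YOLOv5 clone (exact location varies by commit).

time_limit = 1e9  # default produced 'NMS time limit 0.330s exceeded' in your log

# ...later, inside the per-image loop, this is the check that emits the warning:
if (time.time() - t) > time_limit:
    LOGGER.warning(f'WARNING: NMS time limit {time_limit:.3f}s exceeded')
    break  # with the huge time_limit above, this is never reached
```

Commenting out that break line instead accomplishes the same thing.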


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

For that test, I left my machine running with only my terminal open (I tried to close all background apps). The warning seems more likely to happen on specific files: when I restart MD from a previous checkpoint, it seems to be thrown only for images that belong to specific checkpoint chunks.

The Python process does use 100% of whichever core it is running on, though (what I mean is that in htop I can see the Python process bringing one core's/thread's usage to 100%).
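(In case it's useful to anyone, per-core load can also be logged programmatically with something like the sketch below; it assumes the psutil package, which isn't part of the MD setup, and htop is what I actually looked at.)

```python
# Sketch: print per-logical-core CPU utilization every couple of seconds
# while MD is running. Assumes the psutil package; not part of the MD setup.
import psutil

for _ in range(10):
    per_core = psutil.cpu_percent(interval=2.0, percpu=True)  # percent per logical core
    print(' '.join(f'{p:5.1f}' for p in per_core))
```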

To do what you suggest, should I just edit the source code of the YOLOv5 clone I have put on the path as part of the MD setup, or do I have to meddle with the files of the conda environment?


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

For now, yes, I would literally just edit your clone of YOLOv5 and either comment out the break statement or raise the time limit. You shouldn't have to change anything about the environment.

If you are able to track this down to specific images, that would be really interesting... I've never seen images where MegaDetector finds a thousand zillion overlapping boxes, but that would be a possible explanation here. Are you studying animals that occur in very large herds?


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

Ok, I'll do that and also try to track it down.

No large herds, but landscapes that I don't think MD has been trained on very much (northern Ontario)? I could probably afford to raise the detection threshold a bit; I've noticed that a lot of foliage gets detected with low confidence, so maybe it goes a little wild on some images with heavy foliage? It's more likely to be a me/my-machine/my-setup problem than an MD problem, though.


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

I agree, that sounds like relatively familiar territory for MD, so I don't think we'll find a satisfying answer in the bounding boxes themselves; this is more likely to be a machine-specific issue. But it's definitely one we'd like to solve; there should never be a scenario where you get different results depending on which machine you run MD on. I'm getting slightly more enthusiastic about at least providing an optional patching step as part of MD setup; if your manual patch is successful, I'll put that on the todo list.


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

Ok, so:

  1. Raising the threshold makes the warnings go away, or at least decreases their frequency (I still need to track down the specific files involved). I raised it to 0.1, then 0.5.
  2. I then reset the threshold to the default, commented out the break statement, and modified the call to the logger as follows:

     ```python
     # Original
     if (time.time() - t) > time_limit:
         LOGGER.warning(f'WARNING: NMS time limit {time_limit:.3f}s exceeded')
         break  # time limit exceeded
     ```

     ```python
     # Patched
     time_elapsed = time.time() - t
     if time_elapsed > time_limit:
         LOGGER.warning(f'WARNING: NMS time limit {time_limit:.3f}s exceeded / Time taken: {time_elapsed:.3f}s')
         # break  # time limit exceeded
     ```

     Which outputs, for example:

     ```bash
     WARNING: NMS time limit 0.330s exceeded / Time taken: 0.345s
     ```

Another example yields a time of 0.332s. So, at least for these couple of instances, a pretty small departure from the imposed time limit.


(Comment originally posted by VLucet)

agentmorris commented 1 year ago

OK, I saw this on at least one other machine, and it philosophically drives me bonkers that one might get slightly different results depending on the speed of one's machine. Submitting a PR to the YOLOv5 repo won't help us here, because we depend on an old commit of the repo, and the authors indicate in an issue that the right approach is to change the timeout in the code; i.e., this is working as intended. So our choices were to (a) add a bunch of patching steps to the MD setup instructions, or (b) fork the YOLOv5 repo, reset it to the commit we use, and make a change there to ensure that NMS timeouts are just a warning.

I went with option (b); it turns out this cleans up the setup instructions anyway. So now, rather than instructing users to clone the official YOLOv5 repo and reset to a specific commit, the MD setup instructions ask users to check out a fork of the YOLOv5 repo. The README in that fork highlights that the only changes are (a) rolling back to the commit we like and (b) this tiny NMS change.

@VLucet, it would be great if you could confirm that this solution works for you, but I'm going to close this issue now as fixed.

Thanks for bringing this up!


(Comment originally posted by agentmorris)

agentmorris commented 1 year ago

For posterity, the relevant change to the forked repo is here.


(Comment originally posted by agentmorris)