alessandroperilli opened 11 months ago
I did indeed introduce some restrictions, mainly affecting the Dino model, as I plan to test and incorporate the DinoV2 models in a test branch later on. For the SAM models, the only restriction should be that these nodes exclusively support HQ models.
Is there a need for non-HQ models? If so, I can make adjustments, although I personally believe that the mask quality, even with HQ models, isn't always very clean.
This is now working but the performance is very...inconsistent. There are a lot of strange things going on.
The SwinB model (which I assumed to be more accurate than the SwinT_OGC model) doesn't detect anything. Or rather, I think it does, but the Mask with prompt node doesn't show it. In the example below, I asked it to detect the eyes, and the terminal shows a confidence of 0.57:
(dnl13-seg) > phrase(confidence): ['eyes(0.57)']
However, the output is this:
Same test with the SwinT_OGC model:
With the SwinT_OGC model, certain entities are not detected properly. The same entities were flawlessly detected and segmented with the non-HQ models and the same SwinT_OGC model in the segment anything node. For example:
Other entities are recognized without issues, but they are less precisely segmented compared to what the segment anything node with non-HQ models used to do:
(the chin is partially cut)
Similarly, in the previous scenario, non-HQ models would segment each eye perfectly. The HQ model, instead, has included the portion of the face between the eyes. I'm not sure it's an improvement. I set the clean_mask_holes and clean_mask_island parameters to zero, but it made no difference.
If I manually set CPU, I get a reasonable speed (considering these are HQ models) and the weird output shown in the previous point. But if I leave it on AUTO, detection and segmentation take twice as long, and the results are even weirder:
If you would allow the SAM Model Loader to also load non-HQ models, like segment anything does, I could offer an apples-to-apples comparison, but I can already tell you that non-HQ models have generated exceptionally good masks in most of my past tests.
I also think that a lot of people with limited RAM would appreciate the option to load non-HQ models.
Hm, okay, those are very interesting observations. I just saw that you mainly use the HQ-H model. My tests indeed showed good, and sometimes even better, results with the Base (B) or Large (L) model in combination with SwinB for GroundingDino.
Ok, adding the SAM Base models shouldn't really be a problem. I'll put that on my todo list.
Oh, clean_mask_holes and clean_mask_island can take on very large values, as this seems to reflect the pixel density of the mask. The default value of 64 is mainly used to remove small parts of the mask; a value of 0 therefore makes no corrections to the mask.
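As a point of reference, here is a minimal sketch of how such an area-based cleanup could work, assuming the two thresholds are pixel areas as described above (this is not the node's actual implementation, just an illustration with OpenCV):

```python
import cv2
import numpy as np

def clean_mask(mask: np.ndarray, holes: int = 64, islands: int = 64) -> np.ndarray:
    """Remove foreground islands and fill background holes smaller than the given pixel areas."""
    mask = (mask > 0).astype(np.uint8)

    # Drop small foreground islands.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < islands:
            mask[labels == i] = 0

    # Fill small background holes by cleaning the inverted mask the same way.
    inv = 1 - mask
    n, labels, stats, _ = cv2.connectedComponentsWithStats(inv, connectivity=8)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < holes:
            inv[labels == i] = 0
    return 1 - inv
```

With this reading, a value of 0 disables the cleanup entirely, while larger values remove correspondingly larger artifacts.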
(*) The stupid thing is that the good masks usually have a lower box_threshold (i.e., confidence) than the bad masks. Maybe I can add a second slider here to better capture the masks. I'll have to test that.
I have an idea that the masks might not be created correctly if multimask is not activated. I will look into this.
Please try playing with the box_threshold here (this is part of the idea marked (*) above). The logic works as follows: a lower box_threshold lets more detected image features "through". This also means that both eyes may be detected individually, and my mask merging is not yet optimal there. A check with multimask definitely helps to see what it has really found. I urgently need to check the mask creation when multimask is not activated.
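To illustrate what box_threshold does in a typical GroundingDino inference path, here is a sketch of the usual filtering step (not necessarily this node's exact code; outputs, box_threshold, and the variable names are assumptions):

```python
import torch

# Assumed shapes: outputs["pred_logits"]: (num_queries, num_text_tokens); outputs["pred_boxes"]: (num_queries, 4)
logits = outputs["pred_logits"].sigmoid()   # per-query, per-token scores
scores = logits.max(dim=1)[0]               # best token score for each candidate box

keep = scores > box_threshold               # a lower threshold lets more boxes "through"
boxes = outputs["pred_boxes"][keep]         # cxcywh boxes, normalized to [0, 1]
confidences = scores[keep]
```

In a Grounded-SAM style pipeline, each surviving box then becomes a SAM prompt, which matches the behavior described above: a low threshold can yield two separate eye boxes that afterwards have to be merged into one mask.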
The automatic device selection is very interesting. In the Auto variant, comfy.model_management.get_torch_device() is used to find the device. Perhaps it then loads the models into the M2 graphics cores? Next week I will have the opportunity to test this myself on an M2 Mac and get a more accurate picture.
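In other words, the AUTO/CPU switch presumably reduces to something like the following sketch (the resolve_device helper is illustrative, not the node's actual code; comfy.model_management.get_torch_device() is the real ComfyUI call mentioned above):

```python
import torch
import comfy.model_management

def resolve_device(selection: str) -> torch.device:
    # Illustrative helper: "CPU" forces the CPU, "AUTO" defers to ComfyUI,
    # which on Apple silicon typically resolves to the "mps" device.
    if selection == "CPU":
        return torch.device("cpu")
    return comfy.model_management.get_torch_device()

device = resolve_device("AUTO")
print(device)  # e.g. mps on an M2 Mac, cuda:0 on an NVIDIA machine, otherwise cpu
```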
And thanks again for testing! 🙏
Edited: About the eyes https://github.com/dnl13/ComfyUI-dnl13-seg/discussions/2
I just tried every test in the previous post with the multimask setting on, and it makes no difference. I keep it off because I cannot control the order of the masks (per our previous conversation), and so I have to duplicate the Mask with prompt node many times.
I also tried playing with the box threshold setting a lot more, but it makes no difference. The segment anything configuration with non-HQ models didn't require very sensitive settings. A 0.30 would be perfect to find most things.
Also, the fact that you can recognize the eyes in the second picture of #2 with a SwinB model, and I cannot in my tests above, confirms that there's something strange going on.
Try it yourself and see if you can find the eyes in this image, which I always use as a benchmark. It normally works fine with a 0.30 confidence threshold.
Hey @alessandroperilli
I gave it a try with your image. Yes, it's exactly the same as in my observation in the discussion thread.
At 0.30, three eyes are detected with different confidence levels.
(dnl13-seg) > phrase(confidence): ['eyes(0.49)', 'eyes(0.50)', 'eyes(0.31)']
Here, too, I had to raise the value to 0.32.
This behavior can be observed with all SAM models B/H/L.
With the GroundingDino SwinT_OGC, I have to set the threshold even higher at 0.40 (also tested with all SAM models B/H/L).
I'll now see how I can reintegrate the non-HQ models and then send another push. I'll mention you again here once the push is through (after testing).
What's very interesting is that in terms of functionality, only the dependencies of this script are managed differently than in the segment_anything fork. That's why some errors surprise me. However, as you mentioned, it could also be related to the models. I'll take care of it. Best regards!
That would be OK. I don't mind tweaking the threshold to get individual eyes vs a pair. The problem here is that, on my system, as I previously reported, the whole subject gets identified. Not the eyes.
On my system, the detection is fundamentally broken. With every HQ model.
If you can detect the eyes on your system, all other things being equal, it has to be related to MPS.
Hello @alessandroperilli,
I've made some more changes. Could you please do a git pull and install the new requirements.txt again?
I have now integrated and tested the original non-HQ models.
Additionally, I've set all models to use comfy.model_management.get_torch_device() by default.
As I understand it, these models are initially loaded into the MPS on the Mac. Later, you have the option to either keep them in the MPS automatically or move them to the CPU. I hope this resolves the performance issues. If not, I'll continue to explore solutions.
Also, I believe I've identified the cause of the poor masking on your end. I had set up a "two-pass masking" for the HQ models to improve the masks (which worked on my end). However, this might have caused issues on your Mac. Therefore, I've added an optional switch for the "two-pass" process in the node. Testing this would be much appreciated. However, due to issues with the mask_input of the non-HQ models on Meta's end, I can't introduce a "two-pass" masking for these models yet. I'm currently looking into possible solutions for that.
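For reference, a "two-pass" refinement with a SAM predictor usually means feeding the low-resolution logits from a first prediction back in as mask_input for a second one. A rough sketch of that idea using the standard SamPredictor API (the model, image, and box names are assumed; this is not necessarily the node's exact implementation):

```python
import numpy as np
from segment_anything_hq import SamPredictor  # the same pattern works with segment_anything

predictor = SamPredictor(sam_model)   # sam_model: a loaded SAM-HQ model (assumed)
predictor.set_image(image_rgb)        # image_rgb: HxWx3 uint8 image (assumed)

# Pass 1: predict from the GroundingDino box prompt.
masks, scores, low_res_logits = predictor.predict(box=box_xyxy, multimask_output=False)

# Pass 2: refine by feeding the best low-res logits back as mask_input.
masks, scores, _ = predictor.predict(
    box=box_xyxy,
    mask_input=low_res_logits[np.argmax(scores)][None, :, :],
    multimask_output=False,
)
```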
I'm looking forward to your feedback. Thank you, and best regards!
OK, tests done. Lots to report:
If there was something that the system was supposed to change in terms of packages, it did not.
I manually ran pip install -r requirements.txt, but all requirements were already satisfied. To be extra sure, I deleted the custom node folder and reinstalled it via ComfyUI Manager (Install via Git URL). No difference.
If I leave the switch on AUTO, I simply cannot detect anything remotely resembling the prompt. Whatever happens with the AUTO switch, it completely disrupts the capability to detect the right things on my system.
If I switch to CPU, the system works as intended, but... keep reading about the next two issues.
The SwinB model refuses to detect the right objects. It doesn't work in conjunction with AUTO. It doesn't work in conjunction with CPU. It doesn't work at any lower-bound threshold value. It's useless.
...sam-vit-b model, exactly like in the segment anything node.
The HQ models are completely unreliable, sometimes working and sometimes not. For example: the non-HQ sam-vit-b model can accurately identify the background, while the HQ counterpart (or the Huge version of the HQ models) fails miserably.
In other words, the only configuration that works for me is the following:
It's good enough for me, as it doesn't change the segment anything scenario. However, given all these other anomalies, I would expect a lot of support requests. In many configurations, the system is not behaving as intended.
Okay, I have some news. I just had the opportunity to install ComfyUI on an M2 Mac Mini, and yes, I was able to replicate some of the issues.
So, Apple's MPS is not a viable option. I'm considering implementing a fallback to CPU for MPS until there's some progress on Apple's end or with the models.
Regarding the SwinB model: I couldn't replicate the issues. It runs with both HQ and non-HQ models. Due to time constraints, I've only tested it with the base and HQ-H models, and it worked.
Is it possible that yours got corrupted during installation?
Since I'll have more opportunities to test with the Mac Mini, I'll keep trying to replicate the issues and find solutions.
Not possible, as the node automatically downloaded it yesterday. But, just to double-check, I deleted both the groundingdino_swinb_cogcoor.pth file and the GroundingDINO_SwinB.cfg file, forcing the node to redownload them.
With a non-HQ SAM model, the background is recognized, but not the mouth or the eyes:
With a HQ SAM model, nothing is recognized.
So, I'm looking for hints in the terminal logs.
I see this string just before the node redownloaded SwinB:
final text_encoder_type: bert-base-uncased
Then I see the predictions that the 4 instances of Mask with prompt make as I queue the workflow:
All keys matched successfully
(dnl13-seg) > phrase(confidence): ['eyes(0.57)']
(dnl13-seg) > phrase(confidence): ['background(0.54)']
(dnl13-seg) > phrase(confidence): ['mouth(0.71)']
(dnl13-seg) > phrase(confidence): ['sign(0.64)']
I don't know what else it could be. What version of PyTorch do you use in the ComfyUI venv? I run on the latest nightly build, but that can't be the problem.
Update: I uninstalled the nightly build and now run the stable PyTorch 2.1.1. Nothing has changed in the node's behavior.
I followed the installation instructions as described in the ComfyUI repo, so I also installed the nightly build for MPS support.
python3 -m pip freeze
accelerate==0.24.1
addict==2.4.0
aiohttp==3.9.0
aiosignal==1.3.1
altgraph @ file:///System/Volumes/Data/SWE/Apps/DT/BuildRoots/BuildRoot11/ActiveBuildRoot/Library/Caches/com.apple.xbs/Sources/python3/python3-141/altgraph-0.17.2-py2.py3-none-any.whl
async-timeout==4.0.3
attrs==23.1.0
certifi==2023.11.17
charset-normalizer==3.3.2
contourpy==1.2.0
cycler==0.12.1
einops==0.7.0
filelock==3.13.1
fonttools==4.45.0
frozenlist==1.4.0
fsspec==2023.10.0
future @ file:///System/Volumes/Data/SWE/Apps/DT/BuildRoots/BuildRoot11/ActiveBuildRoot/Library/Caches/com.apple.xbs/Sources/python3/python3-141/future-0.18.2-py3-none-any.whl
groundingdino-py==0.4.0
huggingface-hub==0.19.4
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.1.1
Jinja2==3.1.2
kiwisolver==1.4.5
macholib @ file:///System/Volumes/Data/SWE/Apps/DT/BuildRoots/BuildRoot11/ActiveBuildRoot/Library/Caches/com.apple.xbs/Sources/python3/python3-141/macholib-1.15.2-py2.py3-none-any.whl
MarkupSafe==2.1.3
matplotlib==3.8.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.2.1
numpy==1.26.2
opencv-python==4.7.0.72
packaging==23.2
Pillow==10.1.0
platformdirs==4.0.0
psutil==5.9.6
pycocotools==2.0.7
pyparsing==3.1.1
python-dateutil==2.8.2
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
safetensors==0.4.0
scipy==1.11.4
segment-anything==1.0
segment-anything-hq==0.3
six @ file:///System/Volumes/Data/SWE/Apps/DT/BuildRoots/BuildRoot11/ActiveBuildRoot/Library/Caches/com.apple.xbs/Sources/python3/python3-141/six-1.15.0-py2.py3-none-any.whl
supervision==0.6.0
sympy==1.12
timm==0.9.11
tokenizers==0.15.0
tomli==2.0.1
torch==2.2.0.dev20231122
torchaudio==2.2.0.dev20231122
torchsde==0.2.6
torchvision==0.17.0.dev20231122
tqdm==4.66.1
trampoline==0.1.2
transformers==4.35.2
typing_extensions==4.8.0
urllib3==2.1.0
yapf==0.40.2
yarl==1.9.3
zipp==3.17.0
This is what is currently installed on the machine.
But be careful: this is a fresh, blank installation that only contains my nodes for testing. And it's also running on:
$ python3 --version
Python 3.9.6
But it's working.
Update
I ran the test file as described on the instructions page for Apple silicon:
python3 test.py
tensor([1.], device='mps:0')
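For reference, the MPS verification script from the PyTorch/Apple instructions is essentially the following (reproduced here for convenience; the exact contents of test.py may differ slightly):

```python
import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)   # expected output: tensor([1.], device='mps:0')
else:
    print("MPS device not found.")
```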
After 1h 30min of debugging, I declare defeat.
At this point, our configurations are almost identical. And yet:
I don't really know what else to do.
For the new installations, this message continues to appear in the terminal:
xyz/ComfyUI/venv/lib/python3.11/site-packages/segment_anything_hq/modeling/tiny_vit_sam.py:662: UserWarning: Overwriting tiny_vit_5m_224 in registry with segment_anything_hq.modeling.tiny_vit_sam.tiny_vit_5m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
  return register_model(fn_wrapper)
xyz/ComfyUI/venv/lib/python3.11/site-packages/segment_anything_hq/modeling/tiny_vit_sam.py:662: UserWarning: Overwriting tiny_vit_11m_224 in registry with segment_anything_hq.modeling.tiny_vit_sam.tiny_vit_11m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
  return register_model(fn_wrapper)
xyz/ComfyUI/venv/lib/python3.11/site-packages/segment_anything_hq/modeling/tiny_vit_sam.py:662: UserWarning: Overwriting tiny_vit_21m_224 in registry with segment_anything_hq.modeling.tiny_vit_sam.tiny_vit_21m_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
  return register_model(fn_wrapper)
xyz/ComfyUI/venv/lib/python3.11/site-packages/segment_anything_hq/modeling/tiny_vit_sam.py:662: UserWarning: Overwriting tiny_vit_21m_384 in registry with segment_anything_hq.modeling.tiny_vit_sam.tiny_vit_21m_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
  return register_model(fn_wrapper)
xyz/ComfyUI/venv/lib/python3.11/site-packages/segment_anything_hq/modeling/tiny_vit_sam.py:662: UserWarning: Overwriting tiny_vit_21m_512 in registry with segment_anything_hq.modeling.tiny_vit_sam.tiny_vit_21m_512. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
  return register_model(fn_wrapper)
It never appears in my ordinary ComfyUI installation.
Okay, that's very interesting! It seems like I'll need another attempt to get it working. I'll test all of this in a different branch. Apologies for any inconvenience; I'm truly grateful for the testing. I've noticed some inconsistencies between the git and pip installations in the issues on the SAM_HQ repository, and the same goes for GroundingDino. It seems like the pip and git repositories are diverging in versions. I'll take a closer look at that. I was hoping to streamline the code in this suite by using the pip versions. I'll compare the code and see what needs to be adjusted. It surprised me that in the segment_anything nodes, everything is included in the repo even though there are pip packages. Once again, thank you so much for the tests; it helps me a lot in establishing a stable project here.
Just to be sure we exclude every scenario: can you share the flags you used to run ComfyUI on your macOS? And could you run printenv once inside the venv, too? Thank you.
Okay, now I've encountered the same issue on Windows as well.
There seems to be something peculiar with torchvision and T.RandomResize([800], max_size=1333) in the GroundingDino inference and transform scripts.
The images do not seem to be presented correctly to the model, which is why it cannot recognize anything. I am working on it now.
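For context, the preprocessing around that call in GroundingDino's reference code looks roughly like this (a sketch based on the upstream repository; the node's own version, and the input file name used here, may differ):

```python
from PIL import Image
import groundingdino.datasets.transforms as T

# GroundingDino's usual preprocessing: resize the short side to 800 px
# (long side capped at 1333), convert to a tensor, normalize with ImageNet stats.
transform = T.Compose([
    T.RandomResize([800], max_size=1333),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image_source = Image.open("input.png").convert("RGB")
image_tensor, _ = transform(image_source, None)   # second argument is an optional target dict
```

If a torchvision version change alters how these transforms behave, the model would indeed see a badly scaled or badly normalized image and fail to detect anything.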
I've also made sure to update everything on the Windows machine. I noticed that in my development folder, I was a few versions behind.
Now I need to compare the different implementations and see which one works well with torchvision 0.16. Please bear with me for a little longer.
Your fork of the segment anything node was working flawlessly. This new node, instead, refuses to work in the exact same workflow:
The full log is:
I thought the problem could be that this node is stricter and wants your SAM Loader rather than third-party ones. Hence, I opened #4.
I also thought the problem could be related to #5.
Other than not working at all, this node looks very promising with many features that I really hoped to see!