Inference Stuck - GitHub Issues

djFatNerd commented 7 months ago

Hi, I am trying the inference code and the process gets stuck here forever. May I ask what the possible cause might be?

utk-hua commented 7 months ago

I'm also having the same problem. Did you find a fix?

sunfanyunn commented 7 months ago

Could you interrupt the program and then paste the resulting stack trace here?
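
When a hung Python process is hard to interrupt cleanly, the standard-library faulthandler module can dump the stack of every thread. The sketch below is only an illustration of that approach, not part of Holodeck; the helper names dump_all_stacks and enable_on_demand_dumps are hypothetical.

```python
import faulthandler
import os
import signal
import sys
import tempfile


def dump_all_stacks(path: str) -> str:
    """Write the current stack of every thread to `path` and return the text."""
    with open(path, "w") as log:
        # faulthandler writes straight to the file descriptor of `log`.
        faulthandler.dump_traceback(file=log, all_threads=True)
    with open(path) as log:
        return log.read()


def enable_on_demand_dumps() -> bool:
    """Dump all stacks to stderr whenever the process receives SIGUSR1.

    Returns False on platforms without SIGUSR1 or faulthandler.register
    (for example Windows).
    """
    if not hasattr(signal, "SIGUSR1") or not hasattr(faulthandler, "register"):
        return False
    faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
    return True


# Demonstration: capture the stacks of this process into a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    trace_text = dump_all_stacks(os.path.join(tmp_dir, "stacks.txt"))
```

If enable_on_demand_dumps() is called at startup on Linux, sending `kill -USR1 <pid>` from another terminal prints the stacks without stopping the program.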

djFatNerd commented 7 months ago

Sure, this is the stack trace after I pause the program:

Thank you!

djFatNerd commented 7 months ago

Turning off multiprocessing makes the process continue, but the "results" variable, which contains the object assignments, is always empty.

djFatNerd commented 7 months ago

I'm also having the same problem. Did you find a fix?

No, I didn't. I suspect this is an error with the multiprocessing.

djFatNerd commented 7 months ago

This is the "results":

and the following is the error trace:

The .json file for the generated input string can't be loaded.

gemcollector commented 7 months ago

I also met the same issue. Not sure how to fix this.

YueYANG1996 commented 7 months ago

The error message above shows that this error is from the GPT-4 side. GPT-4 did not follow the instructions and returned a JSON with the wrong format. Rerunning the code could resolve the issue. By the way, which version of GPT-4 are you using?
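
Since the failure described here is a reply that does not parse as JSON, the manual rerun can in principle be automated with a small retry loop. This is a minimal sketch under assumptions: query_json_with_retry and the ask_model callable are hypothetical names, and the stubbed replies stand in for a real model call; none of this is taken from the Holodeck code.

```python
import json
from typing import Callable, Optional


def query_json_with_retry(ask_model: Callable[[], str],
                          max_attempts: int = 3) -> Optional[dict]:
    """Call `ask_model` until its reply parses as a JSON object, or give up."""
    for attempt in range(1, max_attempts + 1):
        reply = ask_model()
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError as err:
            print(f"attempt {attempt}: reply was not valid JSON ({err.msg})")
            continue
        if isinstance(parsed, dict):
            return parsed
        print(f"attempt {attempt}: expected a JSON object, "
              f"got {type(parsed).__name__}")
    return None  # every attempt failed; the caller decides what to do next


# Stub standing in for the model: the first reply is malformed, the second is fine.
fake_replies = iter(["Sure! Here is the plan: {sofa: 1}", '{"sofa": 1}'])
result = query_json_with_retry(lambda: next(fake_replies))
```

Returning None instead of raising keeps the decision (abort, skip the object, or ask again with a different prompt) with the caller.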

YueYANG1996 commented 7 months ago

Hi, I am trying the inference code and the process gets stuck here forever. May I ask what the possible cause might be?

For this problem, depending on the room size, this step can take several minutes.

0010SS commented 7 months ago

Turning off multiprocessing makes the process continue, but the "results" variable, which contains the object assignments, is always empty.

I met this issue, and a simple rerun solved my issue.

djFatNerd commented 7 months ago

The error message above shows that this error is from the GPT-4 side. GPT-4 did not follow the instructions and returned a JSON with the wrong format. Rerunning the code could resolve the issue. By the way, which version of GPT-4 are you using?

Hi, I am using both 'gpt-4-1106-preview' and 'gpt-3.5-turbo' as I noticed they are the defaults in holodeck.py.

I initially used only gpt-3.5 since it's more cost-friendly. I noticed gpt-3.5 almost never gives an appropriate response in the correct format, while gpt-4 has a much higher success rate.

However, the stuck problem still exists when multiprocessing = True in object_selector.py: the process just stays stuck forever, but if I set it to False the code proceeds.

Currently, after I proceed with gpt-4 and multi_processing = False, I am having the following issue:

I am using Ubuntu 18.04.

YueYANG1996 commented 7 months ago

Please use gpt-4-1106-preview for all the modules since gpt-3.5-turbo cannot follow the prompt well and will consistently make mistakes.

For running holodeck on a headless server, please refer to this: https://github.com/allenai/Holodeck/issues/6

alvin528 commented 7 months ago

Same issue encountered, no code changes made.

YueYANG1996 commented 7 months ago

Same issue encountered, no code changes made.

Which issue? Could you rerun the code?

alvin528 commented 6 months ago

Still unresolved, @djFatNerd , have you resolved this issue? My situation is the same as yours.

windandair commented 6 months ago

@YueYANG1996 I am encountering the same issue here. Following the responses in the above issue and using logging, it seems that the blockage occurs in the encode_text method of CLIP, called by the retrieve function in ObjaverseRetriever. Strangely, in previous runs this function was invoked successfully and returned the expected results. Here are the inputs for the successful calls and the last one that resulted in the blockage. I haven't made any code modifications, and I am currently puzzled about how to resolve this issue. The following image provides the context for the encountered issue.
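
Locating a hang with logging, as described above, usually comes down to bracketing the suspect call with an entry and an exit message: an "enter" line with no matching "exit" line points at the call that never returned. The following is a generic sketch of that idea; log_calls and fake_encode_text are hypothetical names, and the stand-in function does not reproduce CLIP's encode_text.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
logger = logging.getLogger("hang_probe")


def log_calls(func):
    """Log when `func` is entered and when it returns or raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("enter %s", func.__name__)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions, but never on a hang.
            elapsed = time.perf_counter() - start
            logger.info("exit %s after %.3fs", func.__name__, elapsed)
    return wrapper


@log_calls
def fake_encode_text(text: str) -> list:
    """Hypothetical stand-in for an expensive text-encoding call."""
    return [float(len(word)) for word in text.split()]


features = fake_encode_text("a wooden chair")
```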

djFatNerd commented 6 months ago

Still unresolved, @djFatNerd , have you resolved this issue? My situation is the same as yours.

Hi, which issue are you referring to?

alvin528 commented 6 months ago

The stuck problem exists when multiprocessing = True in object_selector.py: the process is stuck forever. When multi_processing = False, I am having the same dependency issue as yours.

djFatNerd commented 6 months ago

Yes, I can't solve the stuck problem with multi-processing=True either.

For the dependency issue, I think it's a problem with libc6 on Ubuntu 18.04. I haven't found any solution either.

astaikos316 commented 6 months ago

When I set multiprocessing to false on an Ubuntu machine, I get a "Tensor expected on one processor however tensors found on multiple processors GPU:0 and CPU" error.

djFatNerd commented 6 months ago

When I set multiprocessing to false on an Ubuntu machine, I get a "Tensor expected on one processor however tensors found on multiple processors GPU:0 and CPU" error.

Yes, this happened to me. You need to manually move the tensors to the same device during the similarity calculations.

astaikos316 commented 6 months ago

Can you share how you did that? I tried to move them manually, but it did not work for me.

djFatNerd commented 6 months ago

Sure.

In objaverse_retriever.py, just move the variables clip_similarities and sbert_similarities to the device of the variable query_feature_sbert.

The following code works for me:

clip_similarities = clip_similarities.to(query_feature_sbert.device)
sbert_similarities = query_feature_sbert @ self.sbert_features.T.to(query_feature_sbert.device)
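
For readers who want to try the device fix in isolation, here is a self-contained toy version of the same pattern. It is a sketch under assumptions: the function name combined_similarity, the tensor shapes, and the final sum of the two scores are illustrative only and are not taken from objaverse_retriever.py.

```python
import torch


def combined_similarity(query_clip, clip_features, query_sbert, sbert_features):
    """Return CLIP plus SBERT similarity scores on the SBERT query's device."""
    device = query_sbert.device
    # The CLIP scores may have been computed elsewhere, so move the result.
    clip_similarities = (query_clip @ clip_features.T).to(device)
    # Move the stored features to the query's device before the matmul.
    sbert_similarities = query_sbert @ sbert_features.T.to(device)
    # Summing the two scores is an assumption made for this illustration.
    return clip_similarities + sbert_similarities


# Toy data (assumed shapes): 1 query, 5 candidate assets.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
query_clip = torch.randn(1, 8)                   # left on the CPU
clip_features = torch.randn(5, 8)                # left on the CPU
query_sbert = torch.randn(1, 4, device=device)   # on the GPU when one exists
sbert_features = torch.randn(5, 4)               # stored on the CPU
scores = combined_similarity(query_clip, clip_features, query_sbert, sbert_features)
```

On a machine without a GPU everything stays on the CPU and the .to(device) calls are no-ops, so the same code runs in both settings.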

djFatNerd commented 6 months ago

The stuck problem exists when multiprocessing = True in object_selector.py: the process is stuck forever. When multi_processing = False, I am having the same dependency issue as yours.

Over the past few days I read many discussions about libc6 on Ubuntu 18.04 and tried the suggested fixes, but none of them solved the problem.

I had to solve this by switching to Ubuntu 20.04.

Dan5Playground commented 6 months ago

Hi @djFatNerd, does everything work for you after switching to Ubuntu 20.04? I'm getting "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" when I use Ubuntu 20.04, set multiprocessing to false, and fix the tensor device. Thanks.

astaikos316 commented 6 months ago

I am using Ubuntu 22.04 and getting the same CUDA error now after moving the tensors. I have set multiprocessing to false.

djFatNerd commented 6 months ago

Hi @djFatNerd, does everything work for you after switching to Ubuntu 20.04? I'm getting "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" when I use Ubuntu 20.04, set multiprocessing to false, and fix the tensor device. Thanks.

Yes, everything works for me now.

This is a separate issue. You can solve it by adding this line of code at the beginning of main.py:

torch.multiprocessing.set_start_method('spawn')
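
One practical detail when applying the line above: the start method can normally be set only once per process, and with 'spawn' the child processes re-import the main module, so the call belongs under an `if __name__ == "__main__":` guard. The sketch below shows a defensive variant using the standard-library multiprocessing module, which torch.multiprocessing wraps; the helper name use_spawn_start_method is hypothetical.

```python
import multiprocessing as mp


def use_spawn_start_method() -> str:
    """Make 'spawn' the start method and return the method now in effect."""
    # allow_none=True reports None when no method has been chosen yet,
    # without fixing the platform default as a side effect.
    current = mp.get_start_method(allow_none=True)
    if current != "spawn":
        # force=True replaces a start method that was already selected.
        mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    # Configure the start method before any pool or process is created.
    print(use_spawn_start_method())
```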

allenai / Holodeck

Inference Stuck #3