codalab / codalab-worksheets

A collaborative platform for reproducible research (web interface and CLI).

[No queue: nlp requested] error happening intermittently when running nlp queue with Stanford workers #4405

Open AndrewJGaut opened 1 year ago

AndrewJGaut commented 1 year ago

I can't reproduce this anymore. I believe it's working correctly.

AndrewJGaut commented 1 year ago

Reopening this: we've identified an issue that occurs with bundles run on worksheets that are not publicly readable.

For instance, consider the following steps:

image

Here's me executing those commands:

image

Now, the weird thing is, I can't reproduce this locally! When I do something similar (with private worksheet test-ws) I get the following:

image

We see that the staged bundle on the private worksheet does show up in the search output! The only difference I've been able to identify thus far is that the staged_status fields differ. On my local instance, I don't get the `No queue (nlp requested)` issue; see here:

image

This causes the issue we see from the worker-manager because the worker-manager finds work through a search call equivalent to the one the client makes for `cl search .mine state=staged`. Any bundle that doesn't show up in that output is never picked up by the slurm worker-manager, so it remains in staged forever.
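To make the failure mode concrete, here is a rough Python sketch of the dependency described above. It is not the actual worker-manager code: `Bundle`, `search_mine_staged`, `poll_once`, and the `drops_private` switch are all hypothetical names, and the switch only mimics the symptom seen on prod (the root cause is still unknown).

```python
from dataclasses import dataclass


@dataclass
class Bundle:
    uuid: str
    state: str
    owner_id: str
    on_private_worksheet: bool


def search_mine_staged(bundles, user_id, drops_private):
    """Stand-in for `cl search .mine state=staged`.

    `drops_private` is a hypothetical switch that imitates the prod symptom:
    staged bundles on private worksheets are missing from the results.
    """
    return [
        b for b in bundles
        if b.owner_id == user_id
        and b.state == "staged"
        and not (drops_private and b.on_private_worksheet)
    ]


def poll_once(bundles, user_id, drops_private):
    """One worker-manager iteration: only bundles returned by the search get a worker."""
    picked = search_mine_staged(bundles, user_id, drops_private)
    for b in picked:
        b.state = "running"  # simplification: assume a worker starts immediately
    return [b.uuid for b in picked]


bundles = [
    Bundle("0xpublic", "staged", "user-1", on_private_worksheet=False),
    Bundle("0xprivate", "staged", "user-1", on_private_worksheet=True),
]
started = poll_once(bundles, "user-1", drops_private=True)
stuck = [b.uuid for b in bundles if b.state == "staged"]
print("started:", started)
print("still staged:", stuck)
```

With `drops_private=True` the bundle on the private worksheet is never returned, so no later poll can start it. With `drops_private=False` (the behaviour seen locally) both bundles are picked up.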

Note: we have verified that the slurm worker-manager and cl search work fine with bundles on public worksheets.

The question: why are staged bundles on private worksheets not being picked up by cl search on prod, even though they are when I run the same commands locally? I'm investigating further...

AndrewJGaut commented 1 year ago

Moreover, I don't see anything amiss in the query or the database entry.

The query looks as follows:

image

Here's what the bundle that's staged in the private worksheet looks like in the prod database:

image

And here's what my user ID is (`cl info -f id`):

image

AndrewJGaut commented 1 year ago

Try doing this as root on the main instance and see if it works. If so, try creating a non-root account locally and then reproduce.