cBio / cbio-cluster

MSKCC cBio cluster documentation

Matlab license error on HAL? #339

Open julielyang opened 8 years ago

julielyang commented 8 years ago

Hi,

I tried batch running matlab with this call:

/opt/matlab/R2013a/bin/matlab < ./code/AffinityRegression/RunPvaluesNullDistribution/GeneratePvalues/runGeneratePvaluesAnalysis4.m

I got this license checkout error that I am posting below.

Then I just ran my code again and it ran. There are times my code works and times where I get the licensing error below.

My guess is that there was another node running the license I was trying to use. Could you help me figure out what this error is and how to run matlab every time without getting the license error?

Julie

License checkout failed.
License Manager Error -9
This error may occur when:
-The hostid of this computer does not match the hostid in the license file.
-A Designated Computer installation is in use by another user.
If no other user is currently running MATLAB, you may need to activate.

Troubleshoot this issue by visiting: http://www.mathworks.com/support/lme/R2013a/9

Diagnostic Information:
Feature: MATLAB
License path: /cbio/cllab/home/jly/.matlab/R2013a_licenses:/opt/matlab/R2013a/licenses/license.dat:/opt/matlab/R2013a/licenses/*.lic
Licensing error: -9,57.

Warning: No display specified. You will not be able to display graphics on the screen.
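
For batch runs like the one above, it can help to tell a license checkout failure apart from an error in the MATLAB code itself, for example to decide whether resubmitting is worthwhile. The following is a minimal sketch, assuming the job's output has been captured as text; the function names are hypothetical, and the patterns are taken only from the message quoted above.

```python
import re

# Matches the diagnostic line from the message above, e.g. "Licensing error: -9,57."
LICENSING_ERROR_RE = re.compile(r"Licensing error:\s*(-?\d+),(-?\d+)")


def is_checkout_failure(output):
    """True if the captured output reports a failed license checkout."""
    return "License checkout failed" in output


def parse_licensing_error(output):
    """Return the (major, minor) error codes, or None if no such line exists."""
    match = LICENSING_ERROR_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "License checkout failed.\n"
    "License Manager Error -9\n"
    "Licensing error: -9,57.\n"
)
print(is_checkout_failure(sample), parse_licensing_error(sample))
```

A wrapper script could use these checks to resubmit only when the failure is license-related, rather than retrying on every non-zero exit.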

akahles commented 8 years ago

The node licenses are bound to the node. That is, at most one user can use matlab on a given node at the same time. So if somebody is already using matlab on the node, you would get the error above.

There is a small script available to tell you who is using matlab on which node:

/cbio/shared/software/tools/torque/q-lic

However, this is not to be used for a regular high-throughput query as it ssh's into the nodes to find out.

In the long term, the better solution would be for the scheduler to know who is using which license on which node and to assign jobs accordingly. I think there was a previous issue about this, but it turned out that this is not quite trivial to achieve.

tatarsky commented 8 years ago

The above is correct. The node locked license/scheduler integration was looking very difficult. The request remains but I've not had time to delve into it and may not be able to actually solve it. #193

tatarsky commented 8 years ago

And actually the better solution is floating licenses. Not node locked.

akahles commented 8 years ago

That is true, but this comes with a price tag :)

kuod commented 8 years ago

IIRC, we have 5 floating licenses on the head node. Also, another idea is to see if your script can run in the open source software octave which is not bound by price tag or licenses.

tatarsky commented 8 years ago

I thought those were for toolkits (the floating ones).

tatarsky commented 8 years ago

Yes, and I know it comes with a price. I'm looking at my assumptions for the integration of the scheduler and the licenses however again as time has passed and I'm older and perhaps wiser now.

tatarsky commented 8 years ago

So I believe we do handle the floating ones with Moab. At least there are some comments to that effect. But I don't think that helps us with the node ones. Perhaps some morning we could review the history a notch on this. It's been a while.

akahles commented 8 years ago

Sure - I mostly moved from Matlab to Python by now (yay) but am happy to help unrolling the history of this.

kuod commented 8 years ago

We have a bit of everything in terms of the toolkits and they're distributed across the nodes. I'll update with more information off the issue tracker. MATLAB usage on the head node is do-able but I would caution against potentially running jobs that bring the cluster head node to a screeching halt.

tatarsky commented 8 years ago

Confused by the words "head node" and "cluster" in the same sentence ;) I assume @julielyang wants to run Matlab on a node.

tatarsky commented 8 years ago

Agree that a brief verbal unrolling might be useful to recall. It's been pretty much the same config as far as I've ever known.

julielyang commented 8 years ago

Hi,

I would like to understand this better.

Can two users be assigned to the same node and the first user already be using the matlab license on that node?

Can this be solved by requesting a node that is free of all other users?

Julie

akahles commented 8 years ago

This is correct - as the scheduler does not know about the licenses, the property matlab only ensures that your job will end up on a node that can run MATLAB but not that nobody else is using MATLAB there already.

You do not necessarily need a node without any other user on it; it is sufficient that no other user is running MATLAB there.

tatarsky commented 8 years ago

And the part that is difficult to integrate with node locked licenses is "but not that nobody else is using MATLAB there already." I am looking at one item regarding this again. I can't guarantee it will work or that I'm going to try it rapidly, but I am looking at it.

julielyang commented 8 years ago

Okay I understand.

But since I can't guarantee a user is not using matlab on my node can I instead require a stricter restriction that no other user be on the node?

akahles commented 8 years ago

This would be quite inefficient, but you could just request all cores on that node: -l nodes=1:ppn=24

julielyang commented 8 years ago

Ah okay. Yes, it is not an efficient solution.

julielyang commented 8 years ago

Thank you for letting me know. This was really helpful~!

akahles commented 8 years ago

If you don't have a large number of jobs, you could also request an interactive session on a full node and then start your matlab processes in background all on the same node - this would be more efficient, but has slight overhead.

kuod commented 8 years ago

Don't forget to include -l nodes=1:ppn=24:matlab in your Torque submission. Otherwise, you may get a node that does not even have a matlab license.
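
The resource request above can be assembled programmatically when submitting many jobs. This is a minimal sketch, assuming 24 cores per node as in the example above; the helper names and the script name `run_matlab.sh` are hypothetical, and nothing is actually submitted.

```python
def torque_node_spec(nodes=1, ppn=24, properties=("matlab",)):
    """Build the value for qsub's -l option, e.g. 'nodes=1:ppn=24:matlab'."""
    parts = ["nodes={}".format(nodes), "ppn={}".format(ppn)]
    parts.extend(properties)  # node properties such as 'matlab' are appended last
    return ":".join(parts)


def qsub_argv(script, **kwargs):
    """Return an argv list for qsub; hand it to subprocess.run to submit."""
    return ["qsub", "-l", torque_node_spec(**kwargs), script]


# run_matlab.sh is a placeholder name for the user's own submit script.
print(" ".join(qsub_argv("run_matlab.sh")))
```

Keeping the `matlab` property in the default arguments makes it harder to forget, which is the failure mode described above.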

tatarsky commented 8 years ago

Yeah, before we go down that road I'd like to see if I can try something. I might be able to set a "matlabfree" property based on @akahles script which basically is a pgrep MATLAB if I'm reading it. It's not quite the same as the original goal of #193 which if I recall was to try to assist a person that already had a Matlab license job (on a specific node) to be able to get other jobs to go there.

Can @kuod confirm gpu-1-4 is a matlab node lock machine? I believe so....

julielyang commented 8 years ago

Yes, okay.

kuod commented 8 years ago

@tatarsky I can confirm from my notes that gpu-1-4 has a node-locked license.

akahles commented 8 years ago

I just want to confirm the original goal of #193 was as described. The purpose being to not scatter many jobs of a single user over an array of nodes and all lock them for other users.

tatarsky commented 8 years ago

I may have to wait until such a machine can be offlined as I'm not clear on my idea's impact on running jobs. This is where I wish I had a test environment. @julielyang relatively how critical would you rank this, just so I can decide the best way forward? I think I can add a dynamic property that at least flags "matlabfree" for a node that also has property "matlab".

Yeah @akahles it's that user part that was making #193 very difficult.

tatarsky commented 8 years ago

And the "freedom" of matlab would be based on the process table. I do not want to turn the property "matlab" into a consumable as that then defeats the desire I know people have to stack multiple matlab jobs on the same node as the same user.
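
The process-table idea can be sketched as follows. This is only an illustration, not the script mentioned in this thread: it assumes the MATLAB process is named `MATLAB` (as the pgrep remark suggests) and that the input is text with one "user command" pair per line, such as the output of `ps -eo user=,comm=`. All function names are hypothetical.

```python
def matlab_users(ps_output):
    """Return the set of users that own a process named MATLAB."""
    users = set()
    for line in ps_output.splitlines():
        fields = line.split(None, 1)  # split into user and command name
        if len(fields) == 2 and fields[1].strip() == "MATLAB":
            users.add(fields[0])
    return users


def is_matlab_free(ps_output):
    """True when no MATLAB process appears in the process table."""
    return not matlab_users(ps_output)


def can_start_matlab(user, ps_output):
    """A user may start MATLAB if the node is free or already theirs."""
    return matlab_users(ps_output) <= {user}


sample = "root sshd\npaul MATLAB\njly python\n"
print(matlab_users(sample), is_matlab_free(sample))
```

The `can_start_matlab` check reflects the point made above: the same user stacking several MATLAB jobs on one node is fine, so the license is not treated as a simple consumable.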

julielyang commented 8 years ago

Sure. My scripts are running on license free nodes so I will not even be using the inefficient suggested solution of obtaining a user-free node.

I will look forward to your solution. This can also be low priority on your list of to-dos because my jobs are already running, so I don't have a pressing need for a solution. Thanks so much for your help.

Julie Yang

tatarsky commented 8 years ago

Noted. While what I may do will be a bit of a hack it might speed that selection process.

tatarsky commented 8 years ago

I tried a few things over the weekend but it didn't work as I hoped. I've opened a ticket to see if Adaptive has any ideas.

julielyang commented 8 years ago

Okay, thank you.

tatarsky commented 8 years ago

Some experiments in this regard are being included in efforts to bring up a new head node. Status only. Nothing end user facing at this time.

tatarsky commented 8 years ago

This issue will be further experimented with as part of #349

tatarsky commented 8 years ago

So while I can't fully explain all the paths I have walked down to get one simple resource added as a test example, I was able to finally figure out on the test Moab server how to add at least a script generated resource. The trick is now getting the proper data from a node as Moab executes these extension scripts on the scheduler system, not the node. (Yes, we can have it do stuff like ssh but I want it to be efficient).

So this is a placeholder to note that I've made some progress on this, but I am still working on it on the test Moab system.

akahles commented 8 years ago

Could we just use a file to log this information somewhere centrally? So it would be easy to look up. But maybe I am also missing the point here ...

tatarsky commented 8 years ago

Basically if I understand what you folks want for any system running MATLAB node locked we need a dynamic resource I'm calling "MATLABUSER" which is set to the username of the person running Matlab already on the node. Allowing you to then submit additional jobs stating that as a requirement.

The information that goes into that resource on the node is from the process table and will probably end up as some kind of file-based lookup. But you have to be very careful when extending a scheduler that is doing lots of different things with some kind of lookup....

Per what I've found out so far.

akahles commented 8 years ago

Ok, I see. That was basically what I was suggesting. But I see the danger of adding to an already quite complex system ...

tatarsky commented 8 years ago

Does anyone have an example of requesting one of the shared matlab licenses from qsub? Or do I misunderstand that part. I don't mean the :matlab item which is a node property. I'm looking for an example of these resources I see globally defined:

  License Bioinformatics_Toolbox    2 of   2 available  (Idle: 100.00%  Active: 0.00%)
  License Compiler                  2 of   2 available  (Idle: 100.00%  Active: 0.00%)
  License Image_Toolbox             3 of   3 available  (Idle: 100.00%  Active: 0.00%)
  License Optimization_Toolbox      3 of   3 available  (Idle: 100.00%  Active: 0.00%)
  License Statistics_Toolbox        3 of   3 available  (Idle: 100.00%  Active: 0.00%)

I'd like to make sure I fully understand the syntax by which a license is requested from qsub or the submit file.

tatarsky commented 8 years ago

OK. So I think the direction I'm going here is using a Moab nodeset, which I'm still learning, so I may not have all the parts right.

But it appears I can dynamically assign what are called variable attributes to nodes via Moab, based on a script's output. Basically, in this case, some kind of polling for which users are running MATLAB on the nodes.

Then, what this looks like from the qsub point of view is Moab places the variable attribute like this along with some of the static ones:

checknode gpu-3-8

Attributes:         Memory=1024,Processors=1,batch,gtx680,matlabuser=paul,nv352

Selecting those nodes, however, requires a slightly different syntax. From my findings, I believe this is because a "varattr" is used rather than a feature, which can't have a "name=value" pair (I think).

This appears to be the incantation to say "run my job on systems with this variable attribute".

qsub -l nodeset=FIRSTOF:VARATTR:matlabuser=paul (some script)

FIRSTOF can also be other values.

http://docs.adaptivecomputing.com/mwm/Content/topics/optimization/nodesetoverview.html

Initial tests on a small scale using a fake "matlabuser" assignment appear to function. I'm trying to decide the rate of polling and actual mechanics of that and how to prevent hung or slow nodes from delaying things. I will probably separate the gathering of the matlabuser from Moab itself and the Moab part will just read a file.
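
The "Moab part will just read a file" idea could look roughly like the sketch below. The file format is an assumption made for illustration only (one `node user epoch_seconds` entry per line, written by a separate collector), as are the function name and the 300-second freshness window.

```python
import time


def read_matlab_users(lines, max_age=300, now=None):
    """Map node name -> MATLAB user from 'node user epoch_seconds' lines.

    Entries older than max_age seconds are dropped, so a hung collector
    or a slow node yields a missing entry instead of stale data.
    """
    now = time.time() if now is None else now
    result = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        node, user, stamp = fields
        try:
            age = now - float(stamp)
        except ValueError:
            continue  # timestamp is not a number
        if 0 <= age <= max_age:
            result[node] = user
    return result


lines = ["gpu-3-8 paul 1000", "gpu-3-9 jly 100", "garbage"]
print(read_matlab_users(lines, now=1100))
```

Because the reader only parses text, it never blocks on a node; any slowness stays in the collector, which is the separation described above.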

tatarsky commented 8 years ago

The only issue I see in this approach is that you may have to "seed" the process by getting the node-locked license with one qsub first, and then follow it up with a qsub requesting that your other jobs be placed on that node because of it. I don't see a clear way around that at the moment, but I suspect you are already doing this.

I may attempt to convert this code over to the main scheduler next week as I don't have any Matlab nodes in the small test environment (gpu-3-8 and gpu-3-9)
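
The two-step "seed, then follow up" submission described above might be scripted as in this sketch. It only builds the command lines, using the `:matlab` property and the `nodeset=FIRSTOF:VARATTR:matlabuser=...` form quoted in this thread; the function names, `ppn=1`, and the script names are assumptions, and the nodeset syntax itself was still experimental at this point.

```python
def seed_command(script, ppn=1):
    """First job: land on any node that carries the 'matlab' property."""
    return ["qsub", "-l", "nodes=1:ppn={}:matlab".format(ppn), script]


def follow_up_command(script, user, policy="FIRSTOF"):
    """Later jobs: request nodes whose matlabuser attribute is this user."""
    nodeset = "nodeset={}:VARATTR:matlabuser={}".format(policy, user)
    return ["qsub", "-l", nodeset, script]


# seed.sh and work.sh are placeholder names for the user's own scripts.
print(" ".join(seed_command("seed.sh")))
print(" ".join(follow_up_command("work.sh", "paul")))
```

The follow-up jobs would only match once the polling has picked up the seed job's MATLAB process, so some delay between the two steps should be expected.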

tatarsky commented 8 years ago

Can somebody remind me where the actual Matlab node locked license lives? I see out there a blend between:

 /opt/matlab/R2013a/licenses/license.lic
 /opt/matlab/R2013a/licenses/network.lic

I am also preparing for an automatic /opt area sync and I believe these licenses are unique and located in those trees and thus would require exceptions.

Working on a final config attempt for this process. Sorry for all the updates.

kuod commented 8 years ago

On the compute nodes there should exist a .dat file in this folder

/opt/matlab/R2013a/licenses/

IIRC, the suffixes may not all be .dat but effectively, they contain the information for the licensing server to allow the start of Matlab.

ChayaSt commented 8 years ago

I ran into the same problem today when I tried using Matlab on hal. Has this been sorted out?

tatarsky commented 8 years ago

I feel the best solution is floating licenses. While I have a partial implementation of what I described above I wasn't overly impressed by its real usability. I can attempt some degree of it but the floating licenses path would be the more supportable one. So I'm going to ask @juanperin what he would like me to do as in the end my support tasks will be migrated to others.

jchodera commented 8 years ago

I seem to recall that MSK now has a site license. A possible solution would be to use the MSK license server.

edingtoj commented 8 years ago

If the Matlab license on hal has been maintained we could update. But in order to access the MSK site license hal needs to move inside the MSK network.

jchodera commented 8 years ago

Networking can open a hole in the firewall just for specific ports and the specific MSK license server. This would presumably be dealt with via a firewall port open request specifying specific IP source range (the nodes) and a specific IP destination (the license server) with a specific range of ports for the license server. This would be subject to review by InfoSec, but the risk should be minimal due to the restricted nature and that both systems are behind firewalls.

I don't know the current status of the software licenses on hal. This has been under the aegis of the HPC core for nearly a year now, right?

juanperin commented 8 years ago

We asked about opening a port. It wasn't done because they wouldn't have been able to support too many license requests at once. The number of licenses available on the institutional license is limited. I believe it's for 40 users, so they were OK giving us access on saba, considering that only a few nodes and users would possibly be consumed at one time, but any larger set of requests would exceed their capacity.

We'll find an appropriate option as soon as possible to enable preferably floating licenses.