GMUEClab / ecj

ECJ Evolutionary Computation Toolkit
http://cs.gmu.edu/~eclab/projects/ecj/

Parallel Processes: java.io.EOFException #80

Closed ZvikaZ closed 2 years ago

ZvikaZ commented 3 years ago

I'm trying to use parallel processes, as described in chapter 6 of the manual. The slave errors with:

FATAL ERROR:
Unable to read the subpop number from the master:
java.io.EOFException
FATAL ERROR:
Unable to read type of evaluation from master.  Maybe the master closed its socket and exited?:
ec.util.Output$OutputExitException:
Exception in thread "MainThread: " ec.util.Output$OutputExitException:
        at ec.util.Output.exitWithError(Output.java:133)
        at ec.util.Output.fatal(Output.java:544)
        at ec.eval.Slave.main(Slave.java:523)

The master shows:

 Slave attempts to connect.
Slave /127.0.0.1/0 connected successfully.
FATAL ERROR:
There are no individuals with a valid fitness (that is, with their evaluated set); Cannot compute best-so-far statistics

In the master's params file, I added:

# parallel eval processing (master/slave)
eval.masterproblem = ec.eval.MasterProblem
eval.master.port = 15000
eval.masterproblem.job-size = 1
eval.masterproblem.max-jobs-per-slave = 1
eval.compression = false
evalthreads = 1

and the slave params file is:

parent.0 = sample.params
eval.master.host = localhost
eval.return-inds = false

If you want to run it yourself, the problem is demonstrated at my sample ecj repo: https://github.com/ZvikaZ/ECJ-sample , the branch PARALLEL.

I have seen the problem on both Linux and Windows machines. At first I used two different Linux machines, but then I changed the slave to run on the same machine and connect to localhost, to eliminate "noise".

BTW, it's interesting to note that initially I tried with a more complex repo (https://github.com/ZvikaZ/BPGP-wumpus, branch MASTER_SLAVE), and it failed similarly on the slave side. However, it was worse there, because the master process didn't recognize that there was an error.

eclab commented 3 years ago

Running on MacOS High Sierra (thus Java 8)

  1. I downloaded and compiled your sample repo (btw, don't use 'var', it's not available on versions of Java for certain MacOS versions, such as mine)

  2. I copied sample.params to master.params

  3. I added this to the end of it:

    eval.masterproblem = ec.eval.MasterProblem
    eval.master.port = 15000
    eval.masterproblem.job-size = 1
    eval.masterproblem.max-jobs-per-slave = 1
    eval.compression = false
    evalthreads = 1

  4. I made a file called slave.params, containing:

    parent.0 = sample.params
    eval.master.host = localhost
    eval.return-inds = false

  5. I ran java ec.Evolve -file master.params

  6. I ran java ec.Evolve -file slave.params

Worked great. So I'm not sure what to say. Now I have a few minor changes in my ECJ code that's not on the repository, but they're all with respect to the GroupedProblemForm hack I'm not sure I want to make official: they're pretty minor too. I don't think that's it.

There wouldn't be any "noise" -- sockets are guaranteed fail-fast. Do you have some kind of monitor preventing socket transmission without permission?

Sean
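
One quick way to rule out a blocked or filtered port is to attempt a plain TCP connection to the master's port from the slave machine. The following is a minimal sketch, not part of ECJ: the class name `PortCheck` and the method `canConnect` are hypothetical, and the defaults (`localhost`, port 15000) are simply taken from the params shown in this thread. It assumes the master is already running and listening.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of ECJ): checks whether a TCP port accepts connections.
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // connect() throws an IOException on refusal, timeout, or an unresolvable host
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults mirror eval.master.host / eval.master.port from the params in this thread
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 15000;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

If this reports the port as unreachable while the master is running, a firewall or similar monitor is a plausible suspect; if it reports reachable, the problem more likely lies elsewhere.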

ZvikaZ commented 3 years ago

    6. I ran java ec.Evolve -file slave.params

  1. I assume it's a typo, and you ran java ec.eval.Slave -file slave.params?
  2. Maybe it's related to Java versions? Somehow it works well with Java 8, but stops working with newer Javas? (I will try it tomorrow with Java 8)
  3. Maybe it's related to MacOS vs other OSes? (I don't have a MacOS available to me, it'd be great if you can verify it with Linux/Windows)

eclab commented 3 years ago

Let me look at it closer.

ZvikaZ commented 3 years ago

Hi. Have you been able to reproduce the issue? Meanwhile, I've tried with Java 8, and it behaved the same.

ZvikaZ commented 2 years ago

Hi. Any news?

ZvikaZ commented 2 years ago

Note: in order to facilitate running, I have added Maven goals.

To compile: mvn clean compile
To run on master host: mvn exec:java@master
To run on slave host: mvn exec:java@slave

I'd really like to know if you can re-create the problem. If it's an issue with ECJ, it's one thing. But if it's something that I've done wrong, I'd really like to trace the difference and fix accordingly.

eclab commented 2 years ago

Okay, I've been testing on my end and I think ECJ is running properly. I believe you may be somewhat mistaken as to how slaves work.

When you run under a slave, you have two options:

  1. Standard Option. The slave is given individuals and evaluates them.
  2. Opportunistic Evolution option. The slave is given a group of individuals as a subpopulation and is asked to do some evolution of its own on them for a while as a kind of mini-ECJ.

You're set up for #1. In this case you're seeing a zero-length subpopulation because there is no subpopulation. You just have the individual you were given and are being asked to evaluate it.

Sean

ZvikaZ commented 2 years ago

I agree that I want the first, "standard", option. Can you refer me to some working example that I can use as a reference?

ZvikaZ commented 2 years ago

And anyway, I'm not sure I understand where's my mistake. The error I receive on the master side is:

 Slave attempts to connect.
Slave /127.0.0.1/0 connected successfully.
FATAL ERROR:
There are no individuals with a valid fitness (that is, with their evaluated set); Cannot compute best-so-far statistics

And on the slave side:

FATAL ERROR:
Unable to read the subpop number from the master:
java.io.EOFException
FATAL ERROR:
Unable to read type of evaluation from master.  Maybe the master closed its socket and exited?:
ec.util.Output$OutputExitException:

I'm not sure how these errors are related to anything I've done?

The only place I use subpop is in src/main/java/SampleProblem.java, lines 23-26; but I've commented those out, and the same error persists.

Let me write my rationale for slave usage, according to what I understood from the ECJ manual, and please tell me if I'm mistaken:

The environment is working well without the master/slave mechanism; then I've added the eval.master* lines to sample.params and slave.params, started one master and (currently only) one slave host, and it should Just Work (TM). Is that accurate?

Thanks

EDIT

I've found where the error message that I got, Unable to read the subpop number from the master:, comes from: ecj/src/main/java/ec/eval/Slave.java, line 550.

eclab commented 2 years ago

I am not getting these errors at all. I can run

java ec.Evolve -file master.params -p generations=100
java ec.eval.Slave -file slave.params

And I have no issues. My master.params looks like this:

parent.0 = sample.params
eval.master.host = localhost
eval.return-inds = false
eval.masterproblem = ec.eval.MasterProblem
eval.master.port = 15000
#you can make this 1 of course -- but if we only have one slave to test with we might as well be 40
eval.masterproblem.job-size = 40
eval.masterproblem.max-jobs-per-slave = 1
eval.compression = false
evalthreads = 1

my slave.params file says:

parent.0 = sample.params
eval.master.host = localhost
eval.return-inds = false

I commented this out in Sample.java:

Subpopulation p = state.population.subpops.get(0);
if (p.initialSize != p.individuals.size()) {
    state.output.fatal("someone got lost!!! (you might want to comment `breedthreads = auto` in the params file)");
}

Now, I don't have python running, so each individual appears to get the same fitness, but that's a side issue.

My first suspicion is that your socket is getting blocked somehow, or is being shut down early. It's very strange. Linux shouldn't have any issue at all either (they're both just using standard Unix sockets). Running out of ideas.
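
To illustrate what an early shutdown looks like from the reading side, here is a minimal sketch in plain Java, independent of ECJ. A stand-in "master" thread accepts one connection and closes it without writing anything; a stand-in "slave" then tries to read an int through a `DataInputStream` and receives `java.io.EOFException`, the same exception type that appears in the slave log above. The class and method names (`EofDemo`, `readAfterPeerClosed`) are hypothetical, and this only demonstrates the symptom, not its cause in this thread.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo (not ECJ code): a reader sees EOFException when its peer closes early.
final class EofDemo {

    /** Returns true if reading an int from a peer that closed early throws EOFException. */
    static boolean readAfterPeerClosed() {
        // Port 0 asks the OS for any free port; bind to loopback only
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            // Stand-in "master": accept one connection, then close it without sending data
            Thread master = new Thread(() -> {
                try (Socket accepted = server.accept()) {
                    // intentionally write nothing before closing
                } catch (IOException ignored) {
                    // nothing to report in this sketch
                }
            });
            master.start();

            // Stand-in "slave": connect and try to read a 4-byte int
            try (Socket slave = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
                 DataInputStream in = new DataInputStream(slave.getInputStream())) {
                in.readInt(); // blocks until 4 bytes arrive or the stream ends
                return false;
            } catch (EOFException e) {
                return true;  // the peer closed before sending a complete int
            } finally {
                master.join();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("EOFException observed: " + readAfterPeerClosed());
    }
}
```

The point of the sketch is that an `EOFException` on the reader only says the other end stopped sending; the reason the other end stopped has to be found on that side.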

ZvikaZ commented 2 years ago

I've followed what you wrote here, and it fails with the same error. Is it possible that it's OS-related? Maybe ECJ's current master-slave code works only on Macs, and not on Windows and Linux (I've tried both)? Or maybe it's something specific to your JDK (again, I've tried a few JDK versions; none helped...)

Can you please check this on some Linux machine? Or maybe provide me with a clean example, that I can check on my machines?

eclab commented 2 years ago

Just did a test on a large redhat server of ours. Works great. Here's what I'm using.

ZvikaZ commented 2 years ago

Just did a test on a large redhat server of ours. Works great. Here's what I'm using.

  1. That's great news! Do you happen to know what redhat version is it? I want to try it as well.
  2. Maybe your answer was truncated? I see the sentence Here's what I'm using. without further details.
  3. Did you run the same test as described above (based on my environment, with your minor changes), or was it something else (if that's the case, can you please give details, so I can try it as well)?
  4. I've read again our discussion, and encountered:

    Now I have a few minor changes in my ECJ code that's not on the repository, but they're all with respect to the GroupedProblemForm hack I'm not sure I want to make official: they're pretty minor too

It's possible that these changes are the cause of the difference between my runs and yours. Can you please commit what you're working on (even as a temporary branch, not master), so we can make sure it's really not related? (Or can you test with a clean ECJ release?)

eclab commented 2 years ago

It's possible but pretty unlikely. I've already committed them.

ZvikaZ commented 2 years ago

Well, it appears that I was on the right road, but heading in the opposite direction :-) The problems were due to my local changes in ECJ. Once I dropped them and used your last commit, or release 27, everything works fine...

Thanks for the help debugging this!