Closed calebjordan closed 8 years ago
We haven't been able to find a reproducible set of circumstances here. How often is this happening? What OS is the host computer running? If it's happening frequently, is it possible to set up a little demo that crashes it?
It's happening every 5-10 minutes or so. We're not able to run a scan longer than that without it failing, and short scans will sometimes fail as well. We don't think it's a function of the experiment, longer ones are just more likely to experience the timeout.
The experiment will stop arbitrarily, L1 and L2 go green, and then we get the timeout error.
This is Windows 10.
We can probably record a video of it crashing, or you could call in and watch. I'm not sure our exact experiment would be useful, but I can share the experiment JSONs if you think they'd be helpful.
Also, we are running very current (pulled mid last week) versions of QGL, PyQLab, and Qlab.
I'm guessing this may not be an easy thing to figure out right away, but if you have any tips on how we could reset things without having to do a full computer restart, that would help tremendously. We can deal with restarting Matlab, but the measurement computer still has an old HD, and power cycling is cumbersome.
Are you running sequences with multiple segments? If not could you try running in CW mode vs not CW?
I would also try not to unload the library from Matlab when things fail. Just close Matlab and kill the Matlab processes.
Closing Matlab and killing the process without attempting to unload the library results in the APS being completely invisible. enumerate doesn’t work, and new Matlab instances can’t find them either. Or at least it did twice, so we’ve tried to not attempt it since.
We can try running them in CW and see if there’s a difference.
Have you tried to do enumerate a few times? Sometimes I need to do it ~10 times until the APSs appear again.
Experiment json's and h5 sequence files would be helpful to try and recreate here.
Diego, does enumerate seg fault for you, or just find 0 units? Ours seg faults. I don’t think I tried it more than 3 or 4 times.
Colm, I’ll have Matt zip those up and send them along.
You may need to enumerate for a while to wait for the OS to release the previously bound socket. We can fiddle with the TCP timeout and retry settings to work around this but we shouldn't have to. enumerate should never seg fault though.
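The "enumerate a few times" workaround described above can be sketched as a small retry loop. This is a minimal illustration only: `enumerate_fn` is a hypothetical stand-in for whatever call lists the visible APS2 units in your environment, and the attempt count and delay are assumptions, not values taken from libaps2. A retry loop like this cannot recover from a seg fault inside the library itself.

```python
import time


def retry_until_found(enumerate_fn, attempts=10, delay_s=1.0, sleep=time.sleep):
    """Call enumerate_fn until it returns a non-empty list or attempts run out.

    enumerate_fn is a hypothetical callable standing in for the driver's
    enumerate call. Returns (devices, number_of_attempts_used).
    """
    for attempt in range(1, attempts + 1):
        devices = enumerate_fn()
        if devices:
            return devices, attempt
        if attempt < attempts:
            # Give the OS time to release the previously bound socket.
            sleep(delay_s)
    return [], attempts
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays.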
Currently, so long as we power cycle the APS and then unload the library in Matlab, we can immediately repeat the experiment. So I don't think we'll have to restart things anymore.
Still crashes in CW mode.
.h5 files attached.
@calebjordan could you also attach the ExpSettings.json so I can try and recreate that part too. Also what version of Matlab are you running? We've anecdotally noticed a difference between 2015a and 2015b.
We're using 2013b. I'm not sure we can jump to any of the '15 releases, but we can jump to 2016a. I wasn't sure if you guys had tested it out yet.
Hmmm... no luck reproducing it here. I've run the experiment a number of times without any issues. I'll leave it looping overnight and see if anything happens.
And it ran 100x overnight with no issues. There must be some other quirks in the setup we're missing.
Hm. We may try a Matlab upgrade today to see if there's any difference. Do you guys recommend 15a or 15b? We can also try 16a.
@dieris and @matthewware, do you have any tips on reproducing this? Does it occur frequently enough?
Also, since the error is a timeout error, is it possible that something as simple as increasing the timeout period could help mitigate this? I could recompile with a longer timeout and see if that helps.
I am using 2015a, but I still get timeouts. I will try to reproduce the error on my machine. It happens more often to me during long acquisitions (high number of round robins) and 2d sweeps (e.g. repeat).
You could try lengthening the timeout. It's defined here. I was too aggressive with it earlier but it's pretty long now at 3s. If the APS2 hasn't responded by then it's probably dead.
Increasing the timeout didn't change anything.
@dieris, which firmware version are you using?
We're going to start downgrading until we get to something that works. I'll keep you updated on which driver/firmware versions are giving us this problem.
Thanks, @calebjordan. In the meantime I am going to look at seeing if there is some more graceful way we can recover from comms failures.
I'm running firmware 4.1
Running with 4.0 for 30 minutes or so, hasn't crashed yet. Probably the longest it's gone without this issue all day. Maybe too early to prove anything, but seems promising so far.
Sounds promising! Let us know if it's more stable for you.
APS2 still times out. Was well behaved for many hours though. Going to revert to v3.2 soon.
Timeout using 3.2 as well (using latest libaps2.dll).
Reverted back to the 0.6 release. Seems to be running fine. We're still able to use the pulses compiled by the latest QGL, which is nice.
Everything running smoothly again with the 0.6 release. We may be warming up the fridge soon, at which point we can play with running other firmware versions until we pinpoint where things break. For now it was just too much of a hindrance on experiments.
That suggests that the issue is with our TCP/IP stack, since 0.6 is UDP only. I've started working on a way to reset the TCP/IP stream on failure, which should help.
I think @caryan and I discovered why MATLAB crashed when you lost connectivity. c2ac0d10ca3925b257951a9bb7bbc666ed9a687c should fix that. We're still trying to figure out why the connection gets dropped, though.
Is there something we can run to help you guys diagnose this? Maybe a simple loop we could make to mimic communication during an experiment? Our DR is warm for the next week, so we're free to play around with it again.
@calebjordan we may have the scent of this one now. We upgraded our TCP/UDP core and are now seeing significant problems with the UDP side of things. I'm hopeful that if we sort that out we'll have a fix.
@calebjordan We have a new firmware build that survived 20,000 connect -> load_sequence -> disconnect cycles. Would you mind giving this new firmware a try to see if it still fails on you?
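A soak test along the lines of the connect -> load_sequence -> disconnect cycling mentioned above might look like the sketch below. The three callables are hypothetical placeholders for the real driver calls (they are not libaps2 function signatures); the loop simply reports the first cycle that raises.

```python
def run_cycles(connect, load_sequence, disconnect, n_cycles):
    """Repeat connect -> load_sequence -> disconnect n_cycles times.

    The three arguments are hypothetical zero-argument callables wrapping
    the real driver calls. Returns None if every cycle succeeds, otherwise
    (cycle_index, description_of_error) for the first failing cycle.
    """
    for i in range(n_cycles):
        try:
            connect()
            load_sequence()
            disconnect()
        except Exception as err:
            # Stop at the first failure so the device is left in the error state
            # for inspection.
            return i, repr(err)
    return None
```

Stopping at the first failure, rather than continuing, keeps the module in its failed state so that follow-up diagnostics can be run against it.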
@calebjordan Have you had a chance to try the new firmware?
Been a bit distracted for most of last week, I'll try to get this loaded and running this afternoon. I'll keep this issue updated.
Running continuously for ~1 hour now, no crashes. I'll update again tonight and tomorrow morning if it lasts that long (fingers crossed).
Ran for another 2 hours and then crashed. Same error as before.
MATLAB crashed? Or you got an APS2 comms timeout error?
Sorry, APS2 timeout error. MATLAB did not crash. I was able to power cycle the aps2 and continue in the same Matlab instance.
Stepping sweep 2: 1000 points ( 43%)
Warning: The following error was caught while executing 'onCleanup' class destructor:
APS2 library call failed with message: Timed out while waiting to receive data.
> In ExpScripter at 52
Error using APS2.check_status (line 260)
APS2 library call failed with message: Timed out while waiting to receive data.
Error in APS2/aps2_call (line 44)
APS2.check_status(status);
Error in APS2/stop (line 92)
aps2_call(obj, 'stop');
Are you also using libaps2 driver built from current master?
Next time you get the system into an error state, could you try running this snippet of Julia code:
# put the IP address of the failing module below
tcp_sock = connect(ip"192.168.5.11", 0xbb4e)
datagram = UInt32[0x10000002, 0x44a00050]
write(tcp_sock, map(hton, datagram))
resp_header = map(ntoh, read(tcp_sock, UInt32, 2))
println("response header: $resp_header")
uptime_array = map(ntoh,read(tcp_sock, UInt32, 2))
uptime = uptime_array[1] + 1e-9*uptime_array[2]
println("Uptime $uptime")
disconnect!(tcp_sock)
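For anyone without Julia on the measurement computer, a rough Python equivalent of the snippet above is sketched here. The port (0xbb4e), the two request words, and the reading of the reply as a two-word header followed by seconds and nanoseconds are all taken from the Julia code; the function names and the 3 s timeout are assumptions made for illustration.

```python
import socket
import struct

APS2_TCP_PORT = 0xBB4E  # port used in the Julia snippet


def build_uptime_query():
    """Pack the two 32-bit request words in network byte order."""
    return struct.pack("!2I", 0x10000002, 0x44A00050)


def parse_uptime(payload):
    """Interpret 8 bytes as (seconds, nanoseconds), as the Julia snippet does."""
    seconds, nanoseconds = struct.unpack("!2I", payload)
    return seconds + 1e-9 * nanoseconds


def recv_exact(sock, n):
    """Read exactly n bytes, since a TCP recv may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before full response arrived")
        buf += chunk
    return buf


def query_uptime(ip, timeout_s=3.0):
    """Send the query to the module at ip and return (header_words, uptime_seconds)."""
    with socket.create_connection((ip, APS2_TCP_PORT), timeout=timeout_s) as sock:
        sock.sendall(build_uptime_query())
        header = struct.unpack("!2I", recv_exact(sock, 8))
        uptime = parse_uptime(recv_exact(sock, 8))
    return header, uptime
```

Usage would be `header, uptime = query_uptime("192.168.5.11")`, with the IP address of the failing module substituted.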
The driver was built from the master that was current when this issue was opened. I'll pull down the most current and rebuild and try again.
In the meantime, your julia script returns
response header: UInt32[0x10000002,0x44a00050]
Uptime 3236.07875925
ERROR: LoadError: UndefVarError: disconnect not defined
EDIT: Also, before, when the APS2 timed out, L1 and L2 would be solid green until a power cycle. Now it's only a solid L1, and I haven't needed to power cycle the unit at all. I was able to run the Julia script and then re-open Matlab and re-run the test experiment.
There's a typo in the Julia script. It should be disconnect!
Using the latest master ( 8f4ad8b) driver.
Now, L1 and L2 are solid again. And the Julia script fails to connect: ERROR: LoadError: connect: connection timed out (ETIMEDOUT). MATLAB crashes if I attempt to connect as well. I'll have to power cycle to run anything again.
Here's another Julia snippet you can use to try to force a TCP reset:
udp_sock = UDPSocket();
bind(udp_sock, ip"0.0.0.0", 0xbb4f)
send(udp_sock, ip"192.168.2.2", 0xbb4f, [0x02]) # replace ip address on this line
close(udp_sock)
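The same UDP reset can be sketched in Python. The port 0xbb4f and the single 0x02 payload byte come from the Julia snippet; the function name and the `local_port` parameter are assumptions added for illustration.

```python
import socket


def send_reset_byte(ip, port=0xBB4F, local_port=0xBB4F):
    """Send the single byte 0x02 over UDP to ip:port, as the Julia snippet does.

    Binds the local socket to local_port first. Returns the number of bytes sent.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", local_port))
        return sock.sendto(b"\x02", (ip, port))
```

Usage would be `send_reset_byte("192.168.2.2")`, replacing the address with that of the module to reset.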
@calebjordan we have a new release that has been running for the last few hours without problem. We'll post it later this afternoon.
New release posted, version 4.2. Please give it a try @calebjordan and let us know how it goes.
We were able to run a 12+ hour experiment with no problems. The problem seems to be solved.
Thanks guys!
Closing.
Glad to hear it, @calebjordan.
We're having issues with the APS2 timing out mid-experiment. Not enough datapoints to relate to the experiment parameters, but we're doing simple [[MEAS(q1)]] pulses every 35us. When this happens, running the command line enumerate and other APS2 functions all fail. And it seems like unless we try to unload the libaps2 library within Matlab before closing it (which usually crashes Matlab), we have to do a full computer and APS2 restart in order to restore functionality. Matt says he and Diego have had similar issues.
Any ideas for how to fix this or at least make it more usable in the meantime?