BBN-Q / libaps2

C/C++ driver for BBN APSv2
Apache License 2.0

APS2 Timeout mid-experiment #78

Closed calebjordan closed 8 years ago

calebjordan commented 8 years ago

We're having issues with the APS2 timing out mid-experiment. We don't have enough data points to relate it to the experiment parameters, but we're doing simple [[MEAS(q1)]] pulses every 35us. When this happens, running the command-line enumerate and other APS2 functions all fail. And it seems that unless we try to unload the libaps2 library within Matlab before closing it (which usually crashes Matlab), we have to do a full computer and APS2 restart to restore functionality.

Matt says he and Diego have had similar issues.

Any ideas for how to fix this or at least make it more usable in the meantime?

caryan commented 8 years ago

We haven't been able to find a reproducible set of circumstances here. How often is this happening? What OS is the host computer running? If it's happening frequently, is it possible to set up a little demo that crashes it?

calebjordan commented 8 years ago

It's happening every 5-10 minutes or so. We're not able to run a scan longer than that without it failing, and short scans will sometimes fail as well. We don't think it's a function of the experiment; longer ones are just more likely to experience the timeout.

The experiment will stop arbitrarily, L1 and L2 go green, and then we get the timeout error.

This is Windows 10.

We can probably record a video of it crashing, or you could call in and watch. I'm not sure our exact experiment would be useful, but I can share the experiment JSONs if you think they'd be helpful.

calebjordan commented 8 years ago

Also, we are running very current (pulled mid last week) versions of QGL, PyQLab, and Qlab.

I'm guessing this may not be an easy thing to figure out right away, but if you have any tips on how we could reset things without having to do a full computer restart, that would help tremendously. We can deal with restarting Matlab, but the measurement computer still has an old HD, and power cycling is cumbersome.

matthewware commented 8 years ago

Are you running sequences with multiple segments? If not, could you try running in CW mode versus non-CW mode?

matthewware commented 8 years ago

I would also try not to unload the library from Matlab when things fail. Just close Matlab and kill the Matlab processes.

calebjordan commented 8 years ago

Closing Matlab and killing the process without attempting to unload the library results in the APS being completely invisible. enumerate doesn’t work, and new Matlab instances can’t find them either. Or at least it did twice, so we’ve tried to not attempt it since.

We can try running them in CW and see if there’s a difference.

dieris commented 8 years ago

Have you tried to do enumerate a few times? Sometimes I need to do it ~10 times until the APSs appear again.

caryan commented 8 years ago

Experiment JSONs and h5 sequence files would help us try to recreate this here.

calebjordan commented 8 years ago

Diego, does enumerate seg fault for you, or just find 0 units? Ours seg faults. I don’t think I tried it more than 3 or 4 times.

Colm, I’ll have Matt zip those up and send them along.

caryan commented 8 years ago

You may need to enumerate for a while to wait for the OS to release the previously bound socket. We can fiddle with the TCP timeout and retry settings to work around this but we shouldn't have to. enumerate should never seg fault though.
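
The repeat-enumerate-until-it-answers workaround described above can be scripted. This is a minimal Python sketch, assuming a hypothetical `probe` callable that wraps whatever enumerate call you use and returns a list of device addresses (an empty list meaning nothing was found). Neither the name nor the return shape comes from libaps2, and the attempt count and delay are arbitrary.

```python
import time


def retry_until_found(probe, attempts=10, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it returns a non-empty result or attempts run out.

    probe is a hypothetical wrapper around an enumerate call; it is assumed
    to return a list of device addresses (empty when none are found).
    Returns (result, attempts_used).
    """
    for attempt in range(1, attempts + 1):
        result = probe()
        if result:
            return result, attempt
        if attempt < attempts:
            # Give the OS time to release the previously bound socket.
            sleep(delay_s)
    return [], attempts
```

This only papers over the slow socket release. It does nothing for the case where enumerate itself seg faults.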

calebjordan commented 8 years ago

Currently, so long as we power cycle the APS and then unload the library in Matlab, we can immediately repeat the experiment. So I don't think we'll have to restart things anymore.

Still crashes in CW mode.

.h5 files attached.

TestH5.zip

caryan commented 8 years ago

@calebjordan could you also attach the ExpSettings.json so I can try to recreate that part too? Also, what version of Matlab are you running? We've anecdotally noticed a difference between 2015a and 2015b.

mattai1986 commented 8 years ago

77_MH102-MOD7_5_Re8_CavPwr_cfg.zip

calebjordan commented 8 years ago

We're using 2013b. I'm not sure we can jump to any of the '15 releases, but we can jump to 2016a. I wasn't sure if you guys had tested it out yet.

caryan commented 8 years ago

Hmmm... no luck reproducing it here. I've run the experiment a number of times without any issues. I'll leave it looping overnight and see if anything happens.

caryan commented 8 years ago

And it ran 100x overnight with no issues. There must be some other quirks in the setup we're missing.

calebjordan commented 8 years ago

Hm. We may try a Matlab upgrade today to see if there's any difference. Do you guys recommend 15a or 15b? We can also try 16a.

@dieris and @matthewware, do you have any tips on reproducing this? Does it occur frequently enough?

Also, since the error is a timeout error, is it possible that something as simple as increasing the timeout period could help mitigate this? I could recompile with a longer timeout and see if that helps.

dieris commented 8 years ago

I am using 2015a, but I still get timeouts. I will try to reproduce the error on my machine. It happens more often to me during long acquisitions (high number of round robins) and 2d sweeps (e.g. repeat).

caryan commented 8 years ago

You could try lengthening the timeout. It's defined here. I was too aggressive with it earlier but it's pretty long now at 3s. If the APS2 hasn't responded by then it's probably dead.

calebjordan commented 8 years ago

Increasing the timeout didn't change anything.

@dieris, which firmware version are you using?

We're going to start downgrading until we get to something that works. I'll keep you updated on which driver/firmware versions are giving us this problem.

blakejohnson commented 8 years ago

Thanks, @calebjordan. In the meantime I am going to look at seeing if there is some more graceful way we can recover from comms failures.

dieris commented 8 years ago

I'm running firmware 4.1

calebjordan commented 8 years ago

Running with 4.0 for 30 minutes or so, hasn't crashed yet. Probably the longest it's gone without this issue all day. Maybe too early to prove anything, but seems promising so far.

matthewware commented 8 years ago

Sounds promising! Let us know if it's more stable for you.

calebjordan commented 8 years ago

APS2 still times out. Was well behaved for many hours though. Going to revert to v3.2 soon.

calebjordan commented 8 years ago

Timeout using 3.2 as well (using latest libaps2.dll).

Reverted back to the 0.6 release. Seems to be running fine. We're still able to use the pulses compiled by the latest QGL, which is nice.

calebjordan commented 8 years ago

Everything running smoothly again with the 0.6 release. We may be warming up the fridge soon, at which point we can play with running other firmware versions until we pinpoint where things break. For now it was just too much of a hindrance on experiments.

blakejohnson commented 8 years ago

That suggests that the issue is with our TCP/IP stack, since 0.6 is UDP only. I've started working on a way to reset the TCP/IP stream on failure, which should help.

blakejohnson commented 8 years ago

I think @caryan and I discovered why MATLAB crashed when you lost connectivity. c2ac0d10ca3925b257951a9bb7bbc666ed9a687c should fix that. We're still trying to figure out why the connection gets dropped, though.

calebjordan commented 8 years ago

Is there something we can run to help you guys diagnose this? Maybe a simple loop we could make to mimic communication during an experiment? Our DR is warm for the next week, so we're free to play around with it again.

blakejohnson commented 8 years ago

@calebjordan we may have the scent of this one now. We upgraded our TCP/UDP core and are now seeing significant problems with the UDP side of things. I'm hopeful that if we sort that out we'll have a fix.

blakejohnson commented 8 years ago

@calebjordan We have a new firmware build that survived 20,000 connect->load_sequence->disconnect cycles. Would you mind giving this new firmware a try to see if it still fails on you?
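
A soak test of this kind can be approximated on the host side. This is a minimal Python sketch of a connect -> load_sequence -> disconnect loop, assuming a hypothetical `device_factory` that returns an object with `connect()`, `load_sequence(path)` and `disconnect()` methods. Those names are placeholders for whatever driver binding you use, not the libaps2 API.

```python
def stress_cycles(device_factory, sequence_path, cycles):
    """Repeat connect -> load_sequence -> disconnect and stop at the first error.

    device_factory is a hypothetical callable returning an object with
    connect(), load_sequence(path) and disconnect() methods.
    Returns (completed_cycles, error); error is None if every cycle passed.
    """
    for completed in range(cycles):
        device = device_factory()
        try:
            device.connect()
            try:
                device.load_sequence(sequence_path)
            finally:
                # Always try to disconnect once the connection was opened.
                device.disconnect()
        except Exception as exc:
            return completed, exc
    return cycles, None
```

Recording how many cycles completed before the first failure gives a number that can be compared across firmware and driver versions.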

caryan commented 8 years ago

APS2_top_v4.1-19-gd586096.zip

blakejohnson commented 8 years ago

@calebjordan Have you had a chance to try the new firmware?

calebjordan commented 8 years ago

Been a bit distracted for most of last week, I'll try to get this loaded and running this afternoon. I'll keep this issue updated.

calebjordan commented 8 years ago

Running continuously for ~1 hour now, no crashes. I'll update again tonight and tomorrow morning if it lasts that long (fingers crossed).

calebjordan commented 8 years ago

Ran for another 2 hours and then crashed. Same error as before.

blakejohnson commented 8 years ago

MATLAB crashed? Or you got an APS2 comms timeout error?

calebjordan commented 8 years ago

Sorry, APS2 timeout error. MATLAB did not crash. I was able to power cycle the aps2 and continue in the same Matlab instance.

Stepping sweep 2: 1000 points ( 43%)
Warning: The following error was caught while executing 'onCleanup' class destructor:
APS2 library call failed with message: Timed out while waiting to receive data. 
> In ExpScripter at 52 
Error using APS2.check_status (line 260)
APS2 library call failed with message: Timed out while waiting to receive data.

Error in APS2/aps2_call (line 44)
            APS2.check_status(status);

Error in APS2/stop (line 92)
            aps2_call(obj, 'stop');

blakejohnson commented 8 years ago

Are you also using libaps2 driver built from current master?

Next time you get the system into an error state, could you try running this snippet of Julia code:

# put the IP address of the failing module below
# connect over TCP on port 0xbb4e
tcp_sock = connect(ip"192.168.5.11", 0xbb4e)

# send a two-word request in network byte order
datagram = UInt32[0x10000002, 0x44a00050]
write(tcp_sock, map(hton, datagram))
# read the two-word response header
resp_header = map(ntoh, read(tcp_sock, UInt32, 2))
println("response header: $resp_header")

# read two more words and combine them into an uptime in seconds
uptime_array = map(ntoh, read(tcp_sock, UInt32, 2))
uptime = uptime_array[1] + 1e-9*uptime_array[2]
println("Uptime $uptime")

disconnect!(tcp_sock)
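
For anyone without Julia installed, here is a rough Python equivalent of the same check. It is a sketch that copies the port and request words from the snippet above. The 3 s timeout, the function names and the interpretation of the second uptime word as a 1e-9 fraction are assumptions made for illustration.

```python
import socket
import struct

APS2_TCP_PORT = 0xBB4E                     # port used by the Julia snippet
UPTIME_REQUEST = (0x10000002, 0x44A00050)  # request words copied from the Julia snippet


def build_request(words=UPTIME_REQUEST):
    """Pack 32-bit words in network (big-endian) byte order, like map(hton, ...)."""
    return struct.pack(f">{len(words)}I", *words)


def parse_words(payload):
    """Unpack a byte string into big-endian 32-bit words."""
    return struct.unpack(f">{len(payload) // 4}I", payload)


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket or raise ConnectionError."""
    chunks = []
    remaining = nbytes
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full response arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def query_uptime(ip, port=APS2_TCP_PORT, timeout_s=3.0):
    """Return (response_header, uptime_seconds) for the module at ip."""
    with socket.create_connection((ip, port), timeout=timeout_s) as sock:
        sock.sendall(build_request())
        header = parse_words(recv_exact(sock, 8))
        whole, fraction = parse_words(recv_exact(sock, 8))
    # Second word is scaled by 1e-9, as in the Julia snippet.
    return header, whole + 1e-9 * fraction
```

Calling `query_uptime("192.168.5.11")` with the failing module's address should either return the header and uptime or raise a timeout error, which is the distinction the Julia snippet was meant to expose.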

calebjordan commented 8 years ago

The driver was built from the master that was current when this issue was opened. I'll pull down the most current and rebuild and try again.

In the meantime, your julia script returns

response header: UInt32[0x10000002,0x44a00050]
Uptime 3236.07875925
ERROR: LoadError: UndefVarError: disconnect not defined

EDIT: Also, before, when the APS2 timed out, L1 and L2 would be solid green until a power cycle. Now it's only a solid L1, and I haven't needed to power cycle the unit at all. I was able to run the Julia script and then re-open Matlab and re-run the test experiment.

caryan commented 8 years ago

There's a typo in the Julia script. It should be disconnect!

calebjordan commented 8 years ago

Using the latest master (8f4ad8b) driver.

Now, L1 and L2 are solid again. And the Julia script fails to connect, with ERROR: LoadError: connect: connection timed out (ETIMEDOUT). MATLAB crashes if I attempt to connect as well. I'll have to power cycle to run anything again.

blakejohnson commented 8 years ago

Here's another Julia snippet you can use to try to force a TCP reset:

udp_sock = UDPSocket();
bind(udp_sock, ip"0.0.0.0", 0xbb4f)
send(udp_sock, ip"192.168.2.2", 0xbb4f, [0x02]) # replace ip address on this line
close(udp_sock)
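
The same reset datagram can be sent from Python. This is a sketch that mirrors the Julia snippet above (one byte, 0x02, sent over UDP to port 0xbb4f from a socket bound to the same local port). The function name and the `local_port` parameter are illustrative additions.

```python
import socket

APS2_UDP_PORT = 0xBB4F  # port used by the Julia snippet
RESET_PAYLOAD = b"\x02"  # single-byte payload sent by the Julia snippet


def send_tcp_reset(ip, port=APS2_UDP_PORT, local_port=APS2_UDP_PORT):
    """Send the one-byte UDP datagram from the Julia snippet to ip:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Bind on all interfaces, as the Julia snippet does.
        sock.bind(("0.0.0.0", local_port))
        sock.sendto(RESET_PAYLOAD, (ip, port))
```

Usage would be `send_tcp_reset("192.168.2.2")`, with the address replaced by that of the failing module.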

blakejohnson commented 8 years ago

@calebjordan we have a new release that has been running for the last few hours without problem. We'll post it later this afternoon.

blakejohnson commented 8 years ago

New release posted, version 4.2. Please give it a try @calebjordan and let us know how it goes.

calebjordan commented 8 years ago

We were able to run a 12+ hour experiment with no problems. The problem seems to be solved.

Thanks guys!

Closing.

blakejohnson commented 8 years ago

Glad to hear it, @calebjordan.