atsign-foundation / at_libraries

Support libraries & dependencies for Atsign's technology
https://pub.dev/publishers/atsign.org/packages
BSD 3-Clause "New" or "Revised" License

at_lookup causes a crash when lookups fail to root (possibly other places too) #156

Open cconstab opened 2 years ago

cconstab commented 2 years ago

Describe the bug

If a device is offline, at_lookup can crash the application rather than waiting for connectivity to come back. In this case the failure looks to be in name lookup, but if the network is down nothing in @ should crash; it should wait until the network is available.

To Reproduce

Steps to reproduce the behavior:

  1. Create a small app that connects to a secondary (ColinSnippets/ssh_control works)
  2. Then run the program whilst having no network connection
  3. And then watch

Expected behavior

@ apps should never crash; they should handle being offline gracefully and reconnect when possible.

Screenshots

pi@raspberrypi:~/Colin-snippets/ssh_control $ bin/ssh_control
initializing storage
INFO|2022-03-31 03:00:52.990549|HiveBase|commit_log_f15959d1046b21a3e727245571dcd5c697956835e967a0a273db44d5681ac682 initialized successfully

AtServer.getHiveSecretFromFile file found
INFO|2022-03-31 03:00:53.003552|HiveBase|f15959d1046b21a3e727245571dcd5c697956835e967a0a273db44d5681ac682 initialized successfully

SEVERE|2022-03-31 03:00:53.016314|AtLookup|AtLookup.findSecondary connection to root.atsign.org exception: SocketException: Failed host lookup: 'root.atsign.org' (OS Error: Temporary failure in name resolution, errno = -3)

Unhandled exception:
Exception: Secondary server not found
#0      AtLookupImpl.createConnection (package:at_lookup/src/at_lookup_impl.dart:270)
<asynchronous suspension>
#1      AtLookupImpl._sendCommand (package:at_lookup/src/at_lookup_impl.dart:550)
<asynchronous suspension>
#2      AtLookupImpl.authenticate (package:at_lookup/src/at_lookup_impl.dart:415)
<asynchronous suspension>
#3      AtOnboardingServiceImpl.authenticate (package:at_onboarding_cli/src/at_onboarding_service_impl.dart:187)
<asynchronous suspension>
#4      main (file:///home/pi/Colin-snippets/ssh_control/bin/ssh_control.dart:29)
<asynchronous suspension>
pi@raspberrypi:~/Colin-snippets/ssh_control $

Additional context

This is critically important for IoT use cases and also for mobile apps.
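As an illustration of the expected behavior, here is a minimal Dart sketch of retrying with exponential backoff instead of letting the failure reach `main`. It is not part of at_lookup: `retryWithBackoff`, `shouldRetry`, and the delay values are hypothetical names and defaults chosen for this example, and the failing lookup is simulated rather than calling the real library.

```dart
import 'dart:async';
import 'dart:io';

/// Hypothetical helper (not an at_lookup API): runs [operation] and retries
/// with exponential backoff while the failure looks transient.
Future<T> retryWithBackoff<T>(
  Future<T> Function() operation, {
  bool Function(Exception error)? shouldRetry,
  Duration initialDelay = const Duration(seconds: 1),
  Duration maxDelay = const Duration(seconds: 60),
  int? maxAttempts, // null means keep trying until the operation succeeds
}) async {
  var delay = initialDelay;
  var attempt = 0;
  while (true) {
    attempt++;
    try {
      return await operation();
    } on Exception catch (e) {
      // By default only socket-level failures are treated as retryable.
      final retryable = shouldRetry?.call(e) ?? e is SocketException;
      if (!retryable || (maxAttempts != null && attempt >= maxAttempts)) {
        rethrow;
      }
      stderr.writeln('Attempt $attempt failed ($e); retrying in $delay');
      await Future<void>.delayed(delay);
      // Double the wait each time, capped at maxDelay.
      final doubled = delay * 2;
      delay = doubled > maxDelay ? maxDelay : doubled;
    }
  }
}

Future<void> main() async {
  var calls = 0;
  // Simulated lookup: fails twice as if DNS were down, then succeeds.
  final result = await retryWithBackoff<String>(
    () async {
      calls++;
      if (calls < 3) {
        throw const SocketException("Failed host lookup: 'root.atsign.org'");
      }
      return 'connected';
    },
    initialDelay: const Duration(milliseconds: 10),
  );
  print('$result after $calls attempts');
}
```

Note that in the trace above the error reaching `main` is a plain `Exception: Secondary server not found`, not the underlying `SocketException`, so application code wrapping the real call would need to pass its own `shouldRetry` predicate. Handling this inside the library would avoid that kind of guesswork.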

cconstab commented 2 years ago

@VJag I would love your thoughts on this one and on how to handle it gracefully at the @ platform level. Thanks.

VJag commented 2 years ago

I will certainly analyse.

VJag commented 2 years ago

@cconstab I have captured my analysis here:

https://docs.google.com/spreadsheets/d/1KE22RrWzIKPvR1NTDZWKcfaU1sFnCEhJeHoDYjd3QBA/edit?usp=sharing

In the "Network usage analysis" tab I tried to capture the network usage, and in the "Solution" tab I tried to capture the new abstraction I am proposing.

Please let me know if my analysis is in line with what was expected.

cconstab commented 2 years ago

Dealing with on/off and intermittent networks needs to be core to @. It's important in mobile but critical in IoT.

I think this needs to be looked at this sprint.

Thoughts?

@gck @VJag @nickelskevin

gkc commented 2 years ago

I agree. Every network-related error needs to have clearly defined, predictable, well-tested behaviour, and reconnection needs to be equally reliable, predictable and well tested.

I'd like the focus for this sprint to be

1) Graceful, reliable, fully tested handling of intermittent network availability in the client libraries
2) Lots more tests ensuring we cover all recovery scenarios for inter-server connection errors

The problems we discovered in e2e tests this week are most easily found in unit tests, where you can control the environment by plugging in stubs and mocks. This is especially true for network interactions.
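A rough sketch of what such a stub could look like in Dart follows. Everything here is hypothetical and only for illustration: `SecondaryLocator`, `FlakyLocator`, `tryFindSecondary`, and the address `secondary.example:6464` are made-up names, not at_lookup types or real endpoints. The point is that an "offline, then online" sequence can be scripted without touching the network.

```dart
import 'dart:io';

/// Hypothetical interface standing in for whatever resolves a secondary
/// address; it is not an at_lookup type.
abstract class SecondaryLocator {
  Future<String> findSecondary(String atSign);
}

/// Stub that simulates an offline device for the first [failures] calls.
class FlakyLocator implements SecondaryLocator {
  FlakyLocator(this.failures);

  int failures;
  int calls = 0;

  @override
  Future<String> findSecondary(String atSign) async {
    calls++;
    if (failures > 0) {
      failures--;
      throw const SocketException('Failed host lookup (simulated)');
    }
    return 'secondary.example:6464'; // made-up address
  }
}

/// Code under test: reports "offline" as a null value instead of letting
/// the SocketException escape to the caller.
Future<String?> tryFindSecondary(
    SecondaryLocator locator, String atSign) async {
  try {
    return await locator.findSecondary(atSign);
  } on SocketException {
    return null; // offline: the caller can schedule a retry
  }
}

Future<void> main() async {
  final locator = FlakyLocator(1);
  final first = await tryFindSecondary(locator, '@alice');
  final second = await tryFindSecondary(locator, '@alice');
  if (first != null) throw StateError('expected an offline result first');
  if (second != 'secondary.example:6464') {
    throw StateError('expected an address after recovery');
  }
  print('offline then online handled in ${locator.calls} calls');
}
```

In a real suite the checks in `main` would live in `package:test` test cases; plain `StateError` checks are used here only to keep the sketch dependency-free.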

gkc commented 2 years ago

I'd add

3) Review how errors are passed from the underlying core libraries to application code, agree on any enhancements that need to be made, and implement them
4) Ensure that the apps have clear visibility of all relevant client-side state, i.e. document and agree what enhancements need to be made in order to give the visibility that app code needs, and implement those enhancements

gkc commented 2 years ago

Tagging @sarika01 also. Making all of this happen will need close collaboration across all of the engineers irrespective of whether they've been more focussed on apps or client SDK or server - the more cross-pollination that happens, the better. It'd be great to see 'core' developers working on app widgets and 'app' developers working on 'core' libraries!