LORD-MicroStrain / MSCL

MicroStrain Communication Library
https://www.microstrain.com/software/mscl
MIT License

Datalog Download from Nodes fails despite a generous number of retries and timeouts #372

Open matzefit opened 9 months ago

matzefit commented 9 months ago

Hi,

I have a synchronized sampling network with 7 wireless nodes (G-Link-200-8G) and a base station (WSDA-200-USB). I wrote a script in Python that:

  1. connects the base station, defines a synced network, and adds the nodes to it (sketched right after this list)
  2. sets the nodes to idle (max. timeout of 2 min)
  3. configures the nodes (with retries, catching all errors)
  4. configures the network (with retries, catching all errors)
  5. starts sampling with triggers and runs for a set duration (via time.sleep(Duration) in Python)
  6. sets the nodes to idle (timeout of 2 min)
  7. downloads the data with max. 200 retries and a 1 s wait in between - creates a CSV file for each node and uploads it to my cloud
  8. downloads the diagnostic data for each node and uploads it to my cloud
  9. restarts the whole cycle from step 1
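
For reference, here is roughly how steps 1 and 2 look in my script (the COM port, baud rate, and node addresses below are placeholders, not my actual values):

    import mscl

    # Placeholders -- not my actual port / node addresses
    connection = mscl.Connection.Serial("COM3", 3000000)
    base = mscl.BaseStation(connection)

    network = mscl.SyncSamplingNetwork(base)
    nodelist = [mscl.WirelessNode(addr, base) for addr in [101, 102, 103]]

    for node in nodelist:
        status = node.setToIdle()        # returns a SetToIdleStatus
        while not status.complete(300):  # complete() waits up to 300 ms per call
            pass
        network.addNode(node)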

I'm doing the actual data download as per the guidelines in this GitHub repo. 1st step: initiate a DatalogDownloader:

            downloader = mscl.DatalogDownloader(node)

Again with a try/except around it and a maximum of 5 retries to create the downloader for each node. This succeeds in about 90% of cases.

2nd step: loop until downloader.complete() is true, with 200 retries and a 1 s wait in between:

        # Loop until all data has been downloaded
        while retry_counter > 0:
            try:
                if downloader.complete():
                    download_completed = True
                    print("Data download complete. CSV file created at", csv_filename)
                    break  # Exit the while loop once the download is complete
                sweep = downloader.getNextData()
                timestamp = datetime.fromtimestamp(sweep.timestamp().seconds())
                sweep_data = {'Timestamp': timestamp.strftime("%Y-%m-%d %H:%M:%S.%f")}

This procedure is part of the function DownloadDataAllNodes that I wrote, which downloads the data and syncs it with Google Drive all in one.

The question: why is the download of the data so extremely unreliable?

Setting the nodes to idle, configuring them, running a network check, starting the sampling, and setting the nodes back to idle all work really well; no nodes are normally lost during these steps. But when the downloads kick off, there are always 3-5 of the 7 nodes (and never the same ones) that burn through all 200 retries and then fail to download their data (on average ca. 1 MB per node per sampling epoch). The nodes aren't even spread out: they all sit in one spot, ca. 30 cm from the base station, and spreading them out to avoid signal interference doesn't help either. I get that a wireless connection is prone to connectivity issues, but I doubt wireless links are generally this unreliable. Is there a fix for my problem? Maybe a longer timeout at some point in my function, e.g. something like the sketch below?
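
For example, I wonder whether raising MSCL's per-command timeout and retry count before kicking off the downloads would help. A rough sketch of what I mean (the 500 ms / 3-retry values are guesses I have not validated):

    # Untested idea: give MSCL more headroom per command before downloading
    base.timeout(500)             # base station command timeout, in ms
    base.readWriteRetries(3)      # extra retries on failed read/write commands
    for node in nodelist:
        node.timeout(500)         # per-node command timeout, in ms
        node.readWriteRetries(3)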

Thanks.

Find the entire DownloadDataAllNodes function that I wrote below (before this point, the nodes are all confirmed to be idle):

import csv
import os
import subprocess
import time
from datetime import datetime

import mscl

def DownloadDataAllNodes(nodelist, network):
    time.sleep(15)  # rest the nodes in idle before starting the downloader sessions
    print('-------Starting Data Download for all nodes')
    print("Network OK:", network.ok())
    for node in nodelist:
        Datalogretries = 5  # Number of retries for initiating downloader
        downloader = None
        # Try to start the datalog downloader session
        while Datalogretries > 0:
            try:
                downloader = mscl.DatalogDownloader(node)
                print(f"Downloader Session started successfully for {node.name()}")
                time.sleep(15)  # rest the procedure; potentially increases success rate for downloads (?)
                break  # Break if downloader session started successfully
            except mscl.Error_NodeCommunication as e:
                print(f"Communication error with Node {node.name()}: {e}")
                Datalogretries -= 1
                time.sleep(1)  # Wait for a second before retrying
        if downloader is None or Datalogretries == 0:
            print(f"Failed to initiate Datalogdownloader with Node {node.name()} after several attempts")
            continue  # Skip to the next node if downloader couldn't be initiated
        MAX_RETRIES = 200  # 200 attempts to download the data
        retry_counter = MAX_RETRIES
        header_written = False
        currentCalCoefs = None  # populated from the first sweep's metadata below
        csvFileTimestamp = datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
        csvFileNodeName = node.name()
        csv_filename = f"Data/node{csvFileNodeName}-{csvFileTimestamp}.csv"
        data_folder = "Data"
        if not os.path.exists(data_folder):
            os.makedirs(data_folder)
        with open(csv_filename, mode='w', newline='') as file:
            writer = csv.writer(file)
            download_completed = False
            # Loop until all data has been downloaded
            while retry_counter > 0:
                try:
                    if downloader.complete():
                        download_completed = True
                        print("Data download complete. CSV file created at", csv_filename)
                        break  # Exit the loop if download is complete
                    sweep = downloader.getNextData()
                    timestamp = datetime.fromtimestamp(sweep.timestamp().seconds())
                    sweep_data = {'Timestamp': timestamp.strftime("%Y-%m-%d %H:%M:%S.%f")}
                    if downloader.metaDataUpdated():
                        currentSampleRate = downloader.sampleRate()
                        currentCalCoefs = downloader.calCoefficients()
                    for dataPoint in sweep.data():
                        channelName = dataPoint.channelName()
                        channelId = dataPoint.channelId()
                        channelNumber = dataPoint.channelNumber()
                        value = dataPoint.as_float()
                        if not sweep.calApplied():
                            calCoef = currentCalCoefs[channelId]
                            value = (value * calCoef.linearEquation().slope()) + calCoef.linearEquation().offset()
                        sweep_data[channelName] = value
                    if not header_written:
                        writer.writerow(sweep_data.keys())
                        header_written = True
                    writer.writerow(sweep_data.values())
                except mscl.Error_NoData:
                    download_completed = True
                    print("No more data available from the Node.")
                    break
                except Exception as e:
                    retry_counter -= 1
                    print(f"Communication error with Node {node.name()}. Retrying... ({MAX_RETRIES - retry_counter}/{MAX_RETRIES}). Error: {e}")
                    time.sleep(1) #wait for 1 sec before retrying datadownload with downloader.getNextData()
            if retry_counter == 0:
                print("Reached maximum retry attempts. Data download was not successful for Node", node.name())
            if download_completed:
                try:
                    command = f"rclone copy '{csv_filename}' GOOGLEDRIVELINK/csvfile'"
                    subprocess.run(command, shell=True, check=True)
                    print("Data successfully synced with Google Drive.")
                    erase_retries = 5
                    while erase_retries > 0:
                        try:
                            node.erase()
                            print(f"Cleared Storage on Node {node.name()}")
                            break  # Break from the loop if erase is successful
                        except Exception as e:
                            erase_retries -= 1
                            print(f"Failed to clear data on Node {node.name()}, retrying... ({5 - erase_retries}) Error: {e}")
                            time.sleep(1)  # Wait for 1 second before retrying
                    if erase_retries == 0:
                        print(f"Failed to clear data on Node {node.name()} after multiple attempts.")
                except subprocess.CalledProcessError as e:
                    print("Failed to sync data with Google Drive:", e)