abraunegg / onedrive

OneDrive Client for Linux
https://abraunegg.github.io
GNU General Public License v3.0

Improve the "bypass_data_preservation" feature #1101

Closed mp1994 closed 4 years ago

mp1994 commented 4 years ago

Is your feature request related to a problem? Please describe.

I am experiencing something similar to issue #824 (it happens with .pdf files as well as with Office files, e.g. .docx or .pptx). I am running Ubuntu 18.04 in dual boot with Windows 10, so I had to adjust the time settings to keep the clocks of the two OSes in sync. I wanted to activate the bypass data preservation feature, but I guess there is a safer alternative.

Describe the solution you'd like

It would be nice to check the output of the bash command diff to see whether the file has actually been modified or whether only the timestamps differ. In my tests, diff produces no output in the latter case, while it lists the differences when there is a real conflict.
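The idea in this request can be sketched in a few lines of Python. This is only an illustration of the proposal, not how the client works: it assumes both copies exist as local paths, and the helper name `only_timestamp_differs` and the file names are made up for the example.

```python
import filecmp
import os
import tempfile


def only_timestamp_differs(path_a: str, path_b: str) -> bool:
    """True when the two files hold identical bytes but different mtimes."""
    # shallow=False forces a byte-by-byte comparison, the equivalent of
    # `diff` producing no output.
    same_bytes = filecmp.cmp(path_a, path_b, shallow=False)
    mtime_a = int(os.stat(path_a).st_mtime)
    mtime_b = int(os.stat(path_b).st_mtime)
    return same_bytes and mtime_a != mtime_b


# Demo with two throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.docx")
    second = os.path.join(tmp, "second.docx")
    for path in (first, second):
        with open(path, "wb") as handle:
            handle.write(b"identical content")
    # Move the second file's timestamps back to September 2020.
    os.utime(second, (1_600_000_000, 1_600_000_000))
    timestamp_only = only_timestamp_differs(first, second)

print(timestamp_only)
```

A check like this needs the full content of both versions on disk, which is not the case when one side is the copy stored online.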

abraunegg commented 4 years ago

@mp1994 What account type are you seeing your issue with?

When this was last looked at, Microsoft was modifying files on upload (there is no way to stop this), so the file IS different. This is why the bypass data preservation option was written, in addition to #824.

The file is checked against 3 elements:

  • timestamp
  • size
  • checksum as provided by OneDrive

I am confident that when you check your files, you will find Microsoft is adding metadata to them post upload, so they are technically different.

Additionally, if you have the same software running in the background on Ubuntu as in #824 ... you need to get those pieces of software fixed ...
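As a rough illustration of a check on the three elements named in this thread (timestamp, size, checksum), the comparison could look like the sketch below. It is not the client's code: `ItemState`, `local_state` and `differing_elements` are hypothetical names, and SHA-256 is used only as a stand-in because the thread does not say which hash OneDrive provides.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ItemState:
    mtime: int      # modification time, whole seconds since the epoch
    size: int       # size in bytes
    checksum: str   # hex digest (SHA-256 here, purely as a stand-in)


def local_state(path: str) -> ItemState:
    """Collect the three elements for a file on the local disk."""
    info = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return ItemState(int(info.st_mtime), info.st_size, digest.hexdigest())


def differing_elements(local: ItemState, remote: ItemState) -> List[str]:
    """Name every element on which the two states disagree."""
    differences = []
    if local.mtime != remote.mtime:
        differences.append("timestamp")
    if local.size != remote.size:
        differences.append("size")
    if local.checksum != remote.checksum:
        differences.append("checksum")
    return differences


# Invented values: same content, but the local mtime moved by one second.
online = ItemState(mtime=1_603_300_000, size=2048, checksum="abc")
on_disk = ItemState(mtime=1_603_300_001, size=2048, checksum="abc")
print(differing_elements(on_disk, online))
```

Returning the list of differing elements, rather than a single yes/no, makes it visible when only the timestamp disagrees.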

mp1994 commented 4 years ago

@mp1994 What account type are you seeing your issue with?

When this was last looked at, Microsoft was modifying files on upload (there is no way to stop this), so the file IS different. This is why the bypass data preservation option was written, in addition to #824.

The file is checked against 3 elements:

  • timestamp
  • size
  • checksum as provided by OneDrive

I am confident that when you check your files, you will find Microsoft is adding metadata to them post upload, so they are technically different.

Additionally, if you have the same software running in the background on Ubuntu as in #824 ... you need to get those pieces of software fixed ...

I am using OneDrive for Business (Microsoft 365 account, or whatever that’s called). I have no additional software running, and I installed the script from this repo on a brand new laptop. I’ll run some further tests to see whether the feature addition I suggested makes sense.

abraunegg commented 4 years ago

@mp1994

I am using OneDrive for business (Microsoft 365 account, or whatever that’s called)

OK .. most certainly Microsoft is adding / updating metadata after upload

I have no additional software running, and I installed the script from this repo on a brand new laptop.

Given that you installed Ubuntu 18.04, (and there are better options out there IMHO ..) you potentially have something like the following running in the background:

Packages like these 'could' potentially be indexing your files under Ubuntu. These sorts of packages are broken: they update the modified timestamp of the file without making any changes to it, because they use the modified timestamp to record whether the file has been indexed ... a very poor way of tracking what was indexed.

I’ll try to make some further tests to see whether the feature addition I suggested makes sense

Run the client in verbose debug log mode (--verbose --verbose). This will show you how and what is being compared, and the results leading to the actions being taken. If the result is that the timestamp is different and newer on Linux (compared to the online version), and you did not modify the file, then you need to look at what is modifying your files in the background.

Your enhancement request does not make sense right now, given the multi-level checks that are already done on a file. To the client, the local file 'is different'. What modified it, and why it is different, is something you need to hunt down if you did not modify it yourself.
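One way to hunt down files that are being touched in the background is to snapshot a directory, wait without editing anything, and snapshot it again. The sketch below is an assumption-laden illustration: `take_snapshot` and `touched_but_unchanged` are hypothetical helpers, not part of the client, and SHA-256 is simply a convenient content fingerprint.

```python
import hashlib
import os
from typing import Dict, List, Tuple

# path -> (mtime in whole seconds, SHA-256 hex digest of the content)
Snapshot = Dict[str, Tuple[int, str]]


def take_snapshot(root: str) -> Snapshot:
    """Record mtime and content hash for every file below `root`."""
    snapshot: Snapshot = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            snapshot[path] = (int(os.stat(path).st_mtime), digest)
    return snapshot


def touched_but_unchanged(before: Snapshot, after: Snapshot) -> List[str]:
    """Files whose mtime moved although their content stayed the same."""
    return sorted(
        path
        for path, (mtime, digest) in after.items()
        if path in before
        and before[path][1] == digest
        and before[path][0] != mtime
    )


# Usage (hypothetical path):
#   before = take_snapshot("/path/to/sync_dir")
#   ... leave the machine alone for a while ...
#   print(touched_but_unchanged(before, take_snapshot("/path/to/sync_dir")))
```

Any path this reports was given a new timestamp by something other than a content edit, which is the situation described above.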

I wanted to activate the bypass data preservation feature, but I guess there is a safer alternative.

What this option does is: if the 'online' version is 'different' to the local copy (after comparing size, timestamp and file hash), the client does not rename the local copy that is 'technically different', but simply replaces it with the online version. Normally, renaming the existing item gives you a 'backup' of the file in case something is really wrong, so you do not lose data. By enabling this option, you lose that protection mechanism.
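The difference between the two behaviours can be sketched as follows. This is an illustration under assumptions, not the client's implementation: the function names are hypothetical, and the backup name simply follows the `BOM.xlsx -> BOM-MP-XPS13.xlsx` pattern visible in the log output in this thread.

```python
import os
import tempfile
from typing import Optional


def backup_name(path: str, hostname: str) -> str:
    """Insert the host name before the extension, as seen in the logs."""
    stem, ext = os.path.splitext(path)
    return f"{stem}-{hostname}{ext}"


def apply_online_version(path: str, online_bytes: bytes, hostname: str,
                         bypass_data_preservation: bool = False) -> Optional[str]:
    """Write the online version to `path`; return the backup path, if any."""
    backup = None
    if os.path.exists(path) and not bypass_data_preservation:
        backup = backup_name(path, hostname)
        os.rename(path, backup)  # the local copy survives under a new name
    with open(path, "wb") as handle:  # with the bypass on, this overwrites
        handle.write(online_bytes)
    return backup


# Demo in a throwaway directory, with the bypass left off.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "BOM.xlsx")
    with open(target, "wb") as handle:
        handle.write(b"local edit")
    kept = apply_online_version(target, b"online version", "MP-XPS13")
    kept_name = os.path.basename(kept)

print(kept_name)  # BOM-MP-XPS13.xlsx
```

With `bypass_data_preservation=True` the function returns `None` and the previous local content is gone. The sketch does not handle a backup name that already exists.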

There is no 'safer' alternative here.

mp1994 commented 4 years ago

So, I have been doing some tests. I typically have this issue when I want to quickly edit a file and I need Windows to do that. I have set up my machine to virtualize the Win10 partition I use for the dual boot, so my typical workflow is:

I used Excel on Windows to edit a file, let's call it BOM.xlsx. Then, I issued a dry run sync (onedrive --dry-run --verbose --verbose --synchronize). Here is an interesting part of the log output (I have removed the full path to the file):

[DEBUG] The item we are syncing is a file
The local item has a different modified time 2020-Oct-21 17:51:26Z remote is 2020-Oct-21 17:40:47Z
The local item has a different hash
Remote item modified time is newer based on UTC time conversion
The local item is out-of-sync with OneDrive, renaming to preserve existing file and prevent data loss: (...)/BOM.xlsx -> (...)/BOM-MP-XPS13.xlsx
[DEBUG] DRY-RUN: Skipping local file rename

Rather interestingly, the local time (2020-Oct-21 17:51:26Z) matches the time of the last change (cross-checked with the OneDrive web app as well), while the remote modified time is earlier (2020-Oct-21 17:40:47Z). Running the sync without the dry-run option does in fact generate the duplicate file. It does not always happen: I tried the same thing with another file (a random .docx that I had not opened for months) and got no duplicates.

How's that happening? Will the bypass_data_preservation flag solve this, as is? Will it also prevent data loss, which should not happen with the workflow I described above?
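To put a number on the gap between the two timestamps in the log line above, they can be parsed as UTC and subtracted. A minimal sketch, assuming the log format is always `YYYY-Mon-DD HH:MM:SSZ` with English month abbreviations; `parse_log_time` is a hypothetical helper.

```python
from datetime import datetime, timezone

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_log_time(text: str) -> datetime:
    """Parse a timestamp such as '2020-Oct-21 17:51:26Z' as UTC."""
    date_part, time_part = text.rstrip("Z").split(" ")
    year, month_name, day = date_part.split("-")
    hour, minute, second = (int(value) for value in time_part.split(":"))
    return datetime(int(year), MONTHS[month_name], int(day),
                    hour, minute, second, tzinfo=timezone.utc)


first = parse_log_time("2020-Oct-21 17:51:26Z")   # labelled "local" in the log
second = parse_log_time("2020-Oct-21 17:40:47Z")  # labelled "remote" in the log
gap_seconds = int((first - second).total_seconds())
print(gap_seconds)  # 639
```

The two values are 10 minutes 39 seconds apart, far more than a rounding difference.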

abraunegg commented 4 years ago

@mp1994 OK ... so given you have provided no configuration details, let's start there. Please provide:

  1. Output of onedrive --display-config
  2. Your 'sync_dir' path - is that 'shared' between Linux and Windows or is it unique for each instance?
  3. Is this sync dir on a network path / shared resource?

Rather interestingly, the local time (2020-Oct-21 17:51:26Z) matches the time of the last change (cross-checked also with OneDrive web app), while the remote modified time is previous to that (2020-Oct-21 17:40:47Z)

OK .. looking at the code here in regards to this, whilst keeping in mind your usage scenario: in 2 out of 3 cases this messaging is correct (there is a difference), but the named source (remote) is wrong. This is an output display bug only, and I will fix it shortly so that it correctly references the source of comparison. Right now it always advises 'remote', but the data for 2 of the comparisons comes from the local database, not from remote. This is potentially the source of confusion.

Remote item modified time is newer based on UTC time conversion

When this log output is produced, however, it is correct: it is directly comparing the OneDrive API response for the item with the modified timestamp of the local file, and not comparing any database entry as per above. In each such instance, there is no logging error of the kind described above. I will add debugging output to this message, so that it is clear from the debug logs which timestamps were compared.

Running the sync without the dry-run option does in fact generate the duplicate file. It does not happen always: I tried doing the same thing for another file (a random .docx that I haven't been opening for months) and I got no duplicates.

The --dry-run option shows you what would happen. When you do not use it, the actions actually occur, i.e. files and duplicates are created as a method of data preservation.

How's that happening? Will the bypass_data_preservation flag solve this, as is? Will it also prevent data loss, which should not happen with the workflow I described above?

When you enable bypass_data_preservation, the duplicate files will not be created. It will not prevent data loss, because no backup is made of the file that is being replaced. However, this issue is being caused by your workflow, so enabling bypass_data_preservation should have zero impact for you with regard to data loss.

mp1994 commented 4 years ago

1. Output of `onedrive --display-config`

$ onedrive --display-config
Configuration file successfully loaded
onedrive version                       = v2.4.2-3-ged1d13b
Config path                            = /home/mattia/.config/onedrive
Config file found in config path       = true
Config option 'check_nosync'           = false
Config option 'sync_dir'               = /media/mattia/8C1E710B1E70EF96/Users/Mattia/OneDrive - Politecnico di Milano/
Config option 'skip_dir'               = FOTO|Microsoft Teams Chat Files|Microsoft Teams Data
Config option 'skip_file'              = ~*|.~*|*.tmp|*.url|*.ini|desktop.ini
Config option 'skip_dotfiles'          = true
Config option 'skip_symlinks'          = true
Config option 'monitor_interval'       = 300
Config option 'min_notify_changes'     = 5
Config option 'log_dir'                = /var/log/onedrive/
Config option 'classify_as_big_delete' = 1000
Config option 'sync_root_files'        = false
Selective sync configured              = false

2. Your 'sync_dir' path - is that 'shared' between Linux and Windows or is it unique for each instance?

Yes, the path is indeed shared: it's the OneDrive sync dir under Windows. I mount the partition and use that under Ubuntu too. I know it's potentially dangerous, but it allows me to optimize disk utilization. Anyway, I guess that at synchronization time the local file and the remote file should be identical, as the local one is the same for Ubuntu and Windows.

I'm rather convinced that bypass_data_preservation is enough to solve this, but I guess there could also be a way to avoid these duplicates without enabling that feature.

abraunegg commented 4 years ago

@mp1994

Yes, the path is indeed shared: it's the OneDrive sync dir under Windows. I mount the partition and use that under Ubuntu too. I know it's potentially dangerous but it allows to optimize disk utilization.

OK .. so some caveats here as well. When using --monitor (which you have indicated you are doing), if the file system is exFAT, NTFS or however else you are mounting it (something other than a Linux file system), inotify events for local changes most likely will not be occurring, thus also contributing to your issue. The only time local changes to local files will be picked up is when the sync process is running.

I'm rather convinced that bypass_data_preservation is enough to solve this, but I guess there could also be a way to avoid these duplicates without enabling that feature.

There is no other way to solve this, as the 'database' thinks the file is X, but it is Y, because it was modified via another process, with the client not running - this is why there is no better way to solve this.

To assist with clarifying the timestamp issue, please can you use the following PR:

git clone https://github.com/abraunegg/onedrive.git
cd onedrive
git fetch origin pull/1103/head:pr1103
git checkout pr1103
./configure; make clean; make;

When running the PR version, the version will be: onedrive v2.4.6-9-g9ac70c0 or greater.
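Checking "this version or greater" by eye is error-prone, so a small comparison helper may be useful. The sketch assumes the version strings follow the `vMAJOR.MINOR.PATCH-COMMITS-gHASH` shape seen in this thread; `version_key` is a hypothetical helper, not something the client provides.

```python
import re
from typing import Tuple

# Matches e.g. "v2.4.6-9-g9ac70c0"; the "-COMMITS-gHASH" suffix is optional.
VERSION_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)(?:-(\d+)-g[0-9a-f]+)?")


def version_key(text: str) -> Tuple[int, int, int, int]:
    """Turn a version string into a tuple that sorts in release order."""
    match = VERSION_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    major, minor, patch, commits = match.groups()
    return int(major), int(minor), int(patch), int(commits or 0)


installed = version_key("onedrive v2.4.2-3-ged1d13b")
required = version_key("onedrive v2.4.6-9-g9ac70c0")
is_new_enough = installed >= required
print(is_new_enough)  # False
```

Tuples compare element by element, so the commit count only matters when the three release numbers are equal.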

mp1994 commented 4 years ago

The only time local changes to local files will be picked up is when the sync process is running.

Uhm, what do you mean here? If I create a new file in my OneDrive folder under linux, it will sync automatically at the next monitor interval.

git clone https://github.com/abraunegg/onedrive.git
cd onedrive
git fetch origin pull/1103/head:pr1103
git checkout pr1103
./configure; make clean; make;

When running the PR version, the version will be: `onedrive v2.4.6-9-g9ac70c0` or greater.

I will test this soon. Just to make sure, where should I install this? Should I replace my current installation with this PR?

abraunegg commented 4 years ago

@mp1994

Uhm, what do you mean here? If I create a new file in my OneDrive folder under linux, it will sync automatically at the next monitor interval.

Normally, when running --monitor, local changes are detected using inotify and are thus uploaded in near real time. If the local file system does not support inotify (which I suspect is the case for you), local changes will only be uploaded when the next sync cycle occurs, which by default is every 300 seconds.

Because of the way you have configured things, you most likely will not be utilising inotify.
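To see which file system type a sync directory actually lives on, the mount table can be consulted. The sketch below parses text in the `/proc/mounts` format; `filesystem_type` is a hypothetical helper, and the sample table (device names, options) is invented for the example. An NTFS partition mounted through ntfs-3g typically shows up there as `fuseblk`.

```python
from typing import Optional


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the type of the most specific mount point containing `path`."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # /proc/mounts writes a space inside a mount point as \040.
        mount_point = fields[1].replace("\\040", " ")
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        if inside and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fields[2]
    return best_type


# Invented sample in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/nvme0n1p5 / ext4 rw,relatime 0 0
/dev/nvme0n1p3 /media/mattia/8C1E710B1E70EF96 fuseblk rw,nosuid,nodev 0 0
"""

sync_dir = "/media/mattia/8C1E710B1E70EF96/Users/Mattia"
print(filesystem_type(sync_dir, SAMPLE_MOUNTS))  # fuseblk
```

On a real system the second argument would be the content of `/proc/mounts`, read with `open("/proc/mounts").read()`.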

I will test this soon. Just to make sure, where should I install this? Should I replace my current installation with this PR?

Run the PR from the PR directory to ensure that the code changes reflect the logging change. You can install it - this is your choice.

mp1994 commented 4 years ago

This is what I get from onedrive version v2.4.2-3-ged1d13b

The item we are syncing is a file
The local item has a different modified time 2020-Oct-22 16:44:01Z remote is 2020-Oct-22 16:35:07Z
The local item has a different hash
Remote item modified time is newer based on UTC time conversion
The local item is out-of-sync with OneDrive, renaming to preserve existing file and prevent data loss: TEST_123/Test.docx -> TEST_123/Test-MP-XPS13.docx

And this is what I get from the PR version (onedrive v2.4.6-9-g9ac70c0)

[DEBUG] OneDrive change is a new local item
The local item has a different modified time 2020-Oct-22 16:44:01Z when compared to remote modified time 2020-Oct-22 16:44:02Z
[DEBUG] The item to sync is already present on the local file system and is in-sync with the local database
[DEBUG] Inserting item details to local database

So now it appears the timestamps are the same and hence no conflict is created

abraunegg commented 4 years ago

@mp1994

This is what I get from onedrive version v2.4.2-3-ged1d13b

This is why step 1 of the support steps listed here is to check your version. It is not surprising that an older version has / had a bug.

The only change in the PR version was output text - so this 'bug' you were seeing was also contributed to by you running old versions.

Please check your system, and all your systems, for any 'onedrive' binaries and remove all except the latest version from 'master', which currently is: onedrive v2.4.6-10-ga69e405
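A quick way to list every `onedrive` executable reachable through `PATH` is sketched below. `find_all_executables` is a hypothetical helper. It only looks in `PATH` directories, so a copy run directly from a build directory would not be found this way.

```python
import os
from typing import List


def find_all_executables(name: str, search_path: str) -> List[str]:
    """List every executable called `name` in the given PATH-style string."""
    found: List[str] = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        is_executable = os.path.isfile(candidate) and os.access(candidate, os.X_OK)
        if is_executable and candidate not in found:
            found.append(candidate)
    return found


# Earlier entries take precedence, so the first line printed is the copy
# that a plain `onedrive` command would run.
for binary in find_all_executables("onedrive", os.environ.get("PATH", "")):
    print(binary)
```

If more than one path is printed, an older copy may be the one that actually runs.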

Issue ticket closed automatically when PR was merged.
