Backblaze / b2-sdk-python

Python library to access B2 cloud storage.

Custom sync policy? #296

Closed retorquere closed 2 years ago

retorquere commented 2 years ago

Is there a way to create a custom sync policy? I want to keep some files in the bucket unchanged even though their local counterparts have newer timestamps. I've tried exclude regexes, but that treats the excluded files as missing and deletes them in the bucket.

ppolewicz commented 2 years ago

Sure, you need to implement this interface: https://github.com/Backblaze/b2-sdk-python/blob/9ed5b69df2d0fc997b54bbdce756392ce082d449/b2sdk/sync/policy_manager.py#L17

and then you need to feed it into the sync. Sync was designed to be customizable, so it shouldn't cause you much trouble. Let us know how it works out for you, or ask if you run into trouble. Also, note that you can easily write fast tests using raw_simulator: write the test case first, get it running (and failing) against the stock SyncPolicyManager, then adjust the code and see it pass once you're done.

retorquere commented 2 years ago

And then I'd have to create a custom sync policy class to instantiate and return from get_policy, right? Which inherits from AbstractFileSyncPolicy?

The policy I need is probably simple:

ppolewicz commented 2 years ago

I think you can use the existing policies; you'd just decide which policy to use differently than the original synchronizer does. You can even inherit from SyncPolicyManager and handle the cases you care about, and when a case doesn't match your specific pattern, call

    return super().get_policy(
        sync_type,
        source_path,
        source_folder,
        dest_path,
        dest_folder,
        now_millis,
        delete,
        keep_days,
        newer_file_mode,
        compare_threshold,
        compare_version_mode,
        encryption_settings_provider,
    )

to do whatever the standard synchronizer would do in those other cases.
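
For example, a minimal sketch of that shape (the class name and the special-case condition are illustrative, not part of the SDK; the parameter list mirrors the snippet above):

    from b2sdk.sync.policy_manager import SyncPolicyManager

    class MyPolicyManager(SyncPolicyManager):
        def get_policy(self, sync_type, source_path, source_folder, dest_path,
                       dest_folder, now_millis, delete, keep_days, newer_file_mode,
                       compare_threshold, compare_version_mode,
                       encryption_settings_provider):
            # Hypothetical special case (placeholder):
            # if <this file pair needs special handling>:
            #     return <an instance of one of the existing policy classes>
            #
            # Everything else: do whatever the standard synchronizer would do.
            return super().get_policy(
                sync_type, source_path, source_folder, dest_path, dest_folder,
                now_millis, delete, keep_days, newer_file_mode, compare_threshold,
                compare_version_mode, encryption_settings_provider,
            )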

retorquere commented 2 years ago

I don't understand. Is get_policy called per filename? At what point do I intervene?

ppolewicz commented 2 years ago

It is called for every filename, yes.

I don't understand the second question.

retorquere commented 2 years ago

I thought a policy spanned all files in the sync. It seems I got that wrong -- get_policy is called per file-pair, and returns a policy object which implements some methods that decide what needs to be done with this pair. Which methods must this policy object implement?

ppolewicz commented 2 years ago

All the policy objects you need are already implemented (with the possible exception of a "do nothing" policy). Click here to see an example policy (if you really want to go this way). Above that line is the abstract base class, and it has only one abstract method.

retorquere commented 2 years ago

Ah OK, so my policy manager only needs to pick the right one! Got it. Yeah my needs are pretty niche -- I'm maintaining a deb repo in a bucket and I need all files locally to rebuild it, but I don't need to upload the unchanged files.

ppolewicz commented 2 years ago

I think in your case the .deb files have different names for different versions, so b2 sync --compareVersions none would assume the files are exactly the same by just looking at the file name (not checking size, not checking mod time, not checking contents, just the file name). This mode is already supported in b2-sdk-python and in the CLI.

retorquere commented 2 years ago

The thing is, any non-.deb file must always be updated.

ppolewicz commented 2 years ago

Yes, you can run the sync process twice: once excluding everything non-.deb with --excludeRegex, and then again for the .deb files only.

retorquere commented 2 years ago

Then I must have done this wrong previously. So excluded files are not touched at all, rather than being treated as absent? I thought I saw the latter. But if this two-stage sync works, that's great.

ppolewicz commented 2 years ago

I think it should be possible as it is. If it's not, then perhaps the b2 CLI should be adjusted to support it. If it won't work for you, please file an issue in the b2 CLI repo with a reproduction scenario.

no-response[bot] commented 2 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate or assist you further.

retorquere commented 2 years ago

you can easily write fast tests using raw_simulator: write the test case first, get it running (and failing) against the stock SyncPolicyManager, then adjust the code and see it pass once you're done.

Is there documentation or sample code for this?

ppolewicz commented 2 years ago

I suggest you look at this: https://github.com/Backblaze/b2-sdk-python/blob/900df6b6921692cc8bc812c5d199b5989bdefcef/test/unit/bucket/test_bucket.py#L181 and at the other tests that inherit from it.

retorquere commented 2 years ago

I found that, but I don't see how that interacts with a custom policy manager.

retorquere commented 2 years ago

I suppose I must somehow pass this to sync_folders? I really can't make sense of how this all hangs together. Would it be possible to provide me with a scaffold that does policy selection and bucket emulation? I can probably take it from there.

ppolewicz commented 2 years ago

You should look at how the original sync uses it (going all the way up to the B2 CLI if you have to). The simulator is installed at the B2Api level; B2Api creates the Bucket objects, and when you sync between two Bucket objects you use whichever raw API implementation was provided to B2Api (raw_simulator for the emulator, raw_api for production, roughly).

retorquere commented 2 years ago

I don't really understand what's being said here. Should I piece it together by taking the B2 CLI and stripping it down until it does what I need? I already find it difficult to see how the parts coordinate; the full B2 CLI will be even more complex, no?

ppolewicz commented 2 years ago

Ah, I know what's up. Synchronizer class has a bug which prevents users from easily replacing the policy manager. I'll try to fix it next week.

ppolewicz commented 2 years ago

@retorquere please review #305; there is an example in a test now.

retorquere commented 2 years ago

Given that this is a PR, how can I test this?

ppolewicz commented 2 years ago

git clone git@github.com:Backblaze/b2-sdk-python.git
cd b2-sdk-python
git checkout sync_customizability
pip install .

retorquere commented 2 years ago

I'm trying to turn the test case into a standalone script, but I'm once again stuck. With this

from b2sdk.sync.policy import UpPolicy
from b2sdk.sync.action import B2DownloadAction, B2UploadAction, B2CopyAction, AbstractSyncEncryptionSettingsProvider, UploadSourceLocalFile
from b2sdk.sync.policy_manager import SyncPolicyManager

def assert_folder_sync_actions(self, synchronizer, src_folder, dst_folder, expected_actions):
  """
  Checks the actions generated for one file.  The file may or may not
  exist at the source, and may or may not exist at the destination.
  The source and destination files may have multiple versions.
  """
  actions = list(
    self._make_folder_sync_actions(
      synchronizer,
      src_folder,
      dst_folder,
      TODAY,
      self.reporter,
    )
  )
  assert expected_actions == [str(a) for a in actions]

class MySyncPolicyManager(SyncPolicyManager):
  def get_policy_class(self, sync_type, delete, keep_days):
    return UpPolicy

synchronizer = synchronizer_factory(
  compare_version_mode=CompareVersionMode.SIZE,
  keep_days_or_delete=KeepOrDeleteMode.DELETE,
  sync_policy_manager=MySyncPolicyManager(),
)
src = folder_factory('local', ('a.txt', [200], 11))
dst = folder_factory('b2', ('a.txt', [100], 10))
# normally_expected = [
#     'b2_upload(/dir/a.txt, folder/a.txt, 200)',
#     'b2_delete(folder/a.txt, id_a_100, (old version))'
# ]
expected = ['b2_upload(/dir/a.txt, folder/a.txt, 200)']
self.assert_folder_sync_actions(synchronizer, src, dst, expected)

I do not know where to get synchronizer_factory, and I also don't see where the raw_simulator comes in.

Is there really no documentation from which I could get a sense of what concepts are used and how they hang together (not API docs; API docs are for reference once I already know broadly how things work)? I feel pretty shitty about asking for what I fear will just boil down to "implement this for me", but things are just not clicking for me at all.

ppolewicz commented 2 years ago

There is a classification of documentation types here: https://documentation.divio.com. I think we do have a tutorial, a bit of a how-to guide and a reference, but we are short on the explanation side.

There are two things you need to do; let's do the customization first, and once that works, let's move on to running things on a simulator.

When using the version from the branch, you can pass a sync_policy_manager keyword argument to the constructor of Synchronizer. When you do that, it will use that custom policy manager instead of the default one. If my understanding is correct, this is what you wanted to do, right? Does it work for you?
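
A minimal sketch of that, assuming the sync_customizability branch from #305 and reusing the MySyncPolicyManager class from the script above (the import path and the worker count are assumptions, and the exact constructor signature may differ between SDK versions):

    from b2sdk.sync.sync import Synchronizer

    synchronizer = Synchronizer(
        max_workers=4,
        sync_policy_manager=MySyncPolicyManager(),
    )

You would then call synchronizer.sync_folders(...) as you otherwise would; only the policy selection changes.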

retorquere commented 2 years ago

Yes, this works.

retorquere commented 2 years ago

How does the b2sdk decide on the sync_type?

Sure, you need to implement this interface:

https://github.com/Backblaze/b2-sdk-python/blob/9ed5b69df2d0fc997b54bbdce756392ce082d449/b2sdk/sync/policy_manager.py#L17

When I copy this class verbatim as a start for a policy manager, I get

AttributeError: 'RepoPolicyManager' object has no attribute 'exclude_all_symlinks'

ppolewicz commented 2 years ago

Do not mistake SyncPolicyManager for ScanPolicyManager.

retorquere commented 2 years ago

My current state is at https://gist.github.com/8895e20ea17afc14ec55cad441e8887c

retorquere commented 2 years ago

Wait what does the scan manager do then?

ppolewicz commented 2 years ago

ScanPolicyManager decides which files (and file versions) to care about. You can sync only .mp3 files, for example, if that's what you wish, and the scan manager performs the filtering for you.

Line 106 should be sync_policy_manager = RepoPolicyManager(), not policies_manager = RepoPolicyManager()
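
For illustration, a rough sketch of how the two plug into the Synchronizer on the #305 branch (in the SDK the scan-side class is spelled ScanPoliciesManager; the regex and worker count are arbitrary, and RepoPolicyManager is the custom manager from your gist):

    from b2sdk.sync.scan_policies import ScanPoliciesManager
    from b2sdk.sync.sync import Synchronizer

    # The scan manager filters which files get looked at in the first place...
    scan_manager = ScanPoliciesManager(exclude_file_regexes=(r'.*\.tmp$',))

    # ...while the sync policy manager decides what to do with each file pair
    # that survives the scan.
    synchronizer = Synchronizer(
        max_workers=4,
        policies_manager=scan_manager,
        sync_policy_manager=RepoPolicyManager(),
    )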

retorquere commented 2 years ago

Oh interesting - why not just have the policy manager return a NoOpPolicy for files that we don't want to deal with rather than a separate manager?

ppolewicz commented 2 years ago

because ScanPolicyManager may skip over directories it doesn't want to handle and in your design it would have to emit a NoOpPolicy for every single file version in there (and there may be millions of file versions this sync operation should not care about)

retorquere commented 2 years ago

Got it. I think I have it working! So, how to add the simulator?

retorquere commented 2 years ago

because ScanPolicyManager may skip over directories it doesn't want to handle and in your design it would have to emit a NoOpPolicy for every single file version in there (and there may be millions of file versions this sync operation should not care about)

Just as a point of curiosity -- wouldn't a NoOpPolicy for the folder passed to get_policy do that? Not contesting the scan manager, just curious.

ppolewicz commented 2 years ago

Maybe it could be implemented your way. I actually prefer the current implementation because it keeps the policy decisions and the scan filtering in separate classes - merging those would create a big class that would be harder to work with.

You should try something like

api = B2Api(
    account_info, api_config=B2HttpApiConfig(_raw_api_class=RawSimulator)
)
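
To get that API object authorized and give yourself a bucket to work with, the setup is roughly what the SDK's own unit tests (the test_bucket.py linked earlier) do; RawSimulator, B2HttpApiConfig and an account-info class such as InMemoryAccountInfo are assumed to be importable for your SDK version:

    simulator = api.session.raw_api                      # the RawSimulator instance
    account_id, master_key = simulator.create_account()  # fake credentials
    api.authorize_account('production', account_id, master_key)
    bucket = api.create_bucket('my-bucket', 'allPrivate')
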
retorquere commented 2 years ago

Maybe it could be implemented your way. I actually prefer the current implementation because it keeps the policy decisions and the scan filtering in separate classes - merging those would create a big class that would be harder to work with.

Fair enough.

You should try something like

api = B2Api(
    account_info, api_config=B2HttpApiConfig(_raw_api_class=RawSimulator)
)

I can do that but I'm not sure what this changes about the behavior. If there's a simulator I assume I can feed it pseudo-files to mimic the actual behavior?

ppolewicz commented 2 years ago

It's a cloud-in-RAM, so please don't upload large files or you'll run out of memory. You can create a local directory with some files and sync it to the fake API, then inspect it to see that it synced the files you wanted and skipped the files you didn't. You can then modify the local directory, run the sync again, and inspect again, all without waiting on the latency of talking to a cloud on the other side of the internet, so it's much faster. To ensure b2-sdk-python quality we run hundreds of tests against the simulator in seconds, while a few dozen tests against a real cloud (which we run in the b2 CLI extended test suite) can take several minutes.

retorquere commented 2 years ago

Since my sync is purely filename-based, a single character will do the job for the purpose of tests. But I meant: how do I set up this raw API to achieve what you describe? Surely there must be some way to pre-populate the simulator before I run the test, but I don't see how that one line can do that.

There are some files that I always want to deem different (so that the files in the bucket are always replaced), regardless of timestamps or file size. If I use CompareVersionMode.NONE and NewerFileSyncMode.REPLACE, does that give me this outcome?

ppolewicz commented 2 years ago

You can upload the files to the fake bucket the same way you upload files to the real bucket.
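
For example, a sketch of pre-populating the simulated bucket before a test run (the file names and contents are made up, bucket is the Bucket object created against the RawSimulator-backed api above, and the exact ls() signature varies a little between SDK versions):

    # Put the bucket into a known-good starting state.
    bucket.upload_bytes(b'old Packages contents', 'Packages')
    bucket.upload_bytes(b'deb payload', 'pool/foo_1.0_amd64.deb')

    # After running the sync, inspect what ended up in the bucket.
    for file_version, _folder_name in bucket.ls(recursive=True):
        print(file_version.file_name, file_version.size)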

I don't think CompareVersionMode.NONE will do the trick for you. Not sure why you would upload a file again if it has the same name, size and hash as an existing one in the cloud?

retorquere commented 2 years ago

You can upload the files to the fake bucket the same way you upload files to the real bucket.

OK. I thought I could pre-populate it as a known-good state for the actual test. This will work; I'll get on it.

I don't think CompareVersionMode.NONE will do the trick for you.

But then how do I indicate "when trying to sync the file Packages, assume it is newer even if its timestamp is older than the one in the bucket"?

I have two "kinds" of files in the bucket:

Not sure why you would upload a file again if it has the same name, size and hash as an existing one in the cloud?

I didn't know it also checked the hash. The file size isn't a sufficient check, because it can be the same while the contents are not, in which case I'd want to upload again. A hash would catch that. Is that automatic?

ppolewicz commented 2 years ago

Comparison by hash is the default, yes (for reasonably "small" files, at least; B2 supports multi-terabyte objects, and those are handled a bit differently).

retorquere commented 2 years ago

All non-.deb files are minuscule. So this would do it then? Or no? The debs are some 50-70 MB; would that trigger a hash check? That would be perfect, actually: a hash mismatch should always mean a re-upload, regardless of the file type.

That would work as a general rule for me: just don't look at the file times, only look at the hashes.

retorquere commented 2 years ago

Waaaait... "sync folder -> bucket, don't look at the file times, only look at the hashes" sounds like something the command line b2 client could maybe do by default?

retorquere commented 2 years ago

That'll teach me. I always tell people to tell me their problem, not their solution. Caught in my own trap.

ppolewicz commented 2 years ago

Actually, I think the default configuration of sync checks the timestamps and maybe sizes, not hashes. If you'd like there to be a setting that verifies the hash even if the size and timestamp are the same, please submit a pull request. It's not a good default, because file modification time is usually a reliable indicator of a file having changed in any way, but it could be an option, why not.

retorquere commented 2 years ago

OK but then we have the problem that I don't yet know what combination of settings gets me that behavior :D, because you said

I don't think CompareVersionMode.NONE will do the trick for you.

I figured CompareVersionMode.NONE meant "filename only"; then when you explained about the hash I figured it was filename + hash; now I just don't know.

But "filename + hash, regardless of timestamp, and overwrite the cloud version if the hashes differ; delete all that are not in the local folder" is what I need for all files, I now realize. But I don't know how to do that.

I see CompareVersionMode can be MODTIME (don't want that), SIZE (but if the hash is already checked, why bother with the size check), and NONE (which I took to mean "except hash checks, which always happen"). Between those, NONE seemed most relevant.

ppolewicz commented 2 years ago

I was wrong. There is no CompareVersionMode that would check the hash... yet. It could be added to the enum (and here), though.

retorquere commented 2 years ago

So... the best option I currently have is to use NONE, then? Edit: is there an option for "always replace" (which I'd want for the non-debs if I can't check by hash)?