Is there a way to create a custom sync policy? I want to keep some files in the bucket unchanged even though the local counterpart has these files with a newer timestamp. I've tried with exclude regexes, but that treats the excluded files as missing and deletes them in the bucket.
Sure, you need to implement this interface: https://github.com/Backblaze/b2-sdk-python/blob/9ed5b69df2d0fc997b54bbdce756392ce082d449/b2sdk/sync/policy_manager.py#L17
and then you need to feed it into the sync. Sync was designed to be customizable, so it shouldn't cause you much trouble. Let us know how it works out for you, or please ask if you run into trouble. Also, please note that you can easily write fast tests using raw_simulator. This way you can write the test case first, and then once you have it running (on the stock SyncPolicyManager) and failing, you can adjust the code and see it pass the tests when you are done.
And then I'd have to create a custom sync policy class to instantiate and return from get_policy, right? One which inherits from AbstractFileSyncPolicy?
The policy I need is probably simple:
I think you can use the existing policies; you'd just decide differently than the original synchronizer which policy to use. You can even inherit from SyncPolicyManager and do something special in the cases you care about, but if the case doesn't match your specific pattern,
return super().get_policy(
    sync_type,
    source_path,
    source_folder,
    dest_path,
    dest_folder,
    now_millis,
    delete,
    keep_days,
    newer_file_mode,
    compare_threshold,
    compare_version_mode,
    encryption_settings_provider,
)
to do whatever the standard synchronizer would do in those other cases.
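For illustration, a minimal sketch of what that could look like (the class name, the .deb condition and the relative_path attribute are assumptions for this example, not verbatim sdk guidance; import locations follow the b2sdk.sync modules linked above):

from b2sdk.sync.policy import CompareVersionMode
from b2sdk.sync.policy_manager import SyncPolicyManager

class RepoPolicyManager(SyncPolicyManager):
    """Hypothetical manager: compare .deb files by name only, defer
    every other file pair to the stock behavior."""

    def get_policy(
        self, sync_type, source_path, source_folder, dest_path, dest_folder,
        now_millis, delete, keep_days, newer_file_mode, compare_threshold,
        compare_version_mode, encryption_settings_provider,
    ):
        if source_path is not None and source_path.relative_path.endswith('.deb'):
            # Treat an existing .deb in the bucket as up to date,
            # no matter what the timestamps say.
            compare_version_mode = CompareVersionMode.NONE
        # Everything else: whatever the standard synchronizer would do.
        return super().get_policy(
            sync_type, source_path, source_folder, dest_path, dest_folder,
            now_millis, delete, keep_days, newer_file_mode, compare_threshold,
            compare_version_mode, encryption_settings_provider,
        )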
I don't understand. Is get_policy called per filename? What's the point where I intervene?
It is called for every filename, yes.
I don't understand the second question.
I thought a policy spanned all files in the sync. It seems I got that wrong: get_policy is called per file pair, and returns a policy object which implements some methods that decide what needs to be done with this pair. Which methods must this policy object implement?
All the policy objects you need are already implemented (with the possible exception of a "do nothing" policy). Click here to see an example policy (if you really want to go this way). Above that line is the abstract class, and it only has one abstract method.
Ah OK, so my policy manager only needs to pick the right one! Got it. Yeah my needs are pretty niche -- I'm maintaining a deb repo in a bucket and I need all files locally to rebuild it, but I don't need to upload the unchanged files.
I think in your case .deb
files have different names for different versions, so b2 sync --compareVersions none
would assume the files are exactly the same by just looking at the file name (not checking size, not checking mod time, not checking contents - just the file name). This mode is already supported in b2-sdk-python and in cli.
The thing is, any non-deb file must always be updated.
Yes, you can run the sync process twice: once for the .deb files only, excluding everything else with --excludeRegex, and then again for the non-deb files.
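Roughly like this (a sketch only; the bucket name and paths are placeholders, and the exact flag combination is worth checking against b2 sync --help):

# pass 1: .deb files only; an existing name in the bucket counts as unchanged
b2 sync --excludeRegex '.*' --includeRegex '.*\.deb' --compareVersions none /local/repo b2://my-bucket/repo
# pass 2: everything except .deb files, replacing newer bucket copies too
b2 sync --excludeRegex '.*\.deb' --replaceNewer /local/repo b2://my-bucket/repo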
I must have done this wrong then previously. Excluded files are not touched at all, rather than being treated as absent? I thought that's what I saw. But if this two-stage sync should work, that's great.
I think it should be possible as it is. If it's not, then perhaps b2cli should be adjusted to support it. If it doesn't work for you, please file an issue in the b2cli repo with a reproduction scenario.
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate or assist you further.
> you can easily write fast tests using raw_simulator. This way you can write the test case first, and then once you have it running (on the stock SyncPolicyManager) and failing, you can adjust the code and see it pass the tests when you are done.
Is there documentation or sample code for this?
I suggest you look at this: https://github.com/Backblaze/b2-sdk-python/blob/900df6b6921692cc8bc812c5d199b5989bdefcef/test/unit/bucket/test_bucket.py#L181 and the other tests that inherit from it.
I found that, but I don't see how it interacts with a custom policy manager. I suppose I must somehow pass this to sync_folders? I really can't make sense of how this all hangs together. Would it be possible to provide me with a scaffold that does policy selection and bucket emulation? I can probably take it from there.
You should look at how the original sync uses it (all the way up to the B2 CLI if you have to). The simulator is installed at the B2Api level, which creates a Bucket object; when you then call sync between two Bucket objects, you'll use whichever raw_api interface was provided to B2Api (raw_simulator for the emulator or raw_api for production, probably).
I don't really understand what's being said here. Should I piece it together by taking the B2 CLI and stripping it down until it does what I need? I already find it difficult to see the coordinated parts; the full B2 CLI will be even more complex, no?
Ah, I know what's up. Synchronizer class has a bug which prevents users from easily replacing the policy manager. I'll try to fix it next week.
@retorquere review #305 please, there is an example in a test now
Given that this is a PR, how can I test this?
git clone git@github.com:Backblaze/b2-sdk-python.git
cd b2-sdk-python
git checkout sync_customizability
pip install .
I'm trying to turn the test case into a standalone script, but I'm once again stuck. With this:
from b2sdk.sync.policy import UpPolicy
from b2sdk.sync.action import B2DownloadAction, B2UploadAction, B2CopyAction, AbstractSyncEncryptionSettingsProvider, UploadSourceLocalFile
from b2sdk.sync.policy_manager import SyncPolicyManager

def assert_folder_sync_actions(self, synchronizer, src_folder, dst_folder, expected_actions):
    """
    Checks the actions generated for one file. The file may or may not
    exist at the source, and may or may not exist at the destination.
    The source and destination files may have multiple versions.
    """
    actions = list(
        self._make_folder_sync_actions(
            synchronizer,
            src_folder,
            dst_folder,
            TODAY,
            self.reporter,
        )
    )
    assert expected_actions == [str(a) for a in actions]

class MySyncPolicyManager(SyncPolicyManager):
    def get_policy_class(self, sync_type, delete, keep_days):
        return UpPolicy

synchronizer = synchronizer_factory(
    compare_version_mode=CompareVersionMode.SIZE,
    keep_days_or_delete=KeepOrDeleteMode.DELETE,
    sync_policy_manager=MySyncPolicyManager(),
)
src = folder_factory('local', ('a.txt', [200], 11))
dst = folder_factory('b2', ('a.txt', [100], 10))
# normally_expected = [
#     'b2_upload(/dir/a.txt, folder/a.txt, 200)',
#     'b2_delete(folder/a.txt, id_a_100, (old version))'
# ]
expected = ['b2_upload(/dir/a.txt, folder/a.txt, 200)']
self.assert_folder_sync_actions(synchronizer, src, dst, expected)
I do not know where to get synchronizer_factory, and I also don't see where the raw_simulator comes in.
Is there really no documentation from which I could get a sense of which concepts are used and how they hang together (not API docs; API docs are for reference once I already know broadly how things work)? I feel pretty shitty about asking for what I fear will just boil down to "implement this for me", but things are just not clicking for me at all.
There is a classification of documentation types here: https://documentation.divio.com I think we have a tutorial, a little bit of a how-to guide, and a reference, but we are short on the explanation side.
There are two things you need to do. Let's do the customization first, and once that works, let's move on to running things on a simulator.
When using the version from the branch, you can pass a sync_policy_manager keyword argument to the constructor of Synchronizer. When you do that, it will use the custom policy manager instead of the default one. If my understanding is correct, this is what you wanted to do, right?
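For reference, a minimal sketch of that wiring (import locations and the parameter values here are guesses, not something taken from the PR itself):

from b2sdk.sync.policy import CompareVersionMode, NewerFileSyncMode
from b2sdk.sync.sync import KeepOrDeleteMode, Synchronizer

# RepoPolicyManager is the hypothetical custom SyncPolicyManager
# subclass sketched earlier in this thread.
synchronizer = Synchronizer(
    max_workers=10,
    compare_version_mode=CompareVersionMode.NONE,
    newer_file_mode=NewerFileSyncMode.REPLACE,
    keep_days_or_delete=KeepOrDeleteMode.DELETE,
    sync_policy_manager=RepoPolicyManager(),
)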
Does it work for you?
Yes, this works.
How does the b2sdk decide on the sync_type?
> Sure, you need to implement this interface:
When I copy this class verbatim as a start for a policy manager, I get
AttributeError: 'RepoPolicyManager' object has no attribute 'exclude_all_symlinks'
Do not mistake SyncPolicyManager for ScanPoliciesManager.
My current state is at https://gist.github.com/8895e20ea17afc14ec55cad441e8887c
Wait, what does the scan manager do then?
ScanPoliciesManager decides which files (and file versions) to care about. You can sync only .mp3 files, for example, if that's what you wish, and ScanPoliciesManager performs the filtering for you.
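For contrast with the policy manager, a sketch of what the scan-side filtering could look like (module path and argument names as I understand them for this era of the sdk):

from b2sdk.sync.scan_policies import ScanPoliciesManager

# Look only at .mp3 files; everything else is skipped during scanning,
# so no policy is ever requested for it.
scan_manager = ScanPoliciesManager(
    exclude_file_regexes=(r'.*',),
    include_file_regexes=(r'.*\.mp3',),
)

This object goes into the policies_manager argument, which is distinct from the sync_policy_manager discussed above.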
Line 106 should be sync_policy_manager = RepoPolicyManager(), not policies_manager = RepoPolicyManager().
Oh interesting - why not just have the policy manager return a NoOpPolicy for files that we don't want to deal with rather than a separate manager?
Because ScanPoliciesManager may skip over whole directories it doesn't want to handle, whereas in your design it would have to emit a NoOpPolicy for every single file version in there (and there may be millions of file versions this sync operation should not care about).
Got it. I think I have it working! So, how to add the simulator?
> Because ScanPoliciesManager may skip over whole directories it doesn't want to handle, whereas in your design it would have to emit a NoOpPolicy for every single file version in there (and there may be millions of file versions this sync operation should not care about).
Just as a point of curiosity -- wouldn't a NoOpPolicy for the folder passed to get_policy do that? Not contesting the scan manager, just curious.
Maybe it could be implemented your way. I actually prefer the current implementation because it keeps the policy decisions and the scan filtering in separate classes - merging those would create a big class that would be harder to work with.
You should try something like

api = B2Api(
    account_info,
    api_config=B2HttpApiConfig(_raw_api_class=RawSimulator),
)
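Spelled out a little more, following what the sdk's own unit tests do (a sketch; the exact import points may differ between versions):

from b2sdk.v1 import B2Api, B2HttpApiConfig, InMemoryAccountInfo, RawSimulator

account_info = InMemoryAccountInfo()
api = B2Api(account_info, api_config=B2HttpApiConfig(_raw_api_class=RawSimulator))

# The simulator starts out empty: create a fake account and log into it.
simulator = api.session.raw_api
application_key_id, master_key = simulator.create_account()
api.authorize_account('production', application_key_id, master_key)

# From here on, the api behaves like the real thing, just in RAM.
bucket = api.create_bucket('my-bucket', 'allPrivate')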
> Maybe it could be implemented your way. I actually prefer the current implementation because it keeps the policy decisions and the scan filtering in separate classes - merging those would create a big class that would be harder to work with.
Fair enough.
> You should try something like
> api = B2Api(account_info, api_config=B2HttpApiConfig(_raw_api_class=RawSimulator))
I can do that but I'm not sure what this changes about the behavior. If there's a simulator I assume I can feed it pseudo-files to mimic the actual behavior?
It's a cloud-in-RAM, so please don't upload large files or you'll run out of memory. You can create a local directory with some files and sync it to the fake API, then inspect the result to see that it synced the files you wanted to sync and didn't sync the files you didn't want to sync. You can then modify the local directory, run the sync again, and inspect again - without waiting for the latency of talking to a cloud on the other side of the internet, so it's way faster. To ensure b2-sdk-python quality we run hundreds of tests using the simulator in seconds, while a few dozen tests on a real cloud (which we run in the b2 CLI extended test suite) can take several minutes.
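Continuing the sketch above, one full cycle could look something like this (the parse_sync_folder and SyncReport usage mirrors what the b2 CLI does; treat it as a starting point, not a definitive recipe):

import sys
import time

from b2sdk.v1 import SyncReport, parse_sync_folder

source = parse_sync_folder('/path/to/local/repo', api)
destination = parse_sync_folder('b2://my-bucket/repo', api)

with SyncReport(sys.stdout, no_progress=True) as reporter:
    synchronizer.sync_folders(
        source_folder=source,
        dest_folder=destination,
        now_millis=int(round(time.time() * 1000)),
        reporter=reporter,
    )

# Inspect the fake bucket to verify what was (and wasn't) synced.
for file_version, _folder_name in bucket.ls(show_versions=True):
    print(file_version.file_name)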
Since my sync is purely filename-based, a single character will do the job for the purpose of tests. But I meant: how do I set up this raw API to achieve what you explain? Surely there must be some way to pre-populate this simulator before I run the test, but I don't see how that one line can do that.
There are some files that I always want to deem different (and so the files in the bucket are always replaced), regardless of timestamps or filesize. If I use CompareVersionMode.NONE and NewerFileSyncMode.REPLACE, does that give me this outcome?
You can upload the files to the fake bucket the same way you upload files to the real bucket.
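For example (a sketch; the repo-style file names are just placeholders):

# Pre-populate the fake bucket with a known-good state before the test run.
bucket.upload_bytes(b'fake deb contents', 'pool/main/h/hello/hello_1.0_amd64.deb')
bucket.upload_bytes(b'fake index contents', 'dists/stable/main/binary-amd64/Packages')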
I don't think CompareVersionMode.NONE will do the trick for you. Not sure why you would upload a file again if it has the same name, size and hash as an existing one in the cloud?
> You can upload the files to the fake bucket the same way you upload files to the real bucket.
OK. I thought I could pre-populate it as a known-good state for the actual test. This will work, I'll get on it.
> I don't think CompareVersionMode.NONE will do the trick for you.
But then how do I indicate "when trying to sync the file Packages, assume it to be newer even if the timestamp of the file is older than the one in the bucket"?

I have two "kinds" of files in the bucket:

- .deb packages, where the bucket object is considered newer if it exists, regardless of timestamps: never overwrite.
- Packages, where, if it differs, regardless of timestamps: always overwrite.

> Not sure why you would upload a file again if it has the same name, size and hash as an existing one in the cloud?
I didn't know it also checked the hash. The file size isn't a sufficient check because it can be the same while the contents are not, in which case I'd want to upload again. A hash check would do that. Is that automatic?
Comparison by hash is the default, yes (for reasonably "small" files, at least; B2 supports multi-terabyte objects and those are handled a bit differently).
All non-.deb files are minuscule. So this would do it then? Or no? The debs are some 50-70 MB; would that trigger a hash check? That would be perfect, actually; a hash mismatch should always mean a re-upload, regardless of the file type.
That would work as a general rule for me; just don't look at the file times, only look at the hashes.
Waaaait.... "sync folder -> bucket, don't look at the file times, only look at the hashes" sounds like something the command line b2 client could do by default maybe?
That'll teach me. I always tell people to tell me their problem, not their solution. Caught in my own trap.
Actually, the default configuration of sync, I think, checks the timestamps and maybe sizes, not hashes. If you'd like there to be a setting which verifies the hash even when the size and timestamp are the same, please submit a pull request. It's not a good default because file modification time is usually a reliable indicator of a file having changed in any way, but it could be an option, why not.
OK but then we have the problem that I don't yet know what combination of settings gets me that behavior :D, because you said
> I don't think CompareVersionMode.NONE will do the trick for you.
I figured CompareVersionMode.NONE meant "filename only"; then when you explained about the hash I figured it was filename + hash; now I just don't know.

But I now realize that "filename + hash, regardless of timestamp, overwrite the cloud version if the hashes differ, and delete everything that is not in the local folder" is what I need for all files. I just don't know how to do that.
I see CompareVersionMode can be MODTIME (don't want that), SIZE (but if the hash is already checked, why bother with the size check), and NONE (which I took to mean "except hash checks, which always happen"). Between those, NONE seemed most relevant.
I was wrong. There is no CompareVersionMode that would check the hash... yet. It can be added to the enum and here though.
So... the best option I currently have then is to use NONE? Edit: is there then an option for "always replace" (which I'd want to do for the non-debs if I can't check by hash)?