genadyp opened 10 years ago
I think we should go with 2.i. 2.ii seems more complicated: to handle one file we may need locks, appending to the file all the time and reading it fully. I don't want to work with parts of a file.
Contra: storing many content_data files takes a lot of space on the hard drive.
It is not so obvious. It looks like Shirai's use case is only to add files; in that case the statement is true, and it is always true on the backup machine. But by design we also need to store deletions, so the same instance can appear twice: once when added and once when deleted. That is a real scenario for the master/content machine.
What disturbs me regarding multiple files: how do we ensure that we do not lose any diff file? Is an ordinary backup enough? What about diffs of the content data of the backup machine?
We should have a predefined structure, one per day for example: one base file per day and several diff files in that directory. How would files be lost?
Maybe a key/value database would be a good solution here: low on upkeep, plus simple code instead of SQL statements.
I am thinking about object-relational mapping. I am not experienced with databases, but there should be an abstraction layer that provides a way to communicate with the database in an object-oriented way rather than building SQL statements.
I don't believe in such object-relational mappings; I think they will be hard to optimize. I think we should consider Redis.
We should use its scripting abilities to keep the data structure in Redis with all its commands.
Let's try building a design for a key/value database. We need access by SHA1 and we need access by server/filename, right?
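A minimal sketch of what such a two-way index could look like in memory. All names here are illustrative, not the actual bbfs API; the point is just that one store can serve both lookup directions:

```ruby
# Sketch of a content-data store indexed both ways (illustrative names,
# not the actual bbfs API). Each instance is keyed by its SHA1 checksum
# and by its location (server + filename).
class ContentIndex
  def initialize
    @by_checksum = Hash.new { |h, k| h[k] = [] }  # sha1 => [location, ...]
    @by_location = {}                             # [server, filename] => sha1
  end

  def add(sha1, server, filename)
    loc = [server, filename]
    @by_checksum[sha1] << loc
    @by_location[loc] = sha1
  end

  # All locations holding the content with this checksum.
  def locations_of(sha1)
    @by_checksum[sha1]
  end

  # The checksum of the file at this server/filename, or nil.
  def checksum_of(server, filename)
    @by_location[[server, filename]]
  end
end
```

In a key/value database the same shape maps to two key namespaces (e.g. `sha1:<checksum>` and `loc:<server>:<filename>`), kept consistent on every add.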
On second thought, maybe a relational database is a good option. It has the joins we need for operations on file sets.
Let's think about the following structure:
An entry opening with a minus sign means that the instance was removed from the file system.
An entry without the special sign means that a new instance was added.
To get the diff between "Year Y1 Month M1 Day D1" and "Year Y2 Month M2 Day D2": YearY1/MonthM1/Day(D1+1) + ... + YearY1/MonthM1/Day(last day of month M1)
When a removed entry is met (with the minus sign), remove the corresponding entry from the resulting diff.
Number of diffs to be processed, worst case: 1..(CurrentYear-2) year cumulative diffs + 22 month cumulative diffs + 60 day diffs.
Data is duplicated three times: day diff, month cumulative diff, year cumulative diff.
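The merge rule described above (a minus entry cancels a matching earlier add) can be sketched as follows. The entry format is an assumption for illustration, not the actual file layout:

```ruby
# Sketch: merge sequential diffs into one resulting diff.
# An entry starting with '-' removes the corresponding added entry;
# any other entry is an addition. Entry format is illustrative.
def merge_diffs(diff_entries_list)
  result = {}
  diff_entries_list.each do |entries|
    entries.each do |entry|
      if entry.start_with?('-')
        result.delete(entry[1..-1])  # removal cancels an earlier add
      else
        result[entry] = true
      end
    end
  end
  result.keys
end
```

Applying the day diffs of a range in chronological order through this merge yields the net diff for the range.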
What is a remove entry? Is it one row in a file?
I think we can store remove entries per day in a separate file; that way we don't need to change the code. I mean for each day we will have two files:
The .*.removed file will have entries for instances that were removed during that day (and not added back right away).
I would extend the algorithm to have multiple files per day, i.e., if I set it up to index my files once every 15 hours then I should get something like:
(This index may also be time related, i.e., backup.data-20140326-0802 or backup.data-20140326-1602, UTC time.)
That way you never have to "change" existing file content! You only add new files whenever you run your code. This will make the implementation simpler.
So basically we have two parameters: 1) index every X time; 2) save a full index every Y time, where X is one day (what Shiray does in practice) and Y is a month.
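The two parameters reduce to a simple decision at each flush: write a full snapshot when Y has elapsed since the last full save, otherwise write an incremental diff. A sketch, with illustrative names:

```ruby
# Sketch: decide at each flush whether to write a full snapshot or an
# incremental diff, based on the two parameters discussed above.
# Y = full-snapshot interval in seconds; flushes themselves happen every X.
class FlushScheduler
  def initialize(full_interval_seconds)
    @full_interval = full_interval_seconds
    @last_full = nil
  end

  # Returns :full or :incremental for a flush at time `now` (epoch seconds).
  def flush_kind(now)
    if @last_full.nil? || now - @last_full >= @full_interval
      @last_full = now
      :full
    else
      :incremental
    end
  end
end
```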
Actually this is very similar to what we have; we just generate a full file each time. Now we will have to create two diff files (one is the .removed file), and when reading the content data there should be slightly more complex logic for finding the earlier base file, which should also be marked, e.g.: backup.data-20140324.base, backup.data-20140324.full, or backup.data-20140324.cumulative.
I feel we are getting to a simple and efficient solution.
What is a remove entry? Is it one row in a file?
Actually, a file that was in a previous index but is absent in the current one.
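Under that definition, the .added and .removed entries between two consecutive indexes are plain set differences. A sketch, modeling each index as a set of file identifiers:

```ruby
require 'set'

# Sketch: compute the .added and .removed entries between two consecutive
# indexes, each modeled as a plain set of file identifiers.
def index_diff(previous_index, current_index)
  {
    added:   current_index - previous_index,  # new in the current index
    removed: previous_index - current_index   # present before, absent now
  }
end
```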
I think we can store remove entries per day in a separate file; that way we don't need to change the code. I mean for each day we will have two files:
Agree. Separate files for removed entries make the code simpler.
The .*.removed file will have entries for instances that were removed during that day (and not added back right away).
How can we track that a file/content was not added back right away?
One folder for all years is good enough for a start, i.e., no need for many folders. Let's just start with one.
We can create folders in a lazy way, when actually needed.
No need for a yearly cumulative file. Its purpose is not so clear.
The first file, whatever it is, has to be cumulative. There should be no need for exact time manipulation. I mean that all the calculations should converge the possible error rate to 0, i.e., it is not important if we index a file and it is gone after a minute; the next content data should have it in the removed file, and if for some reason it is not there, then the content data right after will have it in the removed file.
save full index every Y time.
Why do we need to store a full index?
backup.data-20140324.base or backup.data-20140324.full or backup.data-20140324.cumulative
What are the .full and .base files?
How can we track that the file/content was not added back right away? I think we don't need to. We just write a file at a specific moment in time; if the file is not there at that time but was there in the previous session, we write it. We don't mind making mistakes, as our process converges to the truth, i.e., we will index the files again later and the situation will be fixed.
We can create folders in a lazy way, when actually needed. Yes, but I don't want you to write code for this, not now; just assume one directory.
Why do we need to store a full index? What are the .full and .base files? Here is what Shiray wrote: 1) how to store many snapshots effectively (avoid huge duplicates in the data); 2) how to minimize the loading time of a snapshot into memory.
In order to do 2, we should store a full index file once in a while, i.e., once a week or once a month, the way we store files now: a full snapshot. Those are the .full or .base files.
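Loading a snapshot for a given date then becomes: find the latest .full file at or before that date and apply the later .added/.removed files in order. A sketch, with the file-naming convention assumed from the discussion above:

```ruby
# Sketch: reconstruct the snapshot at `target` (a 'YYYYMMDD' string) from
# file names like 'backup.data-20140324.full', 'backup.data-20140325.added',
# 'backup.data-20140325.removed'. `read` maps a file name to its entries.
def load_snapshot(files, target, read)
  stamp = ->(name) { name[/\d{8}/] }
  base = files.select { |f| f.end_with?('.full') && stamp[f] <= target }
              .max_by(&stamp)
  snapshot = read.call(base).dup
  files.sort_by(&stamp).each do |f|
    next if stamp[f] <= stamp[base] || stamp[f] > target
    snapshot |= read.call(f) if f.end_with?('.added')
    snapshot -= read.call(f) if f.end_with?('.removed')
  end
  snapshot
end
```

Lexicographic comparison works here because YYYYMMDD timestamps sort the same way as dates.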
@kolmanv Could you write a summary of your vision? It differs from this suggestion and I do not see the whole picture.
@genadyp @vshiray 1) One folder. 2) A parameter to save/flush the index to disk every X seconds (we have this parameter); it will save only the incremental changes since the previous snapshot. 3) A parameter to save a full index every Y seconds (new parameter).
That's it!
Is this clear enough?
Shiray please approve.
There are several topics for discussion:
1) The full data must be naturally ordered in time, so it can be named like "unix_timestamp.data".
2) A diff must be applied to the nearest data file. There are several options: a) The name part of the difference file is the same as that of the full data. Every record includes what has changed and at what time. We have a single difference file for every full data file. b) The name part is greater than that of the full data and equals the time the file is produced. Every record includes what has changed since the nearest full data. We have multiple difference files for every full data file.
The first option is more accurate because its resolution equals the indexing period without producing many difference files, but it is more complex to implement. The second option is good enough as well; it makes it easy to remove frequent difference files (after one week, one month, ...).
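Option (b) requires, for any difference file, locating the nearest preceding full data file by timestamp. With the `unix_timestamp.data` naming from point 1, that is a small lookup (a sketch; the naming is the only assumption):

```ruby
# Sketch: under option (b), a diff file is applied against the nearest
# full data file that precedes it. With 'unix_timestamp.data' naming, the
# nearest full is the largest timestamp not exceeding the diff's own.
def nearest_full(full_files, diff_timestamp)
  full_files
    .map { |name| name[/\d+/].to_i }   # extract the unix timestamp
    .select { |ts| ts <= diff_timestamp }
    .max
end
```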
Branch issue270_incremental_diff
Is this ready for review?
Branch issue270_incremental_diff https://github.com/bbfsdev/bbfs/tree/issue270_incremental_diff
not yet
@vshiray @kolmanv @yarondbb
Is this structure good enough:
date1.full: full snapshot for date1
date1-date2.added: files that were added from date1 till date2
date1-date2.removed: files that were removed from date1 till date2
date2-date3.added: files that were added from date2 till date3
date2-date3.removed: files that were removed from date2 till date3
Option 2:
date1.full: full snapshot for date1
date1-date2.added: files that were added from date1 till date2
date1-date2.removed: files that were removed from date1 till date2
date1-date3.added: files that were added from date1 till date3
date1-date3.removed: files that were removed from date1 till date3
Please review. Meanwhile I am implementing the first option.
It all depends on practical usage, I mean the diff between date1 and date2 and date3, and the diff between two ".full" files.
Implement what is easier.
Code review done.
Note: all date timestamps should guarantee that all files listed as stable before that second are in the content data. I mean that the monitoring service has to return the timestamp of the last scan (glob).
We have a few options for the diff, depending on how we store the content data.
Currently we can use the modification_time/indexation_time parameters for the diff between two timestamps.
Pro:
Contra: can be costly and take time (needs checking).
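The timestamp-based diff can be sketched like this; the field names follow the discussion, and the Entry structure is an assumption for illustration:

```ruby
# Sketch: derive the diff between two timestamps by filtering entries on
# their indexation_time field (the Entry struct is illustrative).
Entry = Struct.new(:path, :indexation_time)

# Entries indexed after t1 and up to t2 (half-open on the left).
def diff_between(entries, t1, t2)
  entries.select { |e| e.indexation_time > t1 && e.indexation_time <= t2 }
end
```

The "costly" concern above corresponds to the linear scan here; it would need indexing on indexation_time to be cheap at scale.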
Implementations:
Pro: Simple enough.
Contra: what if we lose a few such diff files? Is that a real scenario?
Pro: all the data is concentrated in one file.
Contra: looks more complicated than the first approach.