bbfsdev / bbfs

Big brother file system (distributed file system)

Diff between two timestamps: Design #270

Open genadyp opened 10 years ago

genadyp commented 10 years ago

We have a few options for the diff, depending on how we store the content data.

  1. Currently we can use modification_time/indexation_time parameters for the diff between two timestamps.
    Pro:

    • Minimum changes.
    • It is enough to have one file that contains the entire index; then there is no fear of missing changes, because all the index lives in that single file.

    Contra: Can be costly and slow (needs checking)

  2. Using incremental diff, i.e. we have a base and diffs for timestamps.
    Implementations:
    1. Base is a file and diffs are separate files
      Pro: Simple enough.
      Contra: What if we lose a few such diff files? Is that a realistic scenario?
    2. Having one file that contains the base and the diffs as separate sections. This is similar to the way CVS, for example, handles diffs.
      Pro: All data is concentrated in one file
      Contra: Looks more complicated than the first approach
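Approach 2.i can be sketched as follows. This is a minimal illustration only, with hypothetical in-memory dicts standing in for the base file and diff files; the real on-disk format is still being discussed in this thread:

```python
# Sketch of approach 2.i: one base snapshot plus separate diff files.
# Each diff records added entries (path -> metadata) and removed paths.

def apply_diffs(base, diffs):
    """Rebuild a snapshot by applying diffs, in order, on top of the base."""
    snapshot = dict(base)
    for diff in diffs:
        snapshot.update(diff.get("added", {}))
        for path in diff.get("removed", []):
            snapshot.pop(path, None)
    return snapshot

base = {"a.txt": "sha1-a", "b.txt": "sha1-b"}
diffs = [{"added": {"c.txt": "sha1-c"}, "removed": ["a.txt"]}]
print(apply_diffs(base, diffs))  # {'b.txt': 'sha1-b', 'c.txt': 'sha1-c'}
```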
bbfsdev commented 10 years ago
  1. Contra: Takes a lot of hard-drive space to store many content_data files.

I think we should go with 2.i. 2.ii seems more complicated: to handle one file we may need locks, appending to the file all the time, and reading it fully. I don't want to work with parts of a file.

genadyp commented 10 years ago

Contra: Takes a lot of hard-drive space to store many content_data files

It is not so obvious. It looks like Shirai's use case is only to add files; then the statement is true, and it is always true on the backup machine. But by design we also need to store deletions, so the same instance can appear twice: once when added and once when deleted. That is a real scenario for the master/content machine.

genadyp commented 10 years ago

What disturbs me regarding multiple files: how do we ensure that we do not lose any diff file? Is an ordinary backup enough? What about diffs of the content data of the backup machine?

kolmanv commented 10 years ago

We should have a predefined structure, for example one directory per day: one base file per day and several diff files in that directory. How would files get lost? On Jun 7, 2014 9:10 AM, "genadyp" notifications@github.com wrote:


bbfsdev commented 10 years ago

Maybe a key/value database would be a good solution here: low upkeep and simple code instead of SQL statements.

genadyp commented 10 years ago

I am thinking about object-relational mapping. I am not experienced with databases, but there should be an abstraction layer that provides a way to communicate with the database in an object-oriented way rather than by building SQL statements.

bbfsdev commented 10 years ago

I don't believe in such object-relational mappings; I think they will be hard to optimize. I think we should consider redis.

We should use its scripting abilities to keep the data structure in redis with all its commands.

Let's try building a design for a key/value database. We need access by SHA1 and access by server/filename, right?
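The two access paths mentioned above could be modeled as two indexes over the same instances. A sketch, with plain dicts standing in for a real key/value store such as redis; all names here are illustrative, not part of the bbfs code:

```python
# Two access paths over the same instances:
#   by_sha1:     sha1 -> set of (server, path) locations of that content
#   by_location: (server, path) -> sha1 of the content at that location

by_sha1 = {}
by_location = {}

def add_instance(sha1, server, path):
    by_sha1.setdefault(sha1, set()).add((server, path))
    by_location[(server, path)] = sha1

add_instance("da39a3ee", "srv1", "/data/a.txt")
add_instance("da39a3ee", "srv2", "/backup/a.txt")

print(sorted(by_sha1["da39a3ee"]))
# [('srv1', '/data/a.txt'), ('srv2', '/backup/a.txt')]
print(by_location[("srv1", "/data/a.txt")])  # da39a3ee
```

In redis the same shape could be kept as a set per SHA1 and a plain key per location, but the dual-index idea is the same.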

bbfsdev commented 10 years ago

On second thought, maybe a relational database is a good option. It has the joins we need to do operations on file sets.

genadyp commented 10 years ago

Let's think about following structure:

File Structure

Diff file structure

An entry with a minus sign means that the instance was removed from the file system; an entry without a special sign means that a new instance was added.

Diff calculation

To get the diff between "Year Y1 Month M1 Day D1" and "Year Y2 Month M2 Day D2", sum the cumulative diffs starting with: YearY1/MonthM1/Day(D1+1) + ... + YearY1/MonthM1/Day(last day of month M1)

When a removed entry is met (with a minus sign), remove the corresponding entry from the resulting diff.
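The range decomposition above can be sketched for the simple same-year case. All names are hypothetical; it only lists which diff files would be summed, following the day-diffs-to-end-of-month, then month-cumulative-diffs pattern described above:

```python
import calendar
from datetime import date

def diff_files_for_range(start, end):
    """Cumulative diff files to sum for (start, end], same-year ranges only:
    day diffs to the end of start's month, whole-month cumulative diffs in
    between, then day diffs inside end's month."""
    files = []
    if start.month == end.month:
        return [f"Year{start.year}/Month{start.month:02}/Day{d:02}"
                for d in range(start.day + 1, end.day + 1)]
    last_day = calendar.monthrange(start.year, start.month)[1]
    for d in range(start.day + 1, last_day + 1):
        files.append(f"Year{start.year}/Month{start.month:02}/Day{d:02}")
    for m in range(start.month + 1, end.month):
        files.append(f"Year{start.year}/Month{m:02}")  # month cumulative diff
    for d in range(1, end.day + 1):
        files.append(f"Year{end.year}/Month{end.month:02}/Day{d:02}")
    return files

print(diff_files_for_range(date(2014, 1, 30), date(2014, 3, 2)))
# ['Year2014/Month01/Day31', 'Year2014/Month02',
#  'Year2014/Month03/Day01', 'Year2014/Month03/Day02']
```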

Complexity

Time

Number of diffs that will be processed, worst case: up to (CurrentYear − 2) yearly cumulative diffs + 22 monthly cumulative diffs + 60 daily diffs.

Space

Data is duplicated three times: day diff, month cumulative diff, year cumulative diff.

Optimizations

  1. Subtract diffs from cumulative diff if it makes sense
    Examples:
    • For Month 10: Month 11 + Month 12
    • For Month 02: Year cumulative - Month 01
  2. Space-saving optimization:
    There is one file that contains the metadata of all instances.
    Diff files contain indexes into entries of this file instead of the metadata itself.
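The space-saving optimization above can be sketched as interning: one shared metadata table, with diff files holding integer row indexes into it. A minimal illustration with hypothetical structures:

```python
# Space-saving sketch: one shared metadata table; diff "files" hold
# integer indexes into it instead of repeating the metadata.

metadata = []   # row i holds the full metadata tuple of instance i
index_of = {}   # metadata tuple -> row index, used for deduplication

def intern(entry):
    """Return the row index for entry, appending it if not seen before."""
    if entry not in index_of:
        index_of[entry] = len(metadata)
        metadata.append(entry)
    return index_of[entry]

day1_diff = [intern(("srv1", "/a.txt", "sha1-a"))]
day2_diff = [intern(("srv1", "/b.txt", "sha1-b")),
             intern(("srv1", "/a.txt", "sha1-a"))]  # reuses row 0

print(day1_diff, day2_diff)  # [0] [1, 0]
```

The same instance appearing in many diffs (added, removed, re-added) then costs one integer per occurrence instead of a full metadata record.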
bbfsdev commented 10 years ago

What is a remove entry? Is it one row in a file?

  1. I think we can store remove entries per day in a separate file; that way we don't need to change the code. I mean that for each day we will have two files:

    • backup.data-20140324
    • backup.data-20140324.removed

    the .*.removed file will have entries for instances that were removed during that day (and not added back right away).

  2. One folder for all years is good enough for a start, i.e., no need for many folders. Let's just start with one.
  3. No need for a yearly cumulative file.
  4. The first file, whatever it will be, has to be cumulative.
  5. There is no need for exact time manipulation. I mean that all the calculations should converge the possible error rate to 0, i.e., it is not important if we index a file and it is gone a minute later; the next content data should have it in the removed file, and if for some reason it is not there, then the content data right after will have it.
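The two-files-per-day idea in point 1 amounts to a set difference between consecutive indexes. A sketch with a hypothetical helper (the real file naming is shown above, the in-memory form here is illustrative):

```python
def daily_files(previous_index, current_index):
    """Split a day's changes into added and removed entries, matching the
    backup.data-YYYYMMDD / backup.data-YYYYMMDD.removed pair of files."""
    added = {p: s for p, s in current_index.items() if p not in previous_index}
    removed = sorted(p for p in previous_index if p not in current_index)
    return added, removed

prev = {"/a.txt": "sha1-a", "/b.txt": "sha1-b"}
curr = {"/b.txt": "sha1-b", "/c.txt": "sha1-c"}
print(daily_files(prev, curr))  # ({'/c.txt': 'sha1-c'}, ['/a.txt'])
```

Note that, per point 5, a file briefly missing at scan time is fine: it lands in .removed now and back in the added file on the next scan, so the process converges.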
bbfsdev commented 10 years ago

I would extend the algorithm to have multiple files per day, i.e., if I set it up to index my files once every 15 hours then I should get something like:

  1. backup.data-20140324.0
  2. backup.data-20140324.1
  3. backup.data-20140325.0
  4. backup.data-20140325.1
  5. backup.data-20140326.0

(This index may also be time related, i.e., backup.data-20140326-0802 or backup.data-20140326-1602, UTC time).

That way you never have to change existing file content! You only add new files whenever you run your code. This will make the implementation simpler.
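One useful property of this naming scheme: with zero-padded dates and a fixed-width time suffix, lexicographic order equals chronological order, so finding the replay order is just a sort. (This is also a reason the time-based suffix in the parenthetical above is safer than a bare run counter, since ".10" would sort before ".2".) A tiny illustration:

```python
# Zero-padded date + fixed-width UTC time suffix: lexicographic order
# is chronological order, so replay order is simply sorted().
files = [
    "backup.data-20140325-0802",
    "backup.data-20140324-1602",
    "backup.data-20140326-0802",
    "backup.data-20140324-0802",
]
print(sorted(files))
# ['backup.data-20140324-0802', 'backup.data-20140324-1602',
#  'backup.data-20140325-0802', 'backup.data-20140326-0802']
```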

bbfsdev commented 10 years ago

So basically we have two parameters: 1) index every X time; 2) save a full index every Y time, where X is one day (what Shiray does in practice) and Y is a month.

bbfsdev commented 10 years ago

Actually this is very similar to what we have; we currently generate a full index each time. Now we will have to create two diff files (one is the .removed file), and when reading the content data there will be slightly more complex logic for finding the earlier base file, which should also be marked as: backup.data-20140324.base or backup.data-20140324.full or backup.data-20140324.cumulative

I feel we are getting to a simple and efficient solution.

genadyp commented 10 years ago

What is a remove entry? Is it one row in a file?

Actually, a file that was in a previous index but is absent from the current one.

I think we can store remove entries per day in separate file, that way we don't need to change the code. I mean for each day we will have two files:

Agree. Separate files for removed entries make the code simpler.

the .*.removed file will have entries with instances that were removed during that day (and not added back right away).

How can we track that a file/content was not added back right away?

One folder for all years is good enough for start, i.e., no need for many folders. Lets just start with one.

We can create folders lazily, when they are actually needed.

No need for yearly cumulative file.

That is not so clear to me.

First file, whatever it will be, has to be cumulative. There is no need for exact time manipulation. I mean that all the calculations should converge the possible error rate to 0, i.e., it is not important if we index a file and it is gone a minute later; the next content data should have it in the removed file, and if for some reason it is not there, then the content data right after will have it.

genadyp commented 10 years ago

save full index every Y time.

Why do we need to store a full index?

genadyp commented 10 years ago

backup.data-20140324.base or backup.data-20140324.full or backup.data-20140324.cumulative

What are .full, .base files?

bbfsdev commented 10 years ago

How we can track that file/content was not added back right away?

I think we don't need to. We just write a file at a specific moment in time; if the file is not there at that time but was there in the previous session, we write it. We don't mind making mistakes, as our process converges to the truth, i.e., we will index the files again later and the situation will be fixed.

We can create folders in a lazy way - when it actually needed.

Yes, but I don't want you to write code for this now; just assume one directory.

Why we need to store a full index? What are .full, .base files?

Here is what Shiray wrote: 1) how to store many snapshots effectively (avoiding huge duplicates in the data); 2) how to minimize the loading time of a snapshot into memory.

In order to do 2 we should store a full index file once in a while, i.e., once a week or once a month - the way we store files now, a full snapshot. Those are the .full or .base files.

genadyp commented 10 years ago

@kolmanv Could you write a summary of your vision? It differs from this suggestion and I do not see the whole picture.

bbfsdev commented 10 years ago

@genadyp @vshiray 1) One folder. 2) Parameter: save/flush the index to disk every X seconds (we already have this parameter); it will save only the incremental changes from the previous snapshot. 3) Parameter: save a full index every Y seconds (new parameter).

That's it!

Is this clear enough?

Shiray please approve.

vshiray commented 10 years ago

There are several topics for discussion:

1) Full data must be naturally ordered in time, so it can be named like "unix_timestamp.data".

2) A diff must be applied to the nearest data file. There are several options: a) The name part of the difference file is the same as the full data's. Every record includes what has changed and at what time. We have a single difference file for every full data file. b) The name part is greater than the full data's and equals the time the file was produced. Every record includes what has changed since the nearest full data. We have multiple difference files for every full data file.

The first option is more accurate because its resolution equals the indexing period without producing many difference files, but it is more complex to implement. The second option is also good enough; it makes it easy to remove frequent difference files (after one week, one month, ...).
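Option (b) above boils down to: find the newest full data file whose timestamp is not after the requested time, then apply the matching diff. A sketch using the "unix_timestamp.data" naming from point 1 (the helper name is hypothetical):

```python
def nearest_full(full_files, ts):
    """Newest 'unix_timestamp.data' file at or before ts (option b)."""
    candidates = [f for f in full_files if int(f.split(".")[0]) <= ts]
    if not candidates:
        return None
    return max(candidates, key=lambda f: int(f.split(".")[0]))

fulls = ["1400000000.data", "1402000000.data", "1404000000.data"]
print(nearest_full(fulls, 1403000000))  # 1402000000.data
```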

genadyp commented 10 years ago

Branch issue270_incremental_diff

bbfsdev commented 10 years ago

Is this ready for review? On Aug 2, 2014 9:57 AM, "genadyp" notifications@github.com wrote:

Branch issue270_incremental_diff https://github.com/bbfsdev/bbfs/tree/issue270_incremental_diff


genadyp commented 10 years ago

not yet

genadyp commented 10 years ago

@vshiray @kolmanv @yarondbb

Is this structure good enough:

Option 1:
date1.full - full snapshot for date1
date1-date2.added - files that were added from date1 till date2
date1-date2.removed - files that were removed from date1 till date2
date2-date3.added - files that were added from date2 till date3
date2-date3.removed - files that were removed from date2 till date3

Option 2:
date1.full - full snapshot for date1
date1-date2.added - files that were added from date1 till date2
date1-date2.removed - files that were removed from date1 till date2
date1-date3.added - files that were added from date1 till date3
date1-date3.removed - files that were removed from date1 till date3

Please review. Meanwhile I am implementing the first option.
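The two options differ in how a snapshot for dateN is reconstructed: option 1 chains every consecutive added/removed pair after the full snapshot, while option 2 needs only the single date1-dateN pair. A sketch of both restore paths, using sets of paths as a stand-in for the real file contents (hypothetical helper names):

```python
def restore_option1(full, chained_diffs):
    """Option 1: apply date1-date2, date2-date3, ... in order."""
    snap = set(full)
    for added, removed in chained_diffs:
        snap |= set(added)
        snap -= set(removed)
    return snap

def restore_option2(full, added, removed):
    """Option 2: a single date1-dateN diff applied to the full snapshot."""
    return (set(full) | set(added)) - set(removed)

full = {"a", "b"}
opt1 = restore_option1(full, [({"c"}, {"a"}), ({"d"}, set())])
opt2 = restore_option2(full, {"c", "d"}, {"a"})
print(opt1 == opt2)  # True
```

The trade-off: option 1 keeps each diff small but reads more files for distant dates; option 2 reads two files for any date but its diffs grow over time.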

bbfsdev commented 10 years ago

It all depends on practical usage - I mean the diff between date1, date2, and date3, and the diff between two ".full" files.

Implement what is easier.

bbfsdev commented 10 years ago

Code review done.

bbfsdev commented 10 years ago

Note: every date timestamp should guarantee that all files listed as stable before that second are in the content data. I mean that the monitoring service has to return the timestamp of the last scan (glob).