ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Synchronisation tools #150

Closed: skeenan closed this issue 9 years ago

skeenan commented 10 years ago

Proposed topic for ASHG: synchronisation tools. There are tools within the Hadoop ecosystem to pull down data one is missing. How do we synchronise data between distributed repositories? How do we cache data and handle local copies versus remote copies in a multi-cloud world?

pgrosu commented 10 years ago

Do you mean fault-tolerance in the file system (HDFS, GPFS, etc.), or that Hadoop/MapReduce/YARN/etc. should restart processing if there are any failures in resource availability while processing the data?

Also we should take a look at GPFS vs. HDFS to compare which file system might fit better here, especially for large data processing and analysis.

max-biodatomics commented 10 years ago

Also we should take a look at GPFS vs. HDFS to compare which file system might fit better here, especially for large data processing and analysis.

Paul: I wouldn't compare different file systems. Look, all of the file systems that are well integrated with Hadoop support the HDFS API, so we should only work with the HDFS API. Otherwise, we will need to discuss what to do with Lustre, Ceph, MapR, and several other file systems that are gaining popularity. If we require full compatibility with the HDFS API, then users can decide what to use.

fnothaft commented 10 years ago

@pgrosu @max-biodatomics as per the discussion with @massie, the question is more: if I have dataset A on n sites, how do I make sure that they all have identical data? How can I make sure that someone hasn't locally modified one of the copies, or that a sample in one of the datasets isn't corrupted? This should be at the data store level, not the file system level.

One approach is to compute checksums over portions of the dataset (e.g., blocks of records), and then to compute checksums hierarchically (checksum all blocks within a single sample, then all samples within a single experiment). I believe @delagoya brought up a good point: our checksums may need to be order independent, which might necessitate using a more collision-prone checksum like CRC32. However, by hierarchically dividing the space we are checksumming, we reduce the probability of harmful checksum collisions (if block 10 in sample Frank has the same checksum as block 100 in sample Matt, we don't care, because we know they're in different samples).

If you implement this, then you can determine which parts of the dataset are different between the n sites without doing a full diff. Then, you can limit the data you touch when you resolve the differences.
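
As a rough sketch of that hierarchical, order-independent scheme (the block size, the XOR combination, and the assumption that records are already serialized to bytes are all illustrative choices, not a proposal for the API):

```python
import hashlib

def block_digests(records, block_size=1000):
    """SHA-256 digest of each fixed-size block of serialized records."""
    digests = []
    for start in range(0, len(records), block_size):
        block = b"".join(records[start:start + block_size])
        digests.append(hashlib.sha256(block).digest())
    return digests

def combine_order_independent(digests):
    """XOR the digests so the combined value ignores block ordering."""
    combined = bytearray(hashlib.sha256().digest_size)
    for digest in digests:
        combined = bytearray(a ^ b for a, b in zip(combined, digest))
    return bytes(combined)

def sample_digest(records):
    """Checksum over all blocks belonging to a single sample."""
    return combine_order_independent(block_digests(records))

def experiment_digest(samples):
    """Checksum over all samples belonging to a single experiment.

    samples: dict mapping sample name -> list of serialized records.
    """
    return combine_order_independent(
        [sample_digest(records) for records in samples.values()]
    )
```

The XOR makes each level order independent but is weaker than hashing a canonical ordering, which is exactly why scoping collisions to a single sample, as described above, matters.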

A side point we'll have to address here is how the use of UUIDs plays into this. We currently make heavy use of UUIDs throughout the API, but if a datum can have different UUIDs at different sites, this process becomes much more complex.

max-biodatomics commented 10 years ago

We should probably set up rules for how UUIDs are generated, and make them depend on the data + metadata.
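
One hedged way to do that is to derive a name-based (version 5) UUID from a canonical serialization of the data plus its metadata, so every site computes the same identifier for the same datum. The namespace and the serialization below are illustrative assumptions, not anything the API currently specifies:

```python
import hashlib
import json
import uuid

# Placeholder namespace; any UUID the group agrees on would work.
DATASET_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "ga4gh.org/datasets")

def content_uuid(data: bytes, metadata: dict) -> uuid.UUID:
    """Derive a deterministic UUID from the data and its metadata."""
    canonical_metadata = json.dumps(metadata, sort_keys=True).encode("utf-8")
    name = hashlib.sha256(data + canonical_metadata).hexdigest()
    return uuid.uuid5(DATASET_NAMESPACE, name)
```

Two sites that hold the same data and metadata would then agree on the UUID without coordinating.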

adamnovak commented 10 years ago

Do we want this sort of verification to be cryptographically secure? Maybe in order to verify data integrity at a local site when trying to recover from a hack? Because CRC32 is reversible, it wouldn't cut it.

pgrosu commented 10 years ago

Max: I definitely agree that we should explore the major ones, since we're building something from scratch. My feeling is that we should take our time and do it the right way with respect to fast petabyte-scale search, processing, and analysis. This requires several discussions, with more people jumping in. I understand what you mean, but I thought we would focus on Hadoop and keep the file system flexible. For instance, you can modify the Hadoop configuration to use GPFS. In any case, I think we now have the time to re-explore these options in the context of the long-term vision of our goals. I'm sure the FileFormat team will have some input on this.

Frank: If you're talking about geo-replicated file systems, that's a whole different ballgame. Dealing with this level of detail is a specialized field, and we definitely should not reinvent it. Besides the ones that Max mentioned, the other that comes to mind is GlusterFS, which you can set up as a replicated volume. But this is a huge area! Even Cloudera offers HDFS replication. I really would like to hear from the FileFormat team on this, and from anyone else who wants to jump in. This has to be hashed out really well, since once it's in place it is nearly impossible to switch.

fnothaft commented 10 years ago

@pgrosu I'm not talking about geo-replicated file systems. I am not talking about file systems at all. Here's the setup: University U hosts dataset D, and Hospital H keeps its own copy of D.

How does Hospital H identify which parts of dataset D were corrupted, if any? Do they need to copy dataset D over from University U and do an XOR? This is the broader synchronization issue; if you and I have the "same" data, we should be able to perform a cheap data integrity check to make sure that we actually have the same data.
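
A minimal sketch of such a cheap check (the manifest layout and names here are hypothetical): each site keeps a small manifest of per-block digests, and the sites compare manifests rather than the data itself, so only blocks whose digests differ need to be inspected or re-transferred.

```python
import hashlib

def manifest(blocks):
    """blocks: dict mapping block id -> bytes. Returns block id -> hex digest."""
    return {block_id: hashlib.sha256(data).hexdigest()
            for block_id, data in blocks.items()}

def divergent_blocks(local_manifest, remote_manifest):
    """Block ids missing on either side or whose digests disagree."""
    all_ids = set(local_manifest) | set(remote_manifest)
    return sorted(block_id for block_id in all_ids
                  if local_manifest.get(block_id) != remote_manifest.get(block_id))
```

Hospital H would then fetch only the divergent blocks from University U, rather than copying all of dataset D and XORing.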

As an aside, I wouldn't make assumptions about file systems, implementation backends, or anything. I doubt that Google is running HDFS, or any of the other file systems mentioned.

pgrosu commented 10 years ago

Frank, we have to be careful how deep we go in designing a framework for ensuring dataset consistency. Let's say a dataset is a compressed file of multiple files. If you perform a CRC check, as @adamnovak suggested, then we are starting to look at a level of granularity that could overwhelm the bandwidth. Let's say a file in this dataset is corrupt: will you ping a master to initiate the transfer, or do you store the last known repository that your system knows has a clean copy? If you want to pull only the corrupt bytes, how will you know which ones they are in a 1 TB file? You would need to know which slices within the file are corrupted, and then you're basically keeping a lot of overhead about each dataset. This is a whole area of computer science which is well researched but requires specialized expertise. Preferably someone from the FileFormat team would chip in.

Basically, we are getting to the question of how you define what a dataset looks like and how you represent it on a system.

Since you're interested in getting a glimpse of one of Google's approaches, here's a link to the Spanner paper, though it requires infrastructure that I doubt we'll implement:

http://research.google.com/archive/spanner-osdi2012.pdf

As you probably know, HDFS was derived from GFS (the Google File System), which you can read more about in the following paper:

http://research.google.com/archive/gfs-sosp2003.pdf

If you are talking about the current implementation, I believe that's called Colossus (or GFS2); it draws on coding theory and information theory - the latter is my thesis subject area - to reduce the replication overhead through the use of Reed-Solomon codes.

max-biodatomics commented 10 years ago

Frank: I think you already know the solution. You are using a Parquet-based format, which can have checksums built in. Basically, the ADAM format has probably already resolved this problem.

Can you share it with us?

fnothaft commented 10 years ago

@max-biodatomics Definitely! The Parquet format keeps a CRC32 checksum per page of data. I believe another checksum is built from the page checksums and kept in the "file" metadata, but I'm not sure. @massie has been digging into Parquet's metadata store recently, so he'd be a better person to comment.

@pgrosu this isn't an intractable problem as long as we keep appropriate metadata. Data integrity checking is an important topic for any data management system.

vadimzalunin commented 10 years ago

HTTP transfers can be protected with CRC32. BAM kind of has a CRC32 for each 64k block. CRAM3 has a CRC32 for each container (typically up to 1 MB) and for each block (these can be small). NCBI's SRA format is, I think, also checksum-protected. But even though spot fixes are possible, in practice files are usually replaced as a whole.

From the SRA experience: with an MD5 on each file, it is not that difficult to replicate all public data between all INSDC members. This of course violates the file-less abstraction of the API.
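
For the per-file flavour of this, a manifest along the lines of an md5sum listing is enough for two mirrors to compare; the sketch below is illustrative only, with chunked reads so multi-gigabyte files are not loaded into memory.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 of one file, read in 1 MiB chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def md5_manifest(root):
    """Relative path -> MD5 for every file under root."""
    root = Path(root)
    return {str(path.relative_to(root)): md5_of_file(path)
            for path in sorted(root.rglob("*")) if path.is_file()}
```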

Transfers can be dealt with outside of the API, or with better checksums in response objects if needed. Storage syncing may be too difficult to specify in abstract terms. Besides, data consumers don't care how repos talk to each other; this is a B2B interface.

cassiedoll commented 10 years ago

We propose closing this in favor of #142 for ASHG.

tetron commented 10 years ago

To tie in with my comment on #142: using content hashes as identifiers for data blocks makes the synchronization problem much more tractable. It becomes possible to identify both intended and unintended changes to the data, and to transfer only those blocks that are missing or corrupt. In Arvados we've recently implemented an "arv-copy" tool which can transfer an entire analysis pipeline from one site to another by recursively copying the scripts, runtime environment (based on Docker), and reference data needed to execute the pipeline. This is surprisingly easy because Git objects, Docker images, and the Arvados Keep filesystem are content-addressed, so items can be transferred without the usual synchronization challenges such as name collisions and write-write conflicts.
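
A toy illustration of why content addressing sidesteps those problems (the flat directory layout is hypothetical and not how Arvados Keep, Git, or Docker actually store objects): if a block's identifier is the digest of its content, two sites can exchange blocks by name with no risk of name collisions, and corruption is caught the moment a digest is recomputed.

```python
import hashlib
from pathlib import Path

class ContentAddressedStore:
    """Store blocks under the hex SHA-256 of their content."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        (self.root / digest).write_bytes(data)
        return digest  # the same identifier at every site holding this block

    def get(self, digest: str) -> bytes:
        data = (self.root / digest).read_bytes()
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError(f"block {digest} failed verification")
        return data

    def missing(self, remote_digests):
        """Which of the remote site's blocks this store still needs."""
        return [d for d in remote_digests if not (self.root / d).exists()]
```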

cassiedoll commented 10 years ago

I believe #167 is now addressing this issue, and that this can be closed. Does anyone disagree?

awz commented 10 years ago

@cassiedoll - as I understand this, #167 is now the "parent" with various "child" issues (this one among them). Synchronization of two GA4GH end-points could/should still be discussed somewhere unless you want to close all the "child" issues?

cassiedoll commented 10 years ago

Okay - that makes sense - will leave open. Thanks!

delagoya commented 9 years ago

Closing does not prevent viewing this issue. Issue #167 covered a bunch of the issues and is now closed. The following project holds an implementation of digests, with a set of examples to talk around: https://github.com/massie/gastore

Finally, the newish "Digests, Containers, Workflows" working group will create a separate repository (or use the one above) to host timely and relevant discussion.