christianfelicite commented 4 years ago

A data lake is a system or repository of data stored in its natural/raw format,[1] usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data, etc.,[2] and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).[3] A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, or Google).

christianfelicite commented 4 years ago

https://www.capgemini.com/wp-content/uploads/2017/07/the_principles_of_the_business_data_lake_2013-12-02_v07_web.pdf

The old approach was based on the challenges of 30 years ago, multiple lifetimes in an IT sense. Today there are many more questions around data that need to be answered:

• How to handle unstructured data?
• How to link internal and external data?
• How to adapt at the speed of business change?
• How to remove the repetitive ETL cycle?
• How to support different levels of data quality and governance based on differing business demands?
• How to let local business units take the initiative?
• How to ensure the platform will deliver and will be adopted?

Combined with this, the last 30 years have seen a dramatic change in the technology available: technologies that can slash the cost of data storage, enable real-time analytics and provision information for business users at much faster speeds. It is these new challenges and the impact of new technology that have led to the Business Data Lake solution and methodology, an approach that starts with the objective of building on how the business operates and delivering a new information culture that leverages rather than fights the business culture. The Business Data Lake is built for today, using today’s technology in a way that meets the demands of business today.

Land Everything – Data Storage

The first change with the Business Data Lake is the desire to land everything without modification. Using the Hadoop file system (HDFS) it is possible to simply ‘dump’ information from source systems into Hadoop and not worry about transformations or formatting. This new approach means that:

• Time analysis of information is now possible
• Information maps can be left until needed, and are no longer required at the start of a program before data can be ingested.

This new approach is extremely quick to deliver as the technical complexity is low. It also means that IT has already made the information available for the business to use, and, more than just the current information, it contains the full source data history (SDH) of those source systems.
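To make the "land everything" idea concrete, here is a minimal sketch in Python. It is not taken from the Capgemini paper: a local directory stands in for HDFS, and the function names (`land_raw`, `read_with_map`, `history`), the date-based folder layout, and the sample CSV are all assumptions made for illustration. The point it tries to show is that the bytes are written untouched, every landed copy is kept, and the "information map" is only applied when someone reads the data.

```python
import csv
import io
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root, source, name, payload, when):
    """Write payload unchanged under <lake_root>/<source>/<YYYY-MM-DD>/<name>.

    The layout is a hypothetical convention, not an HDFS requirement.
    """
    target_dir = lake_root / source / when.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, no reformatting
    return target


def history(lake_root, source, name):
    """Return every landed copy of a file, oldest first (the source data history)."""
    return sorted((lake_root / source).glob(f"*/{name}"))


def read_with_map(path, column_map):
    """Apply an information map (source column -> business name) at read time."""
    text = path.read_text(encoding="utf-8")
    rows = csv.DictReader(io.StringIO(text))
    return [
        {column_map[col]: value for col, value in row.items() if col in column_map}
        for row in rows
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        day1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
        day2 = datetime(2024, 1, 2, tzinfo=timezone.utc)
        # Two daily extracts of the same (made-up) source file are both kept.
        land_raw(lake, "crm", "customers.csv", b"cust_id,nm\n1,Ada\n", day1)
        latest = land_raw(
            lake, "crm", "customers.csv", b"cust_id,nm\n1,Ada\n2,Alan\n", day2
        )
        print(len(history(lake, "crm", "customers.csv")))
        # The mapping is decided only now, long after ingestion.
        print(read_with_map(latest, {"cust_id": "customer_id", "nm": "name"}))
```

Because nothing is transformed on the way in, a different mapping can later be applied to the same landed files, including the older copies.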

christianfelicite commented 4 years ago

https://stackoverflow.com/questions/52390028/is-data-lake-and-big-data-the-same

Big Data

Is used to describe both the technology ecosystem around, and to some extent the industry that deals with, data that is in some way too big or too complex to be conveniently stored and/or processed by traditional means.

Sometimes this can be a matter of sheer data volume: Once you get into the 100s of terabytes or petabytes, your good old-fashioned relational databases tend to throw in the towel, and we are forced to spread our data across many disks, not just one large one. And at those volumes we'll want to parallelize our workloads, leading to things like MPP databases, the Hadoop ecosystem, and DAG-based processing.
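As a rough illustration of "spread the data, then parallelize the work", the sketch below hash-partitions records into shards, aggregates each shard independently, and merges the partial results. It is a toy under stated assumptions: threads stand in for cluster nodes (so it shows the shape of the computation, not real speedup), and the names `partition`, `sum_by_key`, and `parallel_sum` are made up for this example rather than taken from any MPP database or Hadoop API.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_shards):
    """Assign each (key, value) record to a shard using a stable hash of the key."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shard_id = zlib.crc32(key.encode("utf-8")) % num_shards
        shards[shard_id].append((key, value))
    return shards


def sum_by_key(shard):
    """Per-shard work: total the values for each key seen in this shard."""
    totals = Counter()
    for key, value in shard:
        totals[key] += value
    return totals


def parallel_sum(records, num_shards=4):
    """Partition, process every shard concurrently, then merge the partial totals."""
    shards = partition(records, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = list(pool.map(sum_by_key, shards))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


if __name__ == "__main__":
    clicks = [("page_a", 1), ("page_b", 2), ("page_a", 3), ("page_c", 5)]
    print(parallel_sum(clicks, num_shards=3))
```

Hashing on the key means all records for one key end up in the same shard, so each shard can be processed without talking to the others; only the final merge needs to see every partial result.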

However, volume alone does not tell the whole story. A popular definition of Big Data is described by the so-called '4 Vs': Volume, Variety, Velocity, and Veracity.

In a nutshell:

Volume - as mentioned above, refers to the difficulty caused by the size of the data
Variety - refers to the inherent complexity of dealing with disparate types of data; some of your data will be structured (think SQL data tables), while other data might be either semi-structured (XML documents) or unstructured (raw image files), and the technology to deal with this variety is nontrivial
Velocity - refers to the velocity with which new data may be generated; when collecting real-time events like IoT data, or web traffic, or financial transactions, or database changes, or anything else that happens in real time, the 'velocity' of data flowing into (and in many cases, out of) your systems can easily exceed the capabilities of traditional database technologies, necessitating some sort of scalable message bus (Kafka) and possibly a Complex Event Processing framework (such as Spark Streaming or Apache Flink)
Veracity - the final 'V', refers to the added complexity of dealing with data which often comes from sources outside of your control, and which may contain data which is invalid, erroneous, malicious, malformed, or all of the above. This adds a need for data validation, data quality checking, data normalization, and more.
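The veracity point is the easiest of the four to sketch in code. The example below is only an illustration: the event fields (`device_id`, `reading`, `timestamp`), the accepted range of -50 to 150, and the function names are all invented for this sketch. It validates each incoming record, normalizes the ones that pass, and keeps the rejects together with the reasons so they can be inspected rather than silently dropped.

```python
from datetime import datetime


def validate_event(raw):
    """Return (clean_record, []) if the event is usable, else (None, [reasons])."""
    errors = []

    device_id = str(raw.get("device_id", "")).strip()
    if not device_id:
        errors.append("missing device_id")

    try:
        reading = float(raw.get("reading"))
    except (TypeError, ValueError):
        errors.append("reading is not a number")
        reading = None
    else:
        # Hypothetical plausibility range; NaN also fails this comparison.
        if not -50.0 <= reading <= 150.0:
            errors.append("reading out of range")

    try:
        ts = datetime.fromisoformat(str(raw.get("timestamp")))
    except ValueError:
        errors.append("timestamp is not ISO 8601")
        ts = None

    if errors:
        return None, errors
    # Normalization: canonical casing for IDs, canonical timestamp string.
    clean = {
        "device_id": device_id.lower(),
        "reading": reading,
        "timestamp": ts.isoformat(),
    }
    return clean, []


def split_valid_rejected(events):
    """Separate usable events from rejects, keeping the reject reasons."""
    valid, rejected = [], []
    for raw in events:
        clean, errors = validate_event(raw)
        if clean is None:
            rejected.append((raw, errors))
        else:
            valid.append(clean)
    return valid, rejected


if __name__ == "__main__":
    incoming = [
        {"device_id": " Sensor-1 ", "reading": "21.5", "timestamp": "2024-01-01T00:00:00"},
        {"device_id": "", "reading": "abc", "timestamp": "yesterday"},
    ]
    good, bad = split_valid_rejected(incoming)
    print(good)
    print(bad)
```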

In this definition, 'big data' is data which, due to the particular challenges associated with the 4 Vs, is unfit for processing with traditional database technologies; while 'big data tools' are tools which are specifically designed to deal with those challenges.

Data Lake

In contrast, Data Lake is generally used as a term to describe a certain type of file or blob storage layer that allows storage of practically unlimited amounts of structured and unstructured data as needed in a big data architecture.

Some companies will use the term 'Data Lake' to mean not just the storage layer, but also all the associated tools, from ingestion, ETL, wrangling, machine learning, analytics, all the way to data warehouse stacks and possibly even BI and visualization tools.

As a big data architect, however, I find that use of the term confusing and prefer to talk about the data lake and the tooling around it as separate components with separate capabilities and responsibilities. As such, the responsibility of the Data Lake is to be the central, high-durability store for any type of data that you might want to store at rest.

By most accounts, the term 'data lake' was coined by James Dixon, Founder and CTO of Pentaho, who describes it thus:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

Amazon Web Services defines it on their page 'What Is A Data Lake':

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

From Wikipedia:

A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning.

And finally Gartner:

A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).

On on-premises clusters, the data lake usually refers to the main storage on the cluster, in the distributed file system, usually HDFS, though other file systems exist, such as GFS used at Google or the MapR File System on MapR clusters.

In the cloud, data lakes are generally not stored on clusters, since it's just not cost-effective to keep a cluster running at all times, but rather on durable cloud storage, such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage. Compute clusters can then be launched on demand and connect seamlessly to the cloud storage to run transformations, machine learning, analytical jobs, etc.

christianfelicite commented 4 years ago

https://ithealth.io/data-lake-retour-vers-la-definition/

TheFeloDevTeam / FeloFamilySite

What is a data lake? Big data? How do they differ? #37
