baiwfg2 / awesome-readings

Notes and takeaways from reading various articles and papers

FoundationDB #28

Open baiwfg2 opened 3 years ago

baiwfg2 commented 3 years ago

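Apple's distributed key-value store. It supports transactions, and its components are highly independent of one another. For its rigorous testing, see #26.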

[1] https://news.ycombinator.com/item?id=27424605 , discussion of the SIGMOD '21 paper

[2] http://charap.co/reading-group-foundationdb-a-distributed-unbundled-transactional-key-value-store/ , from Aleksey Charapko's regular paper reading group

[3] https://forums.foundationdb.org/t/how-much-time-and-resourced-are-dedicated-to-foundationdb-testing/2859 , a question I posted about how much time and how many resources the testing takes

[4] https://forums.foundationdb.org/t/discussion-thread-for-new-storage-engine-ideas/101 , a good idea: let FDB support multiple storage engines

[4] https://www.foundationdb.org/files/fdb-paper.pdf , the SIGMOD '21 paper

[5] https://www.youtube.com/watch?v=nlus1Z7TVTI&list=PLbzoR-pLrL6q7uYN-94-p_-Q3hyAmpI7o&index=9 , introduces Redwood, FDB's future storage engine

[6] https://apple.github.io/foundationdb/architecture.html , official documentation

[7] https://blog.the-pans.com/notes-on-the-foundationdb-paper/

baiwfg2 commented 3 years ago

🍑 [8] https://forums.foundationdb.org/t/technical-overview-of-the-database/135 , someone complains that the official documentation is too shallow

The transactions are serialized via the commit timestamp received from the master

This is why it’s strongly recommended that transactions should be idempotent, so that they handle commit_result_unknown correctly.

What happens if a transaction is not idempotent?
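One way to see the risk is a toy simulation. This is a minimal sketch, not FDB client code: `FlakyStore`, `UnknownResult`, and the `applied/` marker key are hypothetical names invented for illustration. The store really applies the first commit but loses the acknowledgement, so the client retries. A plain increment is then applied twice, while a transaction that records a unique operation id is applied once.

```python
import uuid


class UnknownResult(Exception):
    """Hypothetical stand-in for a 'commit result unknown' error."""


class FlakyStore:
    """In-memory store whose first commit succeeds but whose reply is lost."""

    def __init__(self):
        self.data = {}
        self.lose_next_ack = True

    def commit(self, txn):
        staged = dict(self.data)
        txn(staged)                 # run the transaction body against a copy
        self.data = staged          # the commit really happens...
        if self.lose_next_ack:
            self.lose_next_ack = False
            raise UnknownResult()   # ...but the client never learns about it


def run_with_retry(store, txn):
    while True:
        try:
            return store.commit(txn)
        except UnknownResult:
            continue  # cannot tell success from failure, so retry


def naive_deposit(data):
    data["balance"] = data.get("balance", 0) + 100


def make_idempotent_deposit(op_id):
    def deposit(data):
        marker = "applied/" + op_id
        if marker in data:          # an earlier attempt already committed
            return
        data["balance"] = data.get("balance", 0) + 100
        data[marker] = True
    return deposit


naive = FlakyStore()
run_with_retry(naive, naive_deposit)

safe = FlakyStore()
run_with_retry(safe, make_idempotent_deposit(uuid.uuid4().hex))

print(naive.data["balance"], safe.data["balance"])  # prints "200 100"
```

The marker key is written in the same transaction as the balance change, so the "already applied" check and the update succeed or fail together.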

In the case of any failures in {proxy, resolver, tlog}, the entire subsystem is torn down and recreated.

The only way for a “single mutation failure” to occur would be for a transaction to crash or become network partitioned – a failure either way

🍑 [8.1] https://forums.foundationdb.org/t/is-there-more-detailed-design-documents/274/4 , someone hopes for more design documents, but none exist yet; someone else asks whether FDB has the fsync problem that PostgreSQL ran into

🍑 [8.2] https://forums.foundationdb.org/t/questions-about-the-recently-accepted-fdb-paper-in-sigmod21/2732 , questions about the SIGMOD '21 paper

Proxy broadcast commit version and previous commit version to all resolvers for the same reason so that resolvers can process all transactions in the commit order
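The ordering role of the (previous commit version, commit version) pair can be sketched as follows. This is a toy model under stated assumptions, not FDB's resolver: the `Resolver` class and its fields are hypothetical. Each batch names the version that must be processed immediately before it, so a resolver can buffer batches that arrive early and still apply them in commit order.

```python
class Resolver:
    """Toy resolver that applies commit batches strictly in version order."""

    def __init__(self, start_version=0):
        self.last_version = start_version
        self.pending = {}       # prev_version -> (version, batch)
        self.processed = []

    def receive(self, prev_version, version, batch):
        self.pending[prev_version] = (version, batch)
        # Drain every batch whose predecessor has already been processed.
        while self.last_version in self.pending:
            version, batch = self.pending.pop(self.last_version)
            self.processed.append((version, batch))
            self.last_version = version


resolver = Resolver()
resolver.receive(20, 30, "batch C")   # arrives early, is buffered
resolver.receive(0, 10, "batch A")    # processed immediately
resolver.receive(10, 20, "batch B")   # unblocks B, then C

print(resolver.processed)
# prints [(10, 'batch A'), (20, 'batch B'), (30, 'batch C')]
```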

If proxy can’t send to log server, this will trigger a reconfiguration, i.e., transaction system recovery


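🍑 [9] https://news.ycombinator.com/item?id=16877395 , Apple announced the open-sourcing in April 2018, which excited FoundationDB's original creators. Snowflake has used it for years and built many layers on top of it. Will Wilson says they had built something compatible with the MongoDB API before FDB was acquired, though someone says it was implemented very poorly (a "nightmare").

Some are skeptical: FDB is described too favorably, they don't believe there are no downsides, and they want to know its "dirty little secrets".

Shortly after Apple announced the open-sourcing, Wavefront also said it would keep investing in FDB.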

Wavefront has been using FoundationDB extensively with over 50 clusters spanning petabytes of data in production

The client is complex and needs very sophisticated testing, so there is only one implementation. All the language bindings use the C library the "client" is usually a (stateless, higher layer) database node itself

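So they call the DB driver the "client"? When I tested FDB with YCSB, I did have to specify the path to the C library before it would run.

Like me, some people want to know how it differs from other databases such as MySQL and PostgreSQL, and what its use cases are.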


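🍑 [10] https://www.youtube.com/watch?v=st0VjQdpZL4&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq&index=80 , the first author presents the paper. The talk is mediocre. One lesson of theirs: when a role process becomes a performance bottleneck, split it into multiple roles. That is how the data distributor and Ratekeeper were earlier moved out of the sequencer.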

baiwfg2 commented 3 years ago

🍑 [11] https://www.youtube.com/watch?v=EMwhsGsxfPU&list=PLbzoR-pLrL6q7uYN-94-p_-Q3hyAmpI7o , Evan from Snowflake; a good talk. An SS keeps only 5 s of recent history (why? Do other databases do this? Isn't it only the resolver that keeps 5 s of history?) https://github.com/apple/foundationdb/blob/master/documentation/sphinx/source/kv-architecture.rst#storage-servers

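If a client does not know which SS to read from, it asks a proxy, which holds the whole map<key, SS>. Metadata such as this mapping is itself stored in the database, under keys that start with \xff. Rebalancing is itself carried out as transactions.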
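A key-to-storage-server map of this kind is commonly kept as a sorted list of range boundaries. The sketch below is illustrative only: `ShardMap`, the boundaries, and the server names are all made up, and this is not FDB's actual data structure. It finds the owning team for a key with a binary search.

```python
import bisect


class ShardMap:
    """Maps a key to the storage-server team that owns its range."""

    def __init__(self, boundaries, teams):
        # boundaries[i] is the first key of shard i; the first must be b"".
        assert boundaries[0] == b"" and len(boundaries) == len(teams)
        self.boundaries = boundaries
        self.teams = teams

    def lookup(self, key):
        # Index of the last boundary that is <= key.
        i = bisect.bisect_right(self.boundaries, key) - 1
        return self.teams[i]


shards = ShardMap(
    [b"", b"g", b"p", b"\xff"],
    [["ss1", "ss2"], ["ss2", "ss3"], ["ss1", "ss3"], ["ss1", "ss2", "ss3"]],
)

print(shards.lookup(b"apple"))                  # ['ss1', 'ss2']
print(shards.lookup(b"g"))                      # ['ss2', 'ss3']
print(shards.lookup(b"zebra"))                  # ['ss1', 'ss3']
print(shards.lookup(b"\xff/some-system-key"))   # ['ss1', 'ss2', 'ss3']
```

Because the system keys under \xff are just another range in the same map, the metadata can be read and updated with ordinary transactions.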

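Although the master is a singleton, it is not a bottleneck, because its job is simple enough. How can clients be kept from running high-conflict workloads, to avoid failures at the resolver? (🙋: the workload is whatever it is; how could the database control it?) Keep the number of resolvers small and scale up as far as possible. Does the proxy know the key -> log server map? A proxy does not ask the other proxies for the latest version on every request; it asks in batches (🙋: doesn't this hurt the latency of an individual transaction?). (As of 7.0+ this is no longer the case; the GRV comes from the master: https://forums.foundationdb.org/t/why-doesn-t-proxy-get-read-version-directly-from-master-but-need-to-broadcast-all-other-proxies/494/4 )

Randomized tests run every night with possible faults injected, and the cases that reported errors can be analyzed first thing in the morning. The speaker has 8 years of FDB experience, 6 of which were spent tracking down problems with simulation testing.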
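On the latency question, a rough sketch of request batching may help. This is an asyncio toy, not FDB's proxy: `BatchingProxy`, `fetch_version`, and the 10 ms interval are assumptions made for illustration. Every caller waits for at most one batch interval plus one version fetch, and in exchange many callers share a single fetch.

```python
import asyncio


class BatchingProxy:
    """Answers many read-version requests with one upstream fetch per batch."""

    def __init__(self, fetch_version, interval=0.01):
        self.fetch_version = fetch_version  # coroutine function returning a version
        self.interval = interval            # how long a batch is allowed to fill
        self.waiters = []
        self.fetches = 0
        self._flusher = None

    async def get_read_version(self):
        fut = asyncio.get_running_loop().create_future()
        self.waiters.append(fut)
        if self._flusher is None:           # first request of a new batch
            self._flusher = asyncio.ensure_future(self._flush())
        return await fut

    async def _flush(self):
        await asyncio.sleep(self.interval)
        batch, self.waiters = self.waiters, []
        self._flusher = None
        self.fetches += 1
        version = await self.fetch_version()
        for fut in batch:
            fut.set_result(version)


async def demo():
    async def fetch_version():              # stand-in for the upstream round trip
        return 1000

    proxy = BatchingProxy(fetch_version)
    versions = await asyncio.gather(*(proxy.get_read_version() for _ in range(5)))
    return versions, proxy.fetches


versions, fetches = asyncio.run(demo())
print(versions, fetches)  # prints "[1000, 1000, 1000, 1000, 1000] 1"
```

The trade-off is a small, bounded delay per request against far fewer upstream round trips under load.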

baiwfg2 commented 3 years ago

Official documentation

🍑 [6]

Proxies periodically send empty commits to transaction logs to keep commit versions increasing, in case there is no client generated transactions

Finally, a recovery will fast forward time by 90 seconds, which would abort any in-progress client transactions with transaction_too_old error

🍑 [6.1] https://apple.github.io/foundationdb/read-write-path.html

The metadata on all proxies are consistent at any given timestamp. To achieve that, when a proxy has a metadata mutation that changes the metadata at the timestamp V1, the mutation is propagated to all proxies (through the concurrency control component), and its effect is applied on all proxies before any proxy can process transactions after the timestamp V1.

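How is this enforced?

Is the SS in-memory structure a p-tree? It holds only the most recent 5 s of multi-version KV data, so an FDB transaction cannot run longer than 5 s??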
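The link between a 5 s history window and a 5 s transaction limit can be sketched with a toy versioned store. Two things here are assumptions and do not come from these notes: versions advance at roughly one million per second, and the class and exception names are invented. A read at a version that has fallen out of the window cannot be served, so the transaction that owns that read version has to fail.

```python
VERSIONS_PER_SECOND = 1_000_000          # assumption: about 1M versions per second
MVCC_WINDOW = 5 * VERSIONS_PER_SECOND    # 5 s of history, expressed in versions


class TransactionTooOld(Exception):
    """Hypothetical error for a read version older than the kept history."""


class VersionedStore:
    def __init__(self):
        self.history = {}   # key -> list of (version, value), ascending by version
        self.latest = 0

    def write(self, version, key, value):
        # Writes are assumed to arrive in increasing version order.
        self.history.setdefault(key, []).append((version, value))
        self.latest = max(self.latest, version)

    def read(self, key, read_version):
        if read_version < self.latest - MVCC_WINDOW:
            raise TransactionTooOld(read_version)
        value = None
        for version, candidate in self.history.get(key, []):
            if version <= read_version:
                value = candidate   # newest value visible at read_version
        return value


store = VersionedStore()
store.write(1_000_000, "k", "old")
store.write(3_000_000, "k", "new")
print(store.read("k", 2_000_000))        # prints "old"

store.write(9_000_000, "other", "x")     # the store moves about 6 s ahead
try:
    store.read("k", 2_000_000)
    outcome = "served"
except TransactionTooOld:
    outcome = "too old"
print(outcome)                           # prints "too old"
```

Under the same assumption, a jump of 90 s is far larger than the 5 s window, so every read version handed out before the jump becomes too old.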

each proxy contacts the queuing system for each timestamp request to confirm it is still a valid proxy

What exactly does this contact do? Is it a heartbeat? (A valid proxy is one that belongs to the current generation, not an older one.)

Compared to serializable isolation, Strict Serializable Isolation (SSI) requires external consistency ??

(Surprisingly, SSI here does not stand for serializable snapshot isolation.)

🍑 [6.2] https://apple.github.io/foundationdb/administration.html , deployment and administration

When the exclude command completes successfully (by returning control to the command prompt), the machines that you specified are no longer required to maintain the configured redundancy mode. A large amount of data might need to be transferred first, so be patient

Removing nodes from a cluster. Has the data still not been moved off when exclude returns?

Configuration of server-side latency bands is performed by setting the \xff\x02/latencyBandConfig key to a string encoding the following JSON document

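Where is this configured so that status prints latency-band statistics? (🍎 These statistics are server-side, measured from receiving a command to returning the response.)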

Each fdbserver process uses up to one full CPU core, so a production FoundationDB cluster will usually run N such processes on an N-core system

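Each server process uses just one core?? Is that really how it's done?

fdbmonitor is responsible for starting fdbserver and backup_agent. It is presumably not a core component of the FDB architecture? The paper does not mention it either.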

In a multiple-datacenter configuration, it is recommended that you set the redundancy mode to three_datacenter and that you set the locality_dcid parameter for all FoundationDB processes in foundationdb.conf

Multi-DC deployment

baiwfg2 commented 3 years ago

[7]

It's also not sharded, meaning the entire key space is essentially on one logical shard.

Is the data not sharded at the SS layer?

ClusterController monitors the health of all servers (presumably via heartbeats, as it's not mentioned in the paper).

Indeed, the paper does not explain how the liveness of each server is monitored.

baiwfg2 commented 3 years ago

🍑 [12] https://forums.foundationdb.org/t/dockerized-deployment/1561/2 , containerization

🍑 [13] https://blog.couchdb.org/2020/02/26/the-road-to-couchdb-3-0-prepare-for-4-0/ , CouchDB community: the IBM team proposes migrating storage to FDB [13.1] https://forums.foundationdb.org/t/update-couchdb-4-0-on-foundationdb/1690

baiwfg2 commented 3 years ago

[1] jeffbee says that in his experience FDB is not rigorously tested?

Flow is an async/await development framework.

davgoldin says they use FDB as a distributed file system and it works well (https://forums.foundationdb.org/t/whats-the-purpose-of-the-directory-layer/677 , https://forums.foundationdb.org/t/object-store-on-foundationdb/387/3 )

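The original FDB team went on to do automated testing for finding bugs: antithesis.com

Meai says it is hard to tell how far the officially claimed "safe" and "robust" actually hold; there are often hidden knobs; and the DSL is not recommended because it is hard to understand, so the code should move to coroutines.

Building a full-text search system on top of FDB to replace Lucene is very hard; Lucene is mature and hard to beat.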

🍎  

the only really general statement i can think of is that the "larger"/"longer" your transactions are, the harder a time you'll have getting it to cooperate with FDB. "small"/"fast" transactions will be easier to fit into its model.

Performance is poor under high contention. (Not sure whether the VLDB '20 paper helps.)

baiwfg2 commented 3 years ago

Jepsen on FDB

🍑 [14] https://web.archive.org/web/20150312112556/http://blog.foundationdb.com/foundationdb-vs-the-new-jepsen-and-why-you-should-care , Jepsen testing

🍑 [15] https://web.archive.org/web/20150312112552/http://blog.foundationdb.com/call-me-maybe-foundationdb-vs-jepsen , the Jepsen author says Jepsen cannot yet rigorously test FDB

baiwfg2 commented 3 years ago

[16] https://www.youtube.com/watch?v=A3U8M8pt3Ks , FoundationDB Summit 2019, "Managing FoundationDB at Scale" by John Brownlee, Apple; covers how to run FDB on k8s

A recovery can be triggered by process failure, exclusions, reconfigurations or bounces (??)

When we bounce a cluster whether it's part of an upgrade or knob change any reason whatsoever we bounce everything at once. This bounce strategy also ensures that different processes never run with different configurations.

All processes are started through fdbmonitor; fdbmonitor itself restarts very quickly.

baiwfg2 commented 3 years ago

[17] https://apple.github.io/foundationdb/flow.html , the Flow framework, designed to meet three goals: good performance, an actor-model basis, and ease of simulation

For example, one computer could create a promise/future pair, then send the promise to another computer over the network

While an actor is in wait(future), it does not block other actors.
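The non-blocking behaviour has a close analogue in Python's asyncio, which is used here only as an analogy and is not Flow itself. In this sketch an asyncio future stands in for the promise/future pair: one coroutine suspends on `await future` while another keeps running, and setting the result plays the role of fulfilling the promise.

```python
import asyncio


async def consumer(future, log):
    log.append("consumer: waiting")
    value = await future            # suspends only this coroutine
    log.append(f"consumer: got {value}")


async def ticker(log):
    for i in range(3):
        log.append(f"ticker: {i}")
        await asyncio.sleep(0)      # yield to the event loop


async def main():
    log = []
    future = asyncio.get_running_loop().create_future()
    tasks = [
        asyncio.ensure_future(consumer(future, log)),
        asyncio.ensure_future(ticker(log)),
    ]
    await asyncio.sleep(0.01)       # let both coroutines make progress
    future.set_result(42)           # analogous to fulfilling the promise
    await asyncio.gather(*tasks)
    return log


log = asyncio.run(main())
print(log)
```

The ticker's entries appear between "consumer: waiting" and "consumer: got 42", which shows that the waiting coroutine did not hold up the other one.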

An ACTOR is compiled into a class internally. Which means that within an actor-function, this is a valid pointer to this class