[Question] How to quickly import one billion vertices and edges

zhaorui9303 commented 1 month ago

Problem Type

None

Before submit

[X] I have searched the existing Issues and FAQ and confirmed that there are no identical or duplicate problems

Environment

Server Version: 1.0.0 (Apache Release Version)
Backend: RocksDB x nodes, HDD or SSD
OS: xx CPUs, xx G RAM, Ubuntu 2x.x / CentOS 7.x
Data Size: xx vertices, xx edges

Your Question

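We have roughly 1 billion records of incremental data per day, and the underlying storage is HBase. When importing with the loader tool, after running for a while we hit "rest server is too busy to write". I tried limiting the import rate (rate_limit.write=10000) and raising the share of threads used for batch writes (batch.max_write_ratio=90). With these settings the import no longer fails, but import performance becomes very poor. Are there tuning recommendations for importing data at this scale? The goal is to maximize import throughput while keeping the import error-free.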

Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

No response

dosubot[bot] commented 1 month ago

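To import roughly 1 billion incremental records per day into HugeGraph efficiently and avoid the "rest server is too busy to write" error, consider the following tuning suggestions:

Use bulk loading: use HBase's bulk load feature instead of writing to HBase directly. This involves generating HFiles and then loading them into HBase. The following code example generates the HFiles: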

JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
    buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
Configuration conf = sinkToHBase.getHBaseConfiguration().get();
Job job = Job.getInstance(conf);
HFileOutputFormat2.configureIncrementalLoadMap(job, tableDescriptor);
conf.set("hbase.mapreduce.hfileoutputformat.table.name",
         tableDescriptor.getTableName().getNameAsString());
String path = getHFilePath(job.getConfiguration());
repartitionedRdd.saveAsNewAPIHadoopFile(path, ImmutableBytesWritable.class,
                                        KeyValue.class, HFileOutputFormat2.class,
                                        conf);
LOG.info("Saved HFiles to: '{}'", path);
flushPermission(conf, path);

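Repartition and sort the data: before generating HFiles, repartition and sort the data so that it is evenly distributed and ordered. This helps reduce the load on the HBase region servers.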
```
JavaPairRDD<ImmutableBytesWritable, KeyValue> repartitionedRdd =
    buildAndSerRdd.repartitionAndSortWithinPartitions(partitioner);
```
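
To make the repartition-and-sort step concrete without a Spark dependency, here is a minimal plain-Java sketch. It assigns each row key to a region by comparing it against the region split points, then sorts the keys inside each partition using unsigned byte-wise ordering, which is how HBase orders row keys. The class name `RegionPartitionSketch`, its method names, and the split points are hypothetical and chosen for illustration only; this is not HugeGraph's actual partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch: group row keys by region, then sort within each group. */
final class RegionPartitionSketch {

    /** Returns the index of the region whose key range contains rowKey. */
    static int partitionFor(byte[] rowKey, List<byte[]> splitPoints) {
        int region = 0;
        // Region i holds keys in [splitPoints[i - 1], splitPoints[i]).
        while (region < splitPoints.size()
               && Arrays.compareUnsigned(rowKey, splitPoints.get(region)) >= 0) {
            region++;
        }
        return region;
    }

    /** Splits rowKeys into one list per region and sorts each list. */
    static List<List<byte[]>> repartitionAndSort(List<byte[]> rowKeys,
                                                 List<byte[]> splitPoints) {
        List<List<byte[]>> partitions = new ArrayList<>();
        // N split points delimit N + 1 regions.
        for (int i = 0; i <= splitPoints.size(); i++) {
            partitions.add(new ArrayList<>());
        }
        for (byte[] key : rowKeys) {
            partitions.get(partitionFor(key, splitPoints)).add(key);
        }
        // Each HFile must contain keys in ascending unsigned byte order.
        for (List<byte[]> partition : partitions) {
            partition.sort((a, b) -> Arrays.compareUnsigned(a, b));
        }
        return partitions;
    }
}
```

Because every partition maps to exactly one region and is already sorted, each partition can be written out as an HFile that a single region can adopt directly, which is what keeps the region servers out of the write path.
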
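Tune the HBase configuration: adjust the HBase configuration to handle large data volumes. Key parameters include:
- hbase.regionserver.handler.count: increase the number of handlers to allow more concurrent operations.
- hbase.regionserver.global.memstore.upperLimit (named hbase.regionserver.global.memstore.size in newer HBase releases): increase the memstore size to buffer more writes.
- hbase.hstore.blockingStoreFiles: raise the number of store files a store may accumulate before writes to the region are blocked until compaction catches up.
Monitor and scale the HBase cluster: make sure the HBase cluster is scaled appropriately for the load. Monitor cluster performance and add more region servers when necessary.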

Clean the HFile output path: make sure the HFile output directory is removed if it already exists before new files are generated.

public String getHFilePath(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    String fileID = fileID();
    String pathStr = fs.getWorkingDirectory().toString() + "/hfile-gen" + "/" + fileID + "/";
    Path hfileGenPath = new Path(pathStr);
    if (fs.exists(hfileGenPath)) {
        LOG.info("\n Delete the path where the hfile is generated,path {} ", pathStr);
        fs.delete(hfileGenPath, true);
    }
    return pathStr;
}

Bulk load the HFiles: finally, use the bulk load feature to load the generated HFiles into HBase.

public void loadFiles(String path) {
    try {
        // BulkLoad HFile to HBase
        sinkToHBase.loadHfiles(path, getTableName());
    } catch (Exception e) {
        LOG.error(" Failed to load hfiles", e);
    }
}

By following these practices, you can import large volumes of data into HBase efficiently and minimize the risk of hitting the "rest server is too busy to write" error^[1].

To continue talking to Dosu, mention @dosu.

zhaorui9303 commented 1 month ago

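Without adjusting the rate_limit.write and batch.max_write_ratio parameters, the import also throws an error like the one below. After adjusting the parameters, the error no longer occurs. Is this kind of error caused by writing to HBase too fast?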

dosubot[bot] commented 1 month ago

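Yes, an InsertException is usually caused by writing to HBase too fast. To optimize the import and avoid this error while keeping performance high, consider the following suggestions:

Retry settings: increase the number of retries and the retry interval for failed inserts.
Error handling: set the maximum number of parse errors and insert errors allowed.
Timeout settings: adjust the timeout for insert operations.

Here is an example configuration: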

{
  "import-settings": {
    "requestTimesWhenInterpolationFailed": 5,  // number of retries
    "requestTicksWhenInterpolationFailed": 2,  // retry interval (seconds)
    "maximumAnalyzedErrorRow": 100,            // maximum number of parse errors allowed
    "maxiumInterpolateErrorRow": 50,           // maximum number of insert errors allowed
    "InterpolationTimeout": 30                 // insert timeout (seconds)
  }
}

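By controlling retry behavior and error tolerance, these settings help manage the load on HBase, reducing the occurrence of InsertException while keeping performance high^[1].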
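
The retry count and retry interval described above can be pictured with a generic retry loop. The following is a minimal sketch only: `RetrySketch`, `BatchWriter`, and `writeWithRetry` are hypothetical names invented for this illustration and are not part of the HugeGraph loader, whose actual retry handling may differ.

```java
/** Illustrative sketch: retry a failing batch write with a fixed pause. */
final class RetrySketch {

    /** A single batch write that may fail, for example when the server is busy. */
    interface BatchWriter {
        void write() throws Exception;
    }

    /**
     * Runs the write once and retries up to maxRetries more times,
     * sleeping retryIntervalMillis between attempts.
     * Returns the number of attempts used; rethrows the last failure
     * once the retries are exhausted.
     */
    static int writeWithRetry(BatchWriter writer, int maxRetries,
                              long retryIntervalMillis) throws Exception {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                writer.write();
                return attempt;
            } catch (Exception e) {
                if (attempt > maxRetries) {
                    throw e;
                }
                // Pause so the backend has time to drain its write queue.
                Thread.sleep(retryIntervalMillis);
            }
        }
    }
}
```

A longer interval gives the backend more time to recover between attempts, at the cost of lower throughput while the retries are in progress.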


JackyYangPassion commented 1 month ago

Please refer to the following two resources:

loader doc
bulkload to HBase pr

With either of the approaches above, an import of a billion or more vertices and edges can complete within hours @zhaorui9303

cc @haohao0103

JackyYangPassion commented 1 month ago

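We have roughly 1 billion records of incremental data per day, and the underlying storage is HBase. When importing with the loader tool, after running for a while we hit "rest server is too busy to write". I tried limiting the import rate (rate_limit.write=10000) and raising the share of threads used for batch writes (batch.max_write_ratio=90). With these settings the import no longer fails, but import performance becomes very poor. Are there tuning recommendations for importing data at this scale? The goal is to maximize import throughput while keeping the import error-free.

Here you first need to determine whether the write bottleneck is HBase, HugeServer, or the Loader.

If it is the Server, you can put nginx or HAProxy in front of several servers for load balancing; if it is the Loader, you can use Spark.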
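
What a load balancer such as nginx or HAProxy does in front of several servers can be pictured with a small round-robin selector. This is a minimal sketch for illustration; the class name `RoundRobinEndpoints` and the endpoint URLs in the usage note are hypothetical, and a real deployment would configure the proxy rather than write this code.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch: hand out server endpoints in round-robin order. */
final class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicLong counter = new AtomicLong();

    RoundRobinEndpoints(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("need at least one endpoint");
        }
        this.endpoints = List.copyOf(endpoints);
    }

    /** Returns the next endpoint; safe to call from multiple writer threads. */
    String next() {
        long ticket = counter.getAndIncrement();
        int index = (int) Math.floorMod(ticket, (long) endpoints.size());
        return endpoints.get(index);
    }
}
```

For example, with the hypothetical endpoints `http://server-a:8080` and `http://server-b:8080`, successive calls to `next()` alternate between the two, so each server receives about half of the write requests.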

zhaorui9303 commented 1 month ago

Regarding the error in the screenshot posted above: does that error indicate that the write bottleneck is in HBase? @JackyYangPassion

apache / incubator-hugegraph