TGSAI / mdio-python

Cloud native, scalable storage engine for various types of energy data.
https://mdio.dev/
Apache License 2.0
37 stars 13 forks

New "Cloud Native" mode for ingesting remote files from a cloud environment #467

Closed tasansal closed 1 week ago

tasansal commented 1 week ago

Summary

A new environment variable, MDIO__IMPORT__CLOUD_NATIVE, trades extra bandwidth use for fewer latency-bound random reads. It has been added to the documentation as well.

It is only helpful in a high-throughput environment, such as when the data and the ingestion machine(s) are both in the cloud.

Values that will enable it are {"True", "1", "true"}. For instance:

$ export MDIO__IMPORT__CLOUD_NATIVE="true"
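As a minimal sketch of how such a flag could be interpreted on the Python side, the snippet below checks the variable against the three enabling values listed above. The helper name `cloud_native_enabled` is hypothetical and is not taken from the MDIO code base; treating every other value (including unset) as disabled is an assumption.

```python
import os

# Values the PR description lists as enabling the mode.
TRUTHY = {"True", "1", "true"}


def cloud_native_enabled(environ=os.environ):
    """Return True if MDIO__IMPORT__CLOUD_NATIVE holds an enabling value.

    Hypothetical helper for illustration; any other value, or an unset
    variable, is assumed to leave the mode disabled.
    """
    return environ.get("MDIO__IMPORT__CLOUD_NATIVE", "") in TRUTHY
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.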

Details

When we scan the headers of a remote SEG-Y file, the ideal case is to read ONLY the headers of each trace to minimize bandwidth requirements. However, this causes millions of small requests, which is a performance bottleneck even with multiprocessing or threading. If the client has a very slow internet connection, header-only reading is still acceptable. It is also fine when reading local files from an SSD; mechanical drives may still struggle with the random reads and can benefit from the flag.
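To make the request-count argument concrete, here is a rough back-of-the-envelope sketch. It assumes a fixed-length SEG-Y file with 240-byte trace headers, no extended textual headers, one request per trace for header-only reads, and a hypothetical 64 MiB buffer for buffered reads. The function names and the chunk size are illustrative assumptions, not MDIO settings.

```python
import math

TRACE_HEADER_BYTES = 240  # trace header size defined by the SEG-Y standard


def header_only_requests(num_traces):
    """One small ranged read per trace header (assumed access pattern)."""
    return num_traces


def buffered_requests(num_traces, samples_per_trace,
                      bytes_per_sample=4, chunk_bytes=64 * 1024**2):
    """Number of large sequential reads needed to cover all trace data."""
    trace_bytes = TRACE_HEADER_BYTES + samples_per_trace * bytes_per_sample
    total_bytes = num_traces * trace_bytes
    return math.ceil(total_bytes / chunk_bytes)


# Hypothetical survey: 10 million traces of 1500 four-byte samples each.
print(header_only_requests(10_000_000))     # 10000000 small requests
print(buffered_requests(10_000_000, 1500))  # 930 large requests
```

Under these assumptions the buffered scan moves about 26 times more bytes (6240 per trace instead of 240) but issues several orders of magnitude fewer requests, which is the tradeoff the flag is meant to exploit.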

This MDIO__IMPORT__CLOUD_NATIVE flag enables buffered reading of the file regardless of where the ingestion occurs. If the file is in the cloud and the ingestion machine(s) are also in the cloud, with high throughput between the machine(s) and the object store, this flag works very well. The only disadvantage is that it reads the file twice (just like any other buffered read). However, this tradeoff significantly increases ingestion performance in a cloud-native environment, and at a lower cost (fewer requests to the object).
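The idea of a buffered header scan can be sketched as follows. This is not MDIO's implementation: the function name, its parameters, and the fixed trace length are assumptions made for illustration, and the file-like object is assumed to return the full requested size on every read except at end of file.

```python
import io

TRACE_HEADER_BYTES = 240  # trace header size defined by the SEG-Y standard


def scan_headers_buffered(fileobj, trace_bytes, traces_per_read,
                          data_offset=3600):
    """Yield each trace's 240-byte header using large sequential reads.

    Hypothetical sketch: assumes fixed-length traces that start at
    data_offset (3200-byte textual plus 400-byte binary file header).
    """
    fileobj.seek(data_offset)
    while True:
        # One big read covers many whole traces, samples included.
        block = fileobj.read(trace_bytes * traces_per_read)
        if not block:
            break
        # Step through the block one trace at a time, keeping only headers.
        for start in range(0, len(block) - trace_bytes + 1, trace_bytes):
            yield block[start:start + TRACE_HEADER_BYTES]


# Synthetic file: 5 traces, each a 240-byte header plus 40 bytes of samples.
traces = b"".join(bytes([i]) * 240 + b"\xff" * 40 for i in range(5))
synthetic = io.BytesIO(b"\x00" * 3600 + traces)
headers = list(scan_headers_buffered(synthetic, trace_bytes=280,
                                     traces_per_read=2))
print(len(headers))  # 5
```

The sample bytes are downloaded and then discarded during the scan, which is why this approach only pays off when bandwidth is plentiful and per-request latency is the bottleneck.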