mb010 commented 7 months ago

Data scaling infrastructure

Making design choices that impact the usability of astropile when it comes to large data. The data volumes we are discussing and planning to have are not normal, and we cannot treat them naively without massive downsides.

Contacts: @mb010,
Participants: @mb010,

Goals and deliverable

Having a clear hosting and access plan which has been benchmarked and validated on at least two HPC systems.

Resources needed

Access to HPC facilities of our choice and a large test data set. Interest, reading skills for various documentations of different softwares are all that is required. Having experience with data pipelines or TB scale data products would be a benefit to thinking about this.

Detailed description

If we want this data to be used, we have to think about how it is going to be accessed. Data loading is one of the main bottlenecks to training neural networks. As such, we should carefully consider how this data can be hosted and how it can be processed in an effective and practical way to enable and allow for different applications.

To get an idea of how smaller data sets can cause issues, have a read of this issue I found which highlights the differences clever data infrastructure can have on loading.

Some functionality to consider include:

Data sharding
Memory mapping
Huggingface hosted datasets streaming capability: loading in streaming
academic torrents
- Should we host data like this?
- How do we handle data updates?
- The 1 vs multi torrent discussion: Do we have one torrent and so encourage everyone to have all of the data (+better upload speeds for niche data subsets) (- not everyone will be able to participate) or do we have torrents for (reasonable) subsets of data so that institutes can host their science case without requiring >10TB storage for any normal user (+ more users and large institutes can hots all of the data) (-Potentially resulting in niche subsets to be hosted at suboptimal data rates as they are not required for many science cases)?
- Software infra to automatically select appropriate magnet links for specific data selection.
Data format currently expected to be HDF5, should it be? options. Arrow might be more useful depending on expected usage.

Smith42 commented 7 months ago

Thanks Micah, some thoughts from me:

Torrent vs HuggingFace

A two-pronged approach to hosting is probably best, with HuggingFace (should be okay to O(10)TB) and torrents. HuggingFace would give us a central space for data consolidation, hosted in a place where DL practitioners are already familiar. A great benefit of HuggingFace is its data streaming feature which we would not (easily) get with torrents.
- HuggingFace would require us to use either parquet files, or to push our hdf5 backend fork upstream however.

Torrenting thoughts

I think that several torrents is the way to go here for dissemination of the dataset. Best way to do this is probably serving subsets of our parquet, hdf5 or *arrow files grouped roughly by subject (i.e. galaxy imagery, spectra, text, timeseries...).
Data updates would be tricky, but we can leave that to downstream scripts and simply append to the torrents as more data is added (and of course update the central HF dataset in the usual way).
Torrenting may not be allowed on some datacentres, and would induce substantial egress charges in cloud providers. Something to think about as we look for volunteer HPC nodes to host tens of terabytes.

File formatting

An anecdote: in my day job I have been mostly using hdf5, and it works great for terabyte-scale datasets. However, we found it a little clunky to use within python and so are currently switching over to parquet/arrow because of the better interoperability with pandas, HuggingFace, and the wider ecosystem.

mb010 commented 7 months ago

Splitting this issue soon (links to be added when the respective issues are created)

I guess these issues can all be split to make the discussions clearer. This thread can be used for the hack on benchmarking data speeds under different infra structures (i.e. HF datasets streaming vs local HDD vs single torrent host and testing of memmapping / sharding if possible). This would leave file format discussions for another / issue.

HF Datasets Streaming and Torrenting

I agree on using both HF datasets and torrents, which should be parameterisable in the data loaders provided by astropile. I don't intend us to use any paid cloud providers. If we don't have a single facility that can host the entire pile then there are other larger issues than whether or not we can manage the torrent files. I am optimistic in having multiple full mirrors: I think, as @EiffL was saying yesterday, that multiple large collaborations or archival data providers would be keen to host this data set for their researchers.

Smith42 commented 7 months ago

I can fairly easily benchmark local HF data vs HF streaming via code in AstroPT. This data is all in arrow format as per HF guidelines. From previous training runs the latency is surprisingly similar, but I will get some hard numbers and post here.

Smith42 commented 7 months ago

In https://github.com/AstroPile/AstroPile_prototype/blob/hf_benchmarking/benchmarks/data_throughput.py there is a script to check HF local vs HF remote streaming speed. I ran the code on my local machine, and local HF parquets allow us a ~4x speedup. Bare in mind that this difference can be mitigated with multiple workers, and will be less significant if we need to preprocess the data before model ingestion.

Here is the code output for those can can't run the code:

gal/s local 530.2542994290433
gal/s remote 133.98746635196582
ratio 0.2526852992917515

mwalmsley commented 7 months ago

Interested in following along with this :)

EiffL commented 7 months ago

Interesting! So, note that I think the reason for slow local access from HF in non-parquet format probably has to do with the way HF handles sequences. I found in some experiments that if you store a sequence as an array it loads much faster.

mb010 commented 7 months ago

I think these are actually experiments to be run for a version 2 potentially given our time constraints? With all of the discussions around various formatting options for the different science data, I'm pretty sure that benchmarking will need to be done for each format that we eventually add. I think I will draw up some benchmarks that we will have to run eventually but leave them for now until everything else settles a little bit.

The experiments we should run are against streaming vs specific dataset subsets. We should be able to infer what sort of formats work for our use case. We should test mounted: SSD vs HDD vs direct from HF, single vs various threadcounts, parquet vs internal custom formats (if applicable).

AstroPile / FlatironMeeting2024

[Infrastructure] Data scaling and hosting infrastructure #21

Data scaling infrastructure

Goals and deliverable

Resources needed

Detailed description

Torrent vs HuggingFace

Torrenting thoughts

File formatting

Splitting this issue soon (links to be added when the respective issues are created)

HF Datasets Streaming and Torrenting