AstroPile / FlatironMeeting2024

AstroPile meet-up at the Flatiron Institute
https://astropile.github.io/FlatironMeeting2024/
MIT License

[Infrastructure] Data scaling and hosting infrastructure #21

Open mb010 opened 7 months ago

mb010 commented 7 months ago

Data scaling infrastructure

Making design choices that impact the usability of AstroPile when it comes to large data. The data volumes we are discussing and planning to have are far beyond the norm, and treating them naively would come with massive downsides.

Contacts: @mb010,
Participants: @mb010,

Goals and deliverable

Having a clear hosting and access plan which has been benchmarked and validated on at least two HPC systems.

Resources needed

Access to HPC facilities of our choice and a large test data set. Interest and the ability to read the documentation of various software packages are all that is required. Experience with data pipelines or TB-scale data products would be a benefit when thinking about this.

Detailed description

If we want this data to be used, we have to think about how it is going to be accessed. Data loading is one of the main bottlenecks to training neural networks. As such, we should carefully consider how this data can be hosted and processed in an effective and practical way to enable different applications.

Some functionality to consider includes:

Smith42 commented 7 months ago

Thanks Micah, some thoughts from me:

Torrent vs HuggingFace

Torrenting thoughts

File formatting

mb010 commented 7 months ago

Splitting this issue soon (links to be added when the respective issues are created)

I guess these issues can all be split to make the discussions clearer. This thread can be used for the hack on benchmarking data speeds under different infrastructures (i.e. HF datasets streaming vs local HDD vs single torrent host, and testing of memmapping / sharding if possible). This would leave file format discussions for a separate issue.

HF Datasets Streaming and Torrenting

I agree on using both HF datasets and torrents, which should be parameterisable in the data loaders provided by astropile. I don't intend us to use any paid cloud providers. If we don't have a single facility that can host the entire pile then there are other larger issues than whether or not we can manage the torrent files. I am optimistic in having multiple full mirrors: I think, as @EiffL was saying yesterday, that multiple large collaborations or archival data providers would be keen to host this data set for their researchers.
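To make the "parameterisable in the data loaders" idea concrete, here is a minimal sketch of a source-selection front-end. The function name, dataset path, and local root are all hypothetical, not AstroPile's actual interface; the point is only that the same call site can dispatch between HF streaming and a full local mirror (e.g. one seeded via torrent).

```python
# Hypothetical sketch: resolve a user-facing source name into the
# keyword arguments a loader (e.g. datasets.load_dataset) would take.
# "AstroPile/prototype" and "/data/astropile" are placeholder names.
from pathlib import Path

def resolve_source(name: str, local_root: str = "/data/astropile") -> dict:
    """Map a source name to loader arguments.

    'hf-streaming' -> stream records from the Hugging Face hub
    'local'        -> read a full local mirror (torrent-seeded or otherwise)
    """
    if name == "hf-streaming":
        return {"path": "AstroPile/prototype", "streaming": True}
    if name == "local":
        return {"path": str(Path(local_root)), "streaming": False}
    raise ValueError(f"unknown source: {name}")
```

With something like this, mirrors hosted by different collaborations would just be different `local_root` values, and no training code needs to change between a streaming run and a mirrored run.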

Smith42 commented 7 months ago

I can fairly easily benchmark local HF data vs HF streaming via code in AstroPT. This data is all in arrow format as per HF guidelines. From previous training runs the latency is surprisingly similar, but I will get some hard numbers and post here.

Smith42 commented 7 months ago

In https://github.com/AstroPile/AstroPile_prototype/blob/hf_benchmarking/benchmarks/data_throughput.py there is a script to check HF local vs HF remote streaming speed. I ran the code on my local machine, and local HF parquets give us a ~4x speedup. Bear in mind that this difference can be mitigated with multiple workers, and will be less significant if we need to preprocess the data before model ingestion.

Here is the code output for those who can't run the code:

gal/s local 530.2542994290433
gal/s remote 133.98746635196582
ratio 0.2526852992917515
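For readers who want to reproduce this kind of number themselves, the measurement reduces to timing how fast examples can be pulled from an iterable. This is a generic sketch, not the actual `data_throughput.py` script linked above; it works on anything iterable, so a local `Dataset` and a streaming `IterableDataset` can be compared on the same code path.

```python
import time
from itertools import islice

def galaxies_per_second(dataset, n: int = 1000) -> float:
    """Consume up to n examples from any iterable dataset and return
    the observed throughput (examples per second). Using the same
    function for local and streaming datasets keeps the comparison fair.
    """
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(iter(dataset), n))
    elapsed = time.perf_counter() - start
    return consumed / elapsed
```

The ratio reported above is then just `galaxies_per_second(remote) / galaxies_per_second(local)`.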
mwalmsley commented 7 months ago

Interested in following along with this :)

EiffL commented 7 months ago

Interesting! Note that I think the reason for slow local access from HF in non-parquet format probably has to do with the way HF handles sequences. I found in some experiments that if you store a sequence as an array, it loads much faster.
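The mechanism behind this (as I understand it, so treat this as an assumption) is that a ragged "sequence of sequences" column must be materialised element by element, while a fixed-shape array column sits in one contiguous buffer that can be sliced or memory-mapped without copying. In HF datasets terms this is roughly `Sequence(Sequence(Value(...)))` vs `Array2D(shape=..., dtype=...)`. A small NumPy illustration of the difference:

```python
import numpy as np

# Illustrative only: nested Python lists stand in for a ragged
# sequence feature; a fixed-shape ndarray stands in for an Array2D
# column. The shapes (8 images of 64x64) are arbitrary.
n, h, w = 8, 64, 64
as_lists = [[[0.0] * w for _ in range(h)] for _ in range(n)]
as_array = np.zeros((n, h, w), dtype=np.float32)

# Slicing the array form yields a zero-copy view into one buffer,
# which is what makes fixed-shape storage fast to load.
view = as_array[3]
assert view.base is as_array
```

This also suggests that any benchmark of the formats should separate the cost of decoding ragged columns from the cost of raw I/O.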

mb010 commented 7 months ago

Given our time constraints, I think these are experiments to run for a potential version 2. With all of the discussions around various formatting options for the different science data, I'm pretty sure that benchmarking will need to be done for each format that we eventually add. I will draw up some benchmarks that we will have to run eventually, but leave them for now until everything else settles a little bit.

The experiments we should run compare streaming against specific dataset subsets, from which we should be able to infer which formats work for our use case. We should test: mounted SSD vs HDD vs direct from HF; single vs various thread counts; and parquet vs internal custom formats (if applicable).