gbrammer / eazy-py

Pythonic photometric redshift tools based on EAZY
MIT License

Two requests about eazypy architecture for cloud computing and scalability #39

Open shongscience opened 9 months ago

shongscience commented 9 months ago

I am a principal scientist at a Korean astronomy institute, and I am especially interested in applying big data technologies to astronomical problems.

I found two issues when trying to run eazypy on my Spark cluster.

[1] Local file access for filters and parameters

When running programs in the cloud, we do not have a local file system; instead we have a "bucket", i.e. cloud object storage. Hence, all filters and SED parameters need to be in-memory objects or objects that can be stored in the cloud.

Your approach of using symbolic links is not well suited to running eazypy in the cloud or on a big data platform.
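To illustrate what I mean by an in-memory object, here is a minimal sketch that parses a filter curve from raw bytes instead of from a path on disk. It assumes a plain two-column ASCII table (wavelength, throughput); `parse_curve` is a hypothetical helper and not part of eazy-py, and the actual eazy-py filter file layout may differ.

```python
import io


def parse_curve(raw: bytes):
    """Parse a two-column (wavelength, throughput) ASCII table held in memory.

    Hypothetical helper for illustration only; not an eazy-py function.
    """
    wave, thru = [], []
    for line in io.StringIO(raw.decode("utf-8")):
        line = line.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith("#"):
            continue
        w, t = line.split()[:2]
        wave.append(float(w))
        thru.append(float(t))
    return wave, thru


# Bytes as they might be returned by any object-store client (made-up values)
raw = b"# wavelength throughput\n4000.0 0.0\n5000.0 0.8\n6000.0 0.0\n"
wave, thru = parse_curve(raw)
```

Because the function only needs bytes, the same code works whether the data came from a bucket, a database, or a local file opened in binary mode.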

[2] Hard-wired single-node, multi-thread optimization

Unfortunately, I have found that many astronomical tools are heavily optimized for "single node" + "multi-thread".

This specific optimization is not good for writing "scalable" code.

A simple architecture in which a single thread fits SEDs one object at a time, rather than loading thousands of objects and running them on multiple threads, could be enough to massively parallelize the code across thousands or millions of threads simultaneously on a cluster of hundreds of nodes using a big data platform.

I do not know whether this can be applied here, but single-node + multi-thread optimization is good for neither a simple single-thread run nor a massive multi-node run.
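As a rough sketch of the "one object, one thread" idea, the function below fits a single object and has no shared state, so a cluster framework can call it independently for every row. The fit itself is deliberately simplified (the best chi-squared scaling of a single template) and is not the eazy-py algorithm; `fit_one`, the template set, and the catalog values are all made up for illustration.

```python
def fit_one(obj, templates):
    """Fit one object: pick the template whose best scaling gives the lowest chi-squared.

    Simplified stand-in for a real SED fit; not the eazy-py method.
    """
    flux, err = obj["flux"], obj["err"]
    best = None
    for name, model in templates.items():
        # Analytic best-fit amplitude for a single scaled template
        num = sum(f * m / e**2 for f, m, e in zip(flux, model, err))
        den = sum(m * m / e**2 for m, e in zip(model, err))
        amp = num / den
        chi2 = sum(((f - amp * m) / e) ** 2 for f, m, e in zip(flux, model, err))
        if best is None or chi2 < best[2]:
            best = (name, amp, chi2)
    return {"id": obj["id"], "template": best[0], "amp": best[1], "chi2": best[2]}


# Toy templates and catalog (made-up numbers)
templates = {"flat": [1.0, 1.0, 1.0], "red": [1.0, 2.0, 3.0]}
catalog = [
    {"id": 1, "flux": [2.0, 2.0, 2.0], "err": [0.1, 0.1, 0.1]},
    {"id": 2, "flux": [0.5, 1.0, 1.5], "err": [0.1, 0.1, 0.1]},
]

# Serial run; each call is independent of the others
results = [fit_one(obj, templates) for obj in catalog]
```

On Spark, the same function could in principle be applied with something like `rdd.map(lambda obj: fit_one(obj, templates))`, leaving the scheduling across nodes to the platform rather than to threads managed inside the fitting code.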

gbrammer commented 4 months ago

Thank you for the feedback @shongscience. I agree that the file I/O and multi-threading are quite naive in the current version of eazy-py, so I'd be very interested in any suggestions on how to improve things to work more efficiently in cloud / cluster environments.

gbrammer commented 1 month ago

See https://github.com/gbrammer/eazy-py/pull/46 for new behavior to avoid the symbolic links. Updates to multithreading TBD.