EBISPOT / SumStats

Summary statistics with HDF5

Review technology for serving summary stats data #191

Closed jdhayhurst closed 1 year ago

jdhayhurst commented 2 years ago

Given the growth of the data and the cross-study query requirements, the production suitability of HDF5 is in question. Let's review/research some alternatives. Inspiration from: GWAS Atlas and MR-Base.

ljwh2 commented 2 years ago

Another example: https://yanglab.westlake.edu.cn/resources/ukb_fastgwa/imp_binary/ but note they say their application is based on source code from PheWeb: https://pheweb.org/UKB-TOPMed/

sprintell commented 2 years ago

@jdhayhurst

SOME CASE STUDIES

MRC Integrative Epidemiology Unit (IEU), University of Bristol, Elasticsearch data infrastructure: https://gwas.mrcieu.ac.uk/about

Botify, high-speed search over 600+ TB with Elasticsearch: https://medium.com/botify-labs/experience-working-with-600-tb-elasticsearch-cluster-b6b5a4fa9127

Apache Spark @Scale: A 60 TB+ production use case: https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html

Architecture of Spark and Kafka for Scaleable Terabyte Data Processing: https://www.youtube.com/watch?v=hf5isv0gdUU

What Spark can do: https://www.youtube.com/watch?v=ymtq8yjmD9I

TUTORIALS FOR TRYING THINGS OUT:

ELASTICSEARCH & KIBANA:
Step 1 (installation): https://www.youtube.com/watch?v=qgjsD5kCrFo
Step 2: https://www.youtube.com/playlist?list=PLGZAAioH7ZlPczCGiSl_J-fvo5ZMhR-Dy
Step 3: https://www.youtube.com/playlist?list=PLGZAAioH7ZlO7AstL9PZrqalK0fZutEXF
Step 4: https://www.tutorialspoint.com/elasticsearch/index.htm
Step 5: https://www.youtube.com/playlist?list=PLa6iDxjj_9qVaf5CsXWP-GAgZoVwKowjx
https://www.youtube.com/watch?v=e5awiVnkuEc

Elastic Cloud on Kubernetes: https://www.youtube.com/watch?v=qjnT0pU0IRo&list=PL34sAs7_26wOgpqMW_0_E95k9tq2VkMOZ&index=16

Spring Data Elasticsearch:
https://www.youtube.com/watch?v=dlChXjE7IHw
https://www.youtube.com/playlist?list=PLXy8DQl3058OoJqGLFdqoBkBKm2T0kS9B

APACHE SPARK:
https://www.youtube.com/watch?v=F8pyaR4uQ2g
https://www.youtube.com/watch?v=TgiBvKcGL24
https://www.youtube.com/watch?v=1kMcBH4apao&list=PL589M8KPPT1YP2lN8nXbqfheRwAnzdDLx
https://www.youtube.com/watch?v=hf5isv0gdUU&list=PLYUMVUCNosJcw13MvMJClzw9J_HMef5fk
Spark-java: https://www.youtube.com/watch?v=cu2E0sSlWsY

APACHE KAFKA:
For beginners: https://www.youtube.com/playlist?list=PLa6iDxjj_9qVGTh3jia-DAnlQj9N-VLGp
Kafka Spark: https://www.youtube.com/watch?v=65lHphtrfo0

jdhayhurst commented 2 years ago

Hi @sprintell, thanks for the links. I had a look through, although I haven't watched all the tutorials. I think I can sort of see the picture you're painting here. Is it that we can use ES to power the search and Spark to convert the sumstats to ES JSON (assuming Spark makes this faster)? I don't quite see where Kafka fits into the picture. Would that be for brokering search requests, and would they then be async?

MRC have already achieved it with ES, so we know it can be done. I think the biggest challenge here has got to be provisioning an ES cluster that we can scale, because our data is growing so fast. The Botify setup in the link you shared holds about 20X our data and requires some fairly heavy infrastructure (although they have since decided to move away from ES). I don't know whether EBI supports an ES cluster that we can plug into; if not, we would need to set up our own cluster (is that what you're suggesting with the ECK link?) or use one that is hosted elsewhere.

There's a bit more about benchmarking and sizing here: https://www.elastic.co/blog/benchmarking-and-sizing-your-elasticsearch-cluster-for-logs-and-metrics and here: https://www.elastic.co/blog/found-sizing-elasticsearch and also here: https://www.cncf.io/blog/2021/03/25/how-to-build-an-elastic-search-cluster-for-production/
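To make the "convert the sumstats to ES JSON" step concrete, here is a minimal sketch in plain Python (no Spark) of turning tab-separated sumstats rows into the newline-delimited JSON that the Elasticsearch `_bulk` endpoint accepts: one action line followed by one document line per record. The column names, the index name `sumstats` and the study ID `GCST000001` are assumptions for illustration only, not our actual schema. A Spark job would presumably apply the same per-row mapping in parallel.

```python
import csv
import io
import json


def sumstats_to_bulk_ndjson(tsv_text, index_name, study_id):
    """Yield NDJSON lines for the Elasticsearch _bulk API from sumstats TSV text.

    Column names below are hypothetical placeholders for the real schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        doc = {
            "study_id": study_id,
            "variant_id": row["variant_id"],
            "chromosome": row["chromosome"],
            "base_pair_location": int(row["base_pair_location"]),
            "p_value": float(row["p_value"]),
        }
        # The bulk format pairs each document with a preceding action line.
        yield json.dumps({"index": {"_index": index_name}})
        yield json.dumps(doc)


# Two made-up rows, purely to exercise the function.
sample = (
    "variant_id\tchromosome\tbase_pair_location\tp_value\n"
    "rs123\t1\t1000\t5e-8\n"
    "rs456\t2\t2000\t0.01\n"
)

lines = list(sumstats_to_bulk_ndjson(sample, "sumstats", "GCST000001"))
# The bulk request body must end with a newline.
payload = "\n".join(lines) + "\n"
print(payload)
```

The resulting `payload` is what would be POSTed to `_bulk` with content type `application/x-ndjson`. Tagging every document with a study ID is one possible way to support cross-study queries from a single index, but whether that scales for our data volume is exactly what the sizing links above would need to answer.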

jdhayhurst commented 2 years ago

Started a document here: https://docs.google.com/document/d/1CIQG8UI3pqkja-XjeKpXcbgiIheSFc6AJFa_-p6rox0/edit#

jdhayhurst commented 2 years ago

Drawn up the entire sumstats workflow: https://app.diagrams.net/#G1MZ89ysQnT27HkbmL-Y5lhHptBP3PKd1Y

ljwh2 commented 2 years ago

Just adding another example here: PheWeb, which is mentioned above. There is a parent page with a bigger collection of sumstats: https://pheweb.sph.umich.edu/ You first have to choose a sample set to search within (e.g. UKB-TOPMED-imputed, FinnMetSeq); the maximum number of phenotypes available is 2400.