jdhayhurst closed this 1 year ago
Another example: https://yanglab.westlake.edu.cn/resources/ukb_fastgwa/imp_binary/ but note they say their application is based on source code from PheWeb: https://pheweb.org/UKB-TOPMed/
@jdhayhurst
MRC Integrative Epidemiology Unit (IEU), University of Bristol Elastic Search data infrastructure https://gwas.mrcieu.ac.uk/about
Botify 600+TB High speed search with ElasticSearch https://medium.com/botify-labs/experience-working-with-600-tb-elasticsearch-cluster-b6b5a4fa9127
Apache Spark @Scale: A 60 TB+ production use case: https://databricks.com/blog/2016/08/31/apache-spark-scale-a-60-tb-production-use-case.html
Architecture of Spark and Kafka for Scaleable Terabyte Data Processing: https://www.youtube.com/watch?v=hf5isv0gdUU
What Spark can do: https://www.youtube.com/watch?v=ymtq8yjmD9I
ELASTIC SEARCH & KIBANA:
Step 1 Installation: https://www.youtube.com/watch?v=qgjsD5kCrFo
Step 2 - https://www.youtube.com/playlist?list=PLGZAAioH7ZlPczCGiSl_J-fvo5ZMhR-Dy
Step 3: https://www.youtube.com/playlist?list=PLGZAAioH7ZlO7AstL9PZrqalK0fZutEXF
Step 3: https://www.tutorialspoint.com/elasticsearch/index.htm
Step 4: https://www.youtube.com/playlist?list=PLa6iDxjj_9qVaf5CsXWP-GAgZoVwKowjx
https://www.youtube.com/watch?v=e5awiVnkuEc
Elastic Cloud on Kubernetes: https://www.youtube.com/watch?v=qjnT0pU0IRo&list=PL34sAs7_26wOgpqMW_0_E95k9tq2VkMOZ&index=16
Spring Data Elastic Search: https://www.youtube.com/watch?v=dlChXjE7IHw https://www.youtube.com/playlist?list=PLXy8DQl3058OoJqGLFdqoBkBKm2T0kS9B
APACHE SPARK https://www.youtube.com/watch?v=F8pyaR4uQ2g https://www.youtube.com/watch?v=TgiBvKcGL24 https://www.youtube.com/watch?v=1kMcBH4apao&list=PL589M8KPPT1YP2lN8nXbqfheRwAnzdDLx https://www.youtube.com/watch?v=hf5isv0gdUU&list=PLYUMVUCNosJcw13MvMJClzw9J_HMef5fk Spark-java: https://www.youtube.com/watch?v=cu2E0sSlWsY
APACHE KAFKA: For Beginners: https://www.youtube.com/playlist?list=PLa6iDxjj_9qVGTh3jia-DAnlQj9N-VLGp Kafka Spark: https://www.youtube.com/watch?v=65lHphtrfo0
Hi @sprintell, thanks for the links. I had a look through, although I haven’t watched all the tutorials. I think I can sort of see the picture you're painting here. Is it that we can use ES to power the search and Spark to convert the sumstats to ES JSON (assuming Spark makes this faster)? I don’t quite see where Kafka fits into the picture. Would that be for brokering search requests, in which case they would be async? MRC have already achieved it with ES, so we know it can be done. I think the biggest challenge here has got to be provisioning an ES cluster that we can scale, because our data is growing so fast. The Botify link you shared describes about 20X the data and requires some fairly heavy infrastructure (although they have decided to move away from ES). I don't know if EBI has support for an ES cluster that we can plug into; if not, we would need to set up our own cluster (is that what you're suggesting with the ECK link?) or use a cluster that is hosted elsewhere. There’s a bit more about benchmarking and sizing here https://www.elastic.co/blog/benchmarking-and-sizing-your-elasticsearch-cluster-for-logs-and-metrics and here https://www.elastic.co/blog/found-sizing-elasticsearch and also https://www.cncf.io/blog/2021/03/25/how-to-build-an-elastic-search-cluster-for-production/
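To make the "sumstats to ES JSON" step a bit more concrete, here is a minimal sketch in plain Python (no Spark) of what the conversion might look like. It turns tab-separated rows into the newline-delimited body that the Elasticsearch bulk API expects: an action line followed by a document line per record. The column names, the index name `sumstats`, the study ID `GCST000001` and the `study:variant` document ID scheme are all assumptions for illustration, not something we have agreed on. A Spark job would presumably apply the same per-row mapping in parallel.

```python
import csv
import io
import json


def sumstats_to_bulk_ndjson(tsv_text, index_name, study_id):
    """Convert tab-separated summary statistics into a bulk NDJSON body.

    Each input row becomes two lines: an "index" action and the document.
    Column names here are assumed, not taken from an agreed schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lines = []
    for row in reader:
        doc = {
            "study_id": study_id,
            "variant_id": row["variant_id"],
            "chromosome": row["chromosome"],
            "base_pair_location": int(row["base_pair_location"]),
            "p_value": float(row["p_value"]),
        }
        # Hypothetical ID scheme: one document per (study, variant) pair.
        action = {
            "index": {
                "_index": index_name,
                "_id": f"{study_id}:{row['variant_id']}",
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


example = (
    "variant_id\tchromosome\tbase_pair_location\tp_value\n"
    "rs123\t1\t1000\t5e-8\n"
)
print(sumstats_to_bulk_ndjson(example, "sumstats", "GCST000001"), end="")
```

Including the study ID in every document is what would allow cross-study queries against a single index, at the cost of repeating it on every row.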
Started a document here: https://docs.google.com/document/d/1CIQG8UI3pqkja-XjeKpXcbgiIheSFc6AJFa_-p6rox0/edit#
Drawn up the entire sumstats workflow: https://app.diagrams.net/#G1MZ89ysQnT27HkbmL-Y5lhHptBP3PKd1Y
Just adding another example here: PheWeb, which is mentioned above, has a parent page with a bigger collection of sumstats: https://pheweb.sph.umich.edu/ You first have to choose a sample set to search within (e.g. UKB-TOPMed-imputed, FinnMetSeq); the maximum number of phenotypes available in a set is 2400.
The growth of the data, the cross-study query requirements and the production suitability of HDF5 are in question. Let's review/research some alternatives. Inspiration from: GWAS Atlas, MR-Base