broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Try to reproduce hashing timeouts in a cromwell that's not being spammed on /stats #3712

Open danbills opened 6 years ago

danbills commented 6 years ago

Discussion #1

bshifaw [3:59 PM]
Hi Chris, 
The featured joint calling method is using NIO.
https://portal.firecloud.org/#methods/gatk/joint-discovery-gatk4/9/wdl
Is this the method you are referencing?

bshifaw [4:28 PM]
@vdauwera, just confirmed with @jsoto. The wdl isn’t using NIO when importing the GVCFs. This is due to a change in the wdl we decided to implement to best leverage the FC data model (using an array of input files instead of a sample name map file).

cwhelan [9:48 PM]
right, that’s the method i was using.

vdauwera [11:22 PM]
oooh that’s an interesting case that would benefit from the flexible data models work — this would be great to show @andreah

Discussion #2

cwhelan [11:17 AM]
ie it’s trying to localize each gvcf to each shard instance

tjeandet [11:17 AM]
do you have an idea of how many input files each shard has ?

cwhelan [11:17 AM]
555 samples

Takeaways

Run https://portal.firecloud.org/#methods/gatk/joint-discovery-gatk4/9/wdl in a non-production environment w/ 555 samples and try to reproduce issue w/ hashing timeouts.

We predict they will not occur, as cromwell production was seeing elevated CPU usage due to its /stats endpoint being hit repeatedly.
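
As a rough illustration of why 555 samples may be enough to stress file handling: if every shard localizes every gvcf, as described in Discussion #2, the number of file operations grows as samples × shards. The sketch below is only back-of-the-envelope arithmetic. The shard count is a hypothetical value (the thread does not give one), and the function name is made up for illustration.

```python
def total_localizations(num_samples: int, num_shards: int) -> int:
    """Count file localizations when every shard localizes every gvcf."""
    if num_samples < 0 or num_shards < 0:
        raise ValueError("counts must be non-negative")
    return num_samples * num_shards


NUM_SAMPLES = 555          # from the thread
HYPOTHETICAL_SHARDS = 50   # assumption: the real shard count is not stated

total = total_localizations(NUM_SAMPLES, HYPOTHETICAL_SHARDS)
print(f"{NUM_SAMPLES} samples x {HYPOTHETICAL_SHARDS} shards = {total} localizations")
```

If each localized input also needs its own hash lookup (an assumption about where the hashing timeouts come from), the same product would give a rough count of hash requests to expect during the reproduction run.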

danbills commented 6 years ago

@andy7i pinging you as this is also a great performance test