IARCbioinfo / needlestack

Multi-sample somatic variant caller
GNU General Public License v3.0

Advice about running on gcs? #190

Open njbernstein opened 3 years ago

njbernstein commented 3 years ago

Hi there,

Do you have any advice on running needlestack on a large number of samples on Google Cloud?

Any chance you already have a config for it?

Do all reads get loaded into memory at the same time?

Do you have a ballpark estimate of how much RAM would be necessary for 1,000 samples, or even 10,000 samples?

mfoll commented 3 years ago

Hi,

Sorry, we don't have a config ready for this. Maybe have a look at the Nextflow docs here: https://www.nextflow.io/docs/latest/google.html

The main parameter for controlling memory is nsplit: the genome (or the target regions, if you provide a BED file) is split into nsplit chunks. Each chunk runs as a separate job, in which reads are processed by samtools and converted into a text file that R loads into memory in full. The more you increase nsplit, the smaller this file will be.
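To make the effect of nsplit concrete, here is a minimal Python sketch of the splitting idea described above. It is only an illustration and not needlestack's actual code: the function name `split_regions`, the tuple layout, and the example coordinates are all hypothetical. It cuts a set of BED-style regions into nsplit chunks of near-equal total length, so the largest chunk (a rough proxy for the size of the per-job text file) shrinks as nsplit grows.

```python
from typing import List, Tuple

# Hypothetical BED-style region: (chrom, start, end), half-open coordinates.
Region = Tuple[str, int, int]


def split_regions(regions: List[Region], nsplit: int) -> List[List[Region]]:
    """Split regions into nsplit chunks of near-equal total length.

    Illustrative only; needlestack's own splitting logic may differ.
    """
    if nsplit < 1:
        raise ValueError("nsplit must be >= 1")
    total = sum(end - start for _, start, end in regions)
    # Cumulative offset (in bases) at which each chunk ends.
    bounds = [total * (i + 1) // nsplit for i in range(nsplit)]
    chunks: List[List[Region]] = [[] for _ in range(nsplit)]
    i = 0       # index of the chunk currently being filled
    offset = 0  # number of bases assigned so far
    for chrom, start, end in regions:
        pos = start
        while pos < end:
            # Skip chunks that are already full (empty ones if nsplit > total).
            while bounds[i] <= offset:
                i += 1
            take = min(end - pos, bounds[i] - offset)
            chunks[i].append((chrom, pos, pos + take))
            pos += take
            offset += take
    return chunks


def chunk_length(chunk: List[Region]) -> int:
    """Total number of bases covered by one chunk."""
    return sum(end - start for _, start, end in chunk)


if __name__ == "__main__":
    # Made-up target regions, not real data.
    targets = [("chr1", 0, 1000), ("chr2", 0, 500)]
    for n in (1, 3, 15):
        largest = max(chunk_length(c) for c in split_regions(targets, n))
        print(f"nsplit={n}: largest chunk covers {largest} bases")
```

The sketch only shows how the region covered per job shrinks. The actual RAM needed per job also depends on the number of samples and the sequencing depth, which it does not model.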