Closed jsmedmar closed 4 years ago
Are you using a high-depth/badloci BED file (-badloci)?
Prior to v3.3.0, every single read was compared against the file by calling the Perl tabix module, which would then do a disk read.
Under 3.2.0 we changed this to load the regions into an interval tree in memory. Benchmark details are in the PR:
https://github.com/cancerit/cgpPindel/pull/75
Obviously there is some variation depending on where you are reading from, in our case a Lustre filesystem. I would expect the gains to be smaller on SSD/NVMe.
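To illustrate why keeping the regions in memory avoids the per-read disk access, here is a minimal Python sketch. It is not the cgpPindel implementation (that is Perl and uses an interval tree); this swaps in a simpler structure, sorted merged intervals with binary search, which answers the same "does this read overlap a bad region?" question. The function names `build_index` and `overlaps` are hypothetical, and coordinates are assumed to be half-open, BED-style.

```python
import bisect
from collections import defaultdict


def build_index(regions):
    """Build an in-memory lookup from (chrom, start, end) half-open intervals.

    Overlapping or touching regions on the same chromosome are merged so the
    stored intervals are sorted and disjoint.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Parallel lists keep the binary search simple.
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open span [start, end) hits any stored region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged interval whose start is strictly below the query end.
    i = bisect.bisect_left(starts, end) - 1
    return i >= 0 and ends[i] > start


if __name__ == "__main__":
    # Hypothetical regions, for illustration only.
    idx = build_index([("1", 100, 200), ("1", 150, 300), ("2", 10, 20)])
    print(overlaps(idx, "1", 250, 260))
    print(overlaps(idx, "1", 300, 310))
```

The index is built once per process, and each read then costs one binary search instead of one tabix query against the filesystem, which is where the difference would be expected to show up on a network filesystem such as Lustre.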
Hi Keiran, thank you for your quick response, I'm also testing on a Lustre file system.
We are not using a -badloci BED file; the flags we are using are these:
-process
-index
-cpus
-outdir
-reference
-tumour
-normal
-simrep
-filter
-genes
-unmatched
-assembly
-species
-exclude
Would you be able to point me to a badloci reference file for GRCh37? I guess I can get it from here: ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/CNV_SV_ref_GRCh37d5_brass6+.tar.gz.
To clarify, the reason I'm not experiencing the performance improvement is that I'm not using -badloci, is that correct?
Yes, without badloci you are potentially passing many reads to pindel that create noise in the dataset, or just increase run time.
The improvement made the assumption that people were using this option (GDC, ICGC, PanProstate projects).
You can find an example one in here:
ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/SNV_INDEL_ref_GRCh37d5-fragment.tar.gz
pindel/HiDepth.bed.gz
At one time you could get the data from the UCSC table browser, but we recommend building from real data where possible. Starting points can be found here
Thanks so much Keiran, I'll close this issue and try again using -badloci.
Hi Keiran,
I wanted to follow up here. I added -badloci, and although I saw improvements of ~30% in pindel and pin2vcf, I did not experience any performance improvement on input.
$ pindel.pl --version
Version: 3.3.0
PindelInput
Count | Time* | Memory
n | min med* ave max total | min med ave max total
- (v3.0.1) 2 | 3h16m37s 9h10m29s 6h13m33s 9h10m29s 12h27m7s | 4.1G 4.3G 4.2G 4.3G 8.4G
+ (v3.3.0) 2 | 4h0m9s 9h16m10s 6h38m9s 9h16m10s 13h16m19s | 4.3G 4.4G 4.3G 4.4G 8.7G
- (v3.3.0 + badloci) 2 | 3h48m38s 10h54m52s 7h21m45s 10h54m52s 14h43m31s | 3.6G 3.8G 3.7G 3.8G 7.4G
PindelPindel
Count | Time* | Memory
n | min med* ave max total | min med ave max total
- (v3.0.1) 25 | 8m18s 103m32s 108m44s 4h0m7s 1day21h18m20s | 862.902M 10.5G 11.0G 24.6G 274.4G
+ (v3.3.0) 25 | 9m51s 108m34s 118m43s 4h15m9s 2days1h28m9s | 925.6M 11.3G 11.6G 25.3G 289.1G
- (v3.3.0 + badloci) 25 | 2s 73m8s 77m37s 2h51m7s 1day8h20m28s | 65.5M 11.3G 11.5G 25.2G 286.7G
PindelPin2vcf
Count | Time* | Memory
n | min med* ave max total | min med ave max total
- (v3.0.1) 25 | 19m18s 52m56s 76m48s 8h2m30s 1day8h0m19s | 115.797M 126.871M 165.506M 473.523M 4.0M
+ (v3.3.0) 25 | 29m50s 88m34s 99m53s 8h19m44s 1day15h57m34s | 110.0M 120.9M 141.7M 314.3M 3.3G
- (v3.3.0 + badloci) 25 | 2s 53m59s 57m13s 2h3m44s 23h50m42s | 65.5M 114.7M 122.6M 229.8M 3.0G
My guess is that I might be missing a system library or something? I'll run one more test using this image, quay.io/wtsicgp/cgppindel:v3.3.0, but if you can think of something I might be missing, I'd appreciate any feedback.
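For anyone who wants to compare the rows above numerically, here is a small Python sketch that converts the duration strings used in these tables (for example `1day8h20m28s`) to seconds and computes the relative change between two runs. The helper names `to_seconds` and `percent_change` are hypothetical, and the parser only assumes the `day(s)/h/m/s` units that appear in the tables.

```python
import re

# Seconds per unit, covering only the units that appear in the tables.
_UNITS = {"days": 86400, "day": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)(days|day|h|m|s)")


def to_seconds(text):
    """Convert a duration such as '2days1h28m9s' or '73m8s' to seconds."""
    return sum(int(number) * _UNITS[unit] for number, unit in _TOKEN.findall(text))


def percent_change(before, after):
    """Relative change from `before` to `after` in percent (negative = faster)."""
    b = to_seconds(before)
    a = to_seconds(after)
    return (a - b) / b * 100.0


if __name__ == "__main__":
    # Total time of the pindel step: v3.3.0 vs v3.3.0 + badloci.
    print(round(percent_change("2days1h28m9s", "1day8h20m28s"), 1))
    # Median time of the input step: v3.3.0 vs v3.3.0 + badloci.
    print(round(percent_change("9h16m10s", "10h54m52s"), 1))
```

Applied to the pindel totals, this gives a reduction of roughly 35%, in line with the ~30% quoted above, while the input step shows no reduction.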
If you don't use -badloci on 3.0.1 you won't see the improvement, as it is specific to how that file was being accessed during filtering.
I would expect:
3.0.1 -> 3.0.1+badloci
3.0.1+badloci -> 3.3.0+badloci
The claims were made assuming you were using this feature.
Ah I see, so you are saying that I'm not seeing any performance improvement because I wasn't using badloci with 3.0.1. So basically v3.0.1 without badloci and v3.3.0+badloci will have roughly the same performance during input.
So the real 50% improvement came on v3.0.1+badloci -> 3.3.0+badloci.
Cool, I understand now. Thanks so much again!
Hi Keiran,
The 3.2.0 release describes a 50% performance improvement when running Pindel Input.
I did a comparison between cgpPindel v3.0.1 and v3.3.0 with a 180 coverage genome and here are the stats:
There is no change really in the runtime of the input step. I was wondering if I missed installing some system library?
And now that we are here, we may want to consider parallelizing pin2vcf more, any thoughts on this?