cancerit / cgpPindel

Cancer Genome Project Insertion/Deletion detection pipeline based around Pindel
http://cancerit.github.io/cgpPindel/
GNU Affero General Public License v3.0

No Pindel Input performance improvement after 3.3.0 upgrade #92

Closed jsmedmar closed 4 years ago

jsmedmar commented 4 years ago

Hi Keiran,

The 3.2.0 release describes a 50% performance improvement when running Pindel Input.

I did a comparison between cgpPindel v3.0.1 and v3.3.0 with a 180 coverage genome and here are the stats:

           PindelInput
              Count |                                               Time* |                                        Memory
                  n |       min     med*      ave      max          total |       min      med      ave      max    total
- (v3.0.1)        2 |  3h16m37s 9h10m29s 6h13m33s 9h10m29s       12h27m7s |      4.1G     4.3G     4.2G     4.3G     8.4G
+ (v3.3.0)        2 |    4h0m9s 9h16m10s  6h38m9s 9h16m10s      13h16m19s |      4.3G     4.4G     4.3G     4.4G     8.7G

           PindelPindel
              Count |                                               Time* |                                        Memory
                  n |       min     med*      ave      max          total |       min      med      ave      max    total
- (v3.0.1)       25 |     8m18s  103m32s  108m44s   4h0m7s  1day21h18m20s |  862.902M    10.5G    11.0G    24.6G   274.4G
+ (v3.3.0)       25 |     9m51s  108m34s  118m43s  4h15m9s   2days1h28m9s |    925.6M    11.3G    11.6G    25.3G   289.1G

           PindelPin2vcf
              Count |                                               Time* |                                        Memory
                  n |       min     med*      ave      max          total |       min      med      ave      max    total
- (v3.0.1)       25 |    19m18s   52m56s   76m48s  8h2m30s    1day8h0m19s |  115.797M 126.871M 165.506M 473.523M     4.0M
+ (v3.3.0)       25 |    29m50s   88m34s   99m53s 8h19m44s  1day15h57m34s |    110.0M   120.9M   141.7M   314.3M     3.3G

There is no change really in the runtime of the input step. I was wondering if I missed installing some system library?

And now that we are here, we may want to consider parallelizing more pin2vcf, any thoughts on this?


➜ pindel.pl --version
Version: 3.3.0
➜  pindel.pl --version
Version: 3.0.1

keiranmraine commented 4 years ago

Are you using a high-depth/badloci bed file (-badloci)?

Prior to v3.3.0, every single read was compared against the file by calling the Perl tabix module, which would then do a disk read:

https://github.com/cancerit/cgpPindel/blob/3a1e1b7b8bdff0ae04b945308f655365dafe01ac/perl/lib/Sanger/CGP/Pindel/InputGen.pm#L239

Under 3.2.0 we changed to loading the regions into an in-memory interval tree; benchmark details are in the PR:

https://github.com/cancerit/cgpPindel/pull/75

Obviously there is some variation depending on where you are reading from, in our case a Lustre filesystem. I would expect the gains to be smaller on SSD/NVMe.
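
To illustrate why moving the bad-loci check into memory helps, here is a minimal Python sketch of the general idea. It is not the cgpPindel implementation (which is Perl and uses an interval tree); as a simpler stand-in it merges the BED intervals per chromosome and answers each query with a binary search. The function names `build_index` and `in_bad_locus` are hypothetical, and coordinates are assumed to be 0-based, half-open as in BED.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(bed_records):
    """Build a per-chromosome lookup from (chrom, start, end) BED records.

    Overlapping or touching intervals are merged so that each chromosome
    ends up with sorted, non-overlapping spans.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_records:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        merged = []
        for start, end in spans:
            if merged and start <= merged[-1][1]:
                # Extends (or is contained in) the previous span.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Parallel lists: starts for bisecting, ends for the bounds check.
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_bad_locus(index, chrom, pos):
    """Return True if pos lies inside any [start, end) span on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Rightmost span whose start is <= pos; spans do not overlap,
    # so it is the only one that can contain pos.
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]


# The index is built once, then every read is checked without touching disk.
index = build_index([("1", 100, 200), ("1", 150, 300), ("2", 10, 20)])
print(in_bad_locus(index, "1", 250))
print(in_bad_locus(index, "1", 300))
```

The point of the design is that the file is read once up front, so the per-read cost is a lookup in memory rather than a query against the indexed file on a shared filesystem.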

jsmedmar commented 4 years ago

Hi Keiran, thank you for your quick response, I'm also testing on a Lustre file system.

We are not using a -badloci BED file, the flags we are using are these:

Would you be able to point me to a badloci reference file for GRCh37? Guess I can get it from here: ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/CNV_SV_ref_GRCh37d5_brass6+.tar.gz.

jsmedmar commented 4 years ago

To clarify, the reason I'm not experiencing the performance improvement is because I'm not using -badloci, is that correct?

keiranmraine commented 4 years ago

Yes, without badloci you are potentially passing many reads to pindel that create noise in the dataset, or just increase run time.

The improvement made the assumption that people were using this option (GDC, ICGC, PanProstate projects).

You can find an example one in here:

ftp://ftp.sanger.ac.uk/pub/cancer/dockstore/human/SNV_INDEL_ref_GRCh37d5-fragment.tar.gz

pindel/HiDepth.bed.gz

At one time you could get the data from the UCSC table browser, but we recommend building from real data where possible. Starting points can be found here

jsmedmar commented 4 years ago

Thanks so much Keiran, I'll close this issue and try again using -badloci.

jsmedmar commented 4 years ago

Hi Keiran,

I wanted to follow up here. I added -badloci, and although I saw improvements of ~30% in pindel and pin2vcf, I did not experience any performance improvement on input.

$ pindel.pl --version
Version: 3.3.0
                      PindelInput
                         Count |                                                 Time* |                                        Memory
                             n |       min      med*      ave       max          total |       min      med      ave      max    total
- (v3.0.1)                   2 |  3h16m37s  9h10m29s 6h13m33s  9h10m29s       12h27m7s |      4.1G     4.3G     4.2G     4.3G     8.4G
+ (v3.3.0)                   2 |    4h0m9s  9h16m10s  6h38m9s  9h16m10s      13h16m19s |      4.3G     4.4G     4.3G     4.4G     8.7G
- (v3.3.0 + badloci)         2 |  3h48m38s 10h54m52s 7h21m45s 10h54m52s      14h43m31s |      3.6G     3.8G     3.7G     3.8G     7.4G

                      PindelPindel
                         Count |                                                 Time* |                                        Memory
                             n |       min      med*      ave       max          total |       min      med      ave      max    total
- (v3.0.1)                  25 |     8m18s   103m32s  108m44s    4h0m7s  1day21h18m20s |  862.902M    10.5G    11.0G    24.6G   274.4G
+ (v3.3.0)                  25 |     9m51s   108m34s  118m43s   4h15m9s   2days1h28m9s |    925.6M    11.3G    11.6G    25.3G   289.1G
- (v3.3.0 + badloci)        25 |        2s     73m8s   77m37s   2h51m7s   1day8h20m28s |     65.5M    11.3G    11.5G    25.2G   286.7G

                      PindelPin2vcf
                         Count |                                                 Time* |                                        Memory
                             n |       min      med*      ave       max          total |       min      med      ave      max    total
- (v3.0.1)                  25 |    19m18s    52m56s   76m48s   8h2m30s    1day8h0m19s |  115.797M 126.871M 165.506M 473.523M     4.0M
+ (v3.3.0)                  25 |    29m50s    88m34s   99m53s  8h19m44s  1day15h57m34s |    110.0M   120.9M   141.7M   314.3M     3.3G
- (v3.3.0 + badloci)        25 |        2s    53m59s   57m13s   2h3m44s      23h50m42s |     65.5M   114.7M   122.6M   229.8M     3.0G
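
The total-time columns above can be compared numerically with a small helper. This is a minimal Python sketch that assumes the duration strings always follow the `NdayNhNmNs` pattern seen in these tables; `to_seconds` and `percent_change` are hypothetical helper names, not part of cgpPindel or the reporting tool.

```python
import re

# Seconds per unit for the duration tokens that appear in the tables.
_UNIT_SECONDS = {"days": 86400, "day": 86400, "h": 3600, "m": 60, "s": 1}
# "days" is listed before "day" so the longer unit name is matched first.
_TOKEN = re.compile(r"(\d+)(days|day|h|m|s)")


def to_seconds(text):
    """Convert a string such as '1day21h18m20s' or '103m32s' to seconds."""
    return sum(int(value) * _UNIT_SECONDS[unit]
               for value, unit in _TOKEN.findall(text))


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = faster)."""
    b, a = to_seconds(before), to_seconds(after)
    return 100.0 * (a - b) / b


# Total time, v3.0.1 without badloci versus v3.3.0 with badloci.
print(round(percent_change("12h27m7s", "14h43m31s"), 1))         # PindelInput:   about +18.3
print(round(percent_change("1day21h18m20s", "1day8h20m28s"), 1))  # PindelPindel:  about -28.6
print(round(percent_change("1day8h0m19s", "23h50m42s"), 1))       # PindelPin2vcf: about -25.5
```

On these totals the pindel and pin2vcf steps are roughly a quarter to a third faster with -badloci, while the input step is not faster.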

My guess is that I might be missing a system library or something? I'll run one more test using this image quay.io/wtsicgp/cgppindel:v3.3.0, but if you can think of something I might be missing, I'd appreciate any feedback.

keiranmraine commented 4 years ago

If you don't use -badloci on 3.0.1 you won't see the improvement as it is specific to how that file was being accessed during filtering.

I would expect:

The claims were made assuming you were using this feature.

jsmedmar commented 4 years ago

Ah I see, so you are saying that I'm not seeing any performance improvement because I wasn't using badloci with 3.0.1. So basically v3.0.1 without badloci and v3.3.0+badloci will have relatively the same performance during input.

So the real 50% improvement is between v3.0.1+badloci and v3.3.0+badloci.

Cool, I understand now. Thanks so much again!