UCSF-Costello-Lab / LG3_Pipeline

The original LG3 pipeline
https://github.com/UCSF-Costello-Lab/LG3_Pipeline

SOFTWARE: Identify software tools for TIPCC-to-C4 migration #146

Closed: HenrikBengtsson closed this issue 2 years ago

HenrikBengtsson commented 3 years ago

@ivan108, regarding migrating this pipeline to C4, could you list the software tools and the versions you're using right now here? Then I'll start installing them as CBI software modules.

HenrikBengtsson commented 3 years ago

I found the below. I guess you don't need all those GATK versions(?)

[henrik@cclc01 ~]$ ls -l /home/jocostello/shared/LG3_Pipeline_HIDE/tools
total 471876
drwxr-xr-x  3 jocostello songlab          4096 Feb  7  2012 bwa-0.5.10
drwxr-xr-x  8 jocostello costellolab      4096 Dec 16 11:07 FastQC.v0.11.9
drwxr-xr-x  4 jocostello costellolab       310 Dec 17  2018 gatk-4.0.12.0
drwxr-xr-x  4 jocostello costellolab       308 Jan 29  2019 gatk-4.1.0.0
drwxr-xr-x  5 jocostello costellolab       321 Apr 11  2019 gatk-4.1.1.0
drwxr-xr-x  4 jocostello costellolab       308 Apr 23  2019 gatk-4.1.2.0
drwxr-xr-x  4 jocostello costellolab       308 Nov 11  2019 gatk-4.1.4.0
drwxr-xr-x  4 jocostello costellolab       308 Nov 27  2019 gatk-4.1.4.1
drwxr-xr-x  4 jocostello costellolab       308 Mar  1  2020 gatk-4.1.5.0
drwxr-xr-x  4 jocostello costellolab       308 Mar 25  2020 gatk-4.1.6.0
drwxr-xr-x  4 jocostello costellolab       321 May 26  2020 gatk-4.1.7.0
drwxr-xr-x  4 jocostello costellolab       308 Jul 20  2020 gatk-4.1.8.1
drwxr-xr-x  4 jocostello costellolab       308 Nov  7  2020 gatk-4.1.9.0
-rw-r--r--  1 jocostello costellolab 454612009 Oct  9  2020 gatk-4.1.9.0.zip
drwxr-xr-x  3 jocostello songlab            96 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6
drwxr-xr-x  3 jocostello songlab            96 May 16  2012 GenomeAnalysisTK-1.6-5-g557da77
drwxr-xr-x  3 jocostello songlab            33 Sep 22  2011 java
-rwxrwxr-x  1 jocostello costellolab        83 Nov  8  2012 LICENSE.TXT
-rw-r--r--  1 jocostello costellolab      7833 May 26  2020 muTect-1.0.27783.help
-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar
-rw-r--r--  1 jocostello costellolab   9684017 Feb  8  2013 muTect-1.1.4-bin.zip
-rw-rw-r--  1 jocostello costellolab  10438338 Nov  8  2012 muTect-1.1.4.jar
drwxr-xr-x 12 jocostello costellolab      4096 May 21 18:54 picard
drwxr-xr-x  2 jocostello songlab          4096 Mar 12  2012 picard-tools-1.64
drwxr-xr-x  3 jocostello costellolab       234 Feb 12 17:37 pindel024t
drwxr-xr-x  6 jocostello costellolab      4096 May 29  2012 samtools-0.1.12a
drwxr-xr-x  6 jocostello songlab          4096 Mar 15  2012 samtools-0.1.18
-rwxr-x---  1 jocostello costellolab     80522 Sep 28  2020 snp-pileup
drwxr-xr-x  4 henrik     costellolab       158 Sep 17  2018 TrimGalore-0.4.4
drwxr-xr-x  4 jocostello costellolab       158 Sep  4  2020 TrimGalore-0.6.6
-rw-rw-r--  1 jocostello costellolab        54 Nov  8  2012 version.txt

ivan108 commented 3 years ago

I deleted older versions of GATK4, thanks!

HenrikBengtsson commented 3 years ago

Got it.

My notes: TrimGalore requires Cutadapt, which apparently was installed centrally on TIPCC:

[henrik@cclc01 ~]$ which cutadapt 
/opt/Python/Python-2.7.9/bin/cutadapt
[henrik@cclc01 ~]$ cutadapt --version
1.8.1

So, that's the version that needs to be installed on C4 for full backward compatibility. I've now added CBI module cutadapt/1.8.1 in addition to cutadapt/3.4 on C4.
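
A minimal sketch of how such a version pin could be checked before running the pipeline. The helper name `require_version` is hypothetical and not part of the pipeline; it only assumes that the tool prints its bare version on the first line, as `cutadapt --version` does above.

```shell
#!/usr/bin/env bash
# Sketch: fail early if a tool's reported version differs from the pinned one.
# `require_version` is a hypothetical helper, not part of the pipeline.
require_version() {
  local expected="$1"; shift
  local actual
  # Run the given command (e.g. `cutadapt --version`) and keep its first line.
  actual="$("$@" 2>&1 | head -n 1)"
  if [ "$actual" != "$expected" ]; then
    echo "version mismatch for '$*': expected ${expected}, got ${actual}" >&2
    return 1
  fi
}

# Intended use on C4 (assumes the CBI module has been loaded first):
#   module load CBI cutadapt/1.8.1
#   require_version 1.8.1 cutadapt --version
```

An exact string comparison is used on purpose: for backward compatibility we want the same version, not merely a newer one.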

HenrikBengtsson commented 2 years ago

All but the following software versions are now available as CBI modules on C4:

drwxr-xr-x  3 jocostello songlab            96 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6
drwxr-xr-x  3 jocostello songlab            96 May 16  2012 GenomeAnalysisTK-1.6-5-g557da77
-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar
-rw-rw-r--  1 jocostello costellolab  10438338 Nov  8  2012 muTect-1.1.4.jar
drwxr-xr-x  3 jocostello costellolab       234 Feb 12 17:37 pindel024t
drwxr-xr-x  6 jocostello costellolab      4096 May 29  2012 samtools-0.1.12a
drwxr-xr-x  6 jocostello songlab          4096 Mar 15  2012 samtools-0.1.18

HenrikBengtsson commented 2 years ago

Managed to get the legacy versions of samtools installed on C4;

$ module avail samtools

----------------------------------- /software/c4/cbi/modulefiles ------------------------------------
   samtools/0.1.12a (L)    samtools/1.10    samtools/1.12
   samtools/0.1.18         samtools/1.11    samtools/1.13 (D)

Remaining software is now:

drwxr-xr-x  3 jocostello songlab            96 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6
drwxr-xr-x  3 jocostello songlab            96 May 16  2012 GenomeAnalysisTK-1.6-5-g557da77
-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar
-rw-rw-r--  1 jocostello costellolab  10438338 Nov  8  2012 muTect-1.1.4.jar
drwxr-xr-x  3 jocostello costellolab       234 Feb 12 17:37 pindel024t

HenrikBengtsson commented 2 years ago

I've managed to install muTect 1.1.1 and 1.1.4, cf. module load CBI; module avail mutect. Still can't find an official source for 1.0.27783 though. Remaining software is now:

drwxr-xr-x  3 jocostello songlab            96 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6
drwxr-xr-x  3 jocostello songlab            96 May 16  2012 GenomeAnalysisTK-1.6-5-g557da77
-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar
drwxr-xr-x  3 jocostello costellolab       234 Feb 12 17:37 pindel024t

HenrikBengtsson commented 2 years ago

Woohoo, through some forensic internet searching using https://web.archive.org/, I managed to track down a Broad FTP server (ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/) that hosts all legacy versions of GATK (1.0-2.3.9), including the above two versions:

$ curl ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/ | grep -E "GenomeAnalysisTK-(1.5.12|1.6-5)"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14480    0 14480    0     0  10379      0 --:--:--  0:00:01 --:--:-- 10372
-rw-r--r--   1 gsa-engineering wga      18104172 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6.tar.bz2
-rw-r--r--   1 gsa-engineering wga      18502494 May  3  2012 GenomeAnalysisTK-1.6-5-g557da77.tar.bz2
100 27864    0 27864    0     0  18011      0 --:--:--  0:00:01 --:--:-- 18000
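
Two side notes on the command above. The unescaped dots in the pattern match any character, which is why `1.5.12` matched `1.5-12`, and the progress-meter lines appear because curl writes them to stderr. A slightly stricter sketch follows; the helper name `find_gatk_tarballs` is hypothetical, and it assumes the file name is the last field of each listing line, as in the output above.

```shell
#!/usr/bin/env bash
# Sketch: pick specific GATK tarballs out of an FTP directory listing on stdin.
# `find_gatk_tarballs` is a hypothetical helper name.
find_gatk_tarballs() {
  # Dots are escaped so they match a literal "." only; the last
  # whitespace-separated field of each listing line is the file name.
  grep -E 'GenomeAnalysisTK-(1\.5-12|1\.6-5)-g[0-9a-f]+\.tar\.bz2$' | awk '{ print $NF }'
}

# Intended use (same server as above; -s silences curl's progress meter):
#   curl -s ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/ | find_gatk_tarballs
```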

HenrikBengtsson commented 2 years ago

After hours and hours, I finally managed to install pindel 0.2.4t on both TIPCC and C4 under the CBI software stack, i.e. module load CBI pindel/0.2.4t.

This leaves us with:

drwxr-xr-x  3 jocostello songlab            96 Mar 19  2012 GenomeAnalysisTK-1.5-12-gd0056d6
drwxr-xr-x  3 jocostello songlab            96 May 16  2012 GenomeAnalysisTK-1.6-5-g557da77
-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar

HenrikBengtsson commented 2 years ago

And, I've managed to install GATK 1.6-5 as a module on both TIPCC and C4, i.e. module load CBI gatk/1.6-5-g557da77.

Turns out we're not using GATK 1.5-12-gd0056d6 anywhere, so that leaves us with only:

-rwxr-xr-x  1 jocostello songlab       8322312 May 17  2012 muTect-1.0.27783.jar

HenrikBengtsson commented 2 years ago

We need ANNOVAR too and it's complicated. It requires online registration to access/download, and you only get the latest version. Argh. So much for reproducible science.

I think the version we are using is referred to as AnnoVar 2011-10-02;

]$ ${LG3_HOME}/tools/AnnoVar/annotate_variation.pl --help | grep Version
     Version: $LastChangedDate: 2011-10-02 22:13:18 -0700 (Sun, 02 Oct 2011) $
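
A minimal sketch for pulling just the date out of that version line, e.g. to name a module after it. The helper name `annovar_version_date` is hypothetical, and it assumes the `$LastChangedDate: YYYY-MM-DD ...$` format shown above.

```shell
#!/usr/bin/env bash
# Sketch: extract the YYYY-MM-DD date from ANNOVAR's "$LastChangedDate: ... $"
# version line, read from stdin. `annovar_version_date` is a hypothetical helper.
annovar_version_date() {
  sed -n -E 's/.*\$LastChangedDate: ([0-9]{4}-[0-9]{2}-[0-9]{2}) .*/\1/p' | head -n 1
}

# Intended use (same script as above):
#   "${LG3_HOME}/tools/AnnoVar/annotate_variation.pl" --help 2>&1 | annovar_version_date
```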

HenrikBengtsson commented 2 years ago

Since muTect is plain Java and ANNOVAR is plain Perl, we might be able to just copy them over from TIPCC to C4 as-is; not a pretty solution but that might be the only solution.
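
If we do copy files as-is, a checksum manifest would at least let us confirm that the copies on C4 are byte-identical to the originals on TIPCC. A minimal sketch, assuming GNU coreutils `sha256sum` is available on both systems; the helper names `write_manifest` and `verify_manifest` and the manifest file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify copied files against checksums recorded on the source system.
# `write_manifest` and `verify_manifest` are hypothetical helper names.

# On the source system: print "<sha256>  <file>" lines for the given files.
write_manifest() {
  sha256sum "$@"
}

# On the target system, from the directory holding the copies:
# returns non-zero if any file is missing or differs from the manifest.
verify_manifest() {
  sha256sum --check --quiet "$1"
}

# Intended use:
#   (TIPCC) write_manifest muTect-1.0.27783.jar > MANIFEST.sha256
#   (C4)    verify_manifest MANIFEST.sha256
```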

HenrikBengtsson commented 2 years ago

Posted "Download muTect-1.0.27783.jar?" to the GATK forum.

HenrikBengtsson commented 2 years ago

Ah, so it turns out that our existing muTect-1.0.27783.jar on TIPCC presents itself as GATK 1.1-37-g5cedb2d;

[henrik@cclc01 ~/repositories/UCSF-CostelloLab/test-next-release]$ /home/jocostello/shared/LG3_Pipeline_HIDE/tools/muTect-1.0.27783.jar --help | head -6
---------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v1.1-37-g5cedb2d, Compiled 2011/09/14 10:01:32
Copyright (c) 2010 The Broad Institute
Please view our documentation at http://www.broadinstitute.org/gsa/wiki
For support, please view our support site at http://getsatisfaction.com/gsa
---------------------------------------------------------------------------------

That's interesting. So, I went to install GATK 1.1-37 from ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK;

[henrik@cclc01 ~]$ module load gatk/1.1-37-ge63d9d8
[henrik@cclc01 ~]$ java -jar ${GATK_HOME}/GenomeAnalysisTK.jar --help | head -6
---------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v1.1-37-ge63d9d8, Compiled 2011/09/13 01:15:42
Copyright (c) 2010 The Broad Institute
Please view our documentation at http://www.broadinstitute.org/gsa/wiki
For support, please view our support site at http://getsatisfaction.com/gsa
---------------------------------------------------------------------------------

It turns out to have a different compile date (2011-09-13 rather than 2011-09-14) and a different hash code (ge63d9d8 rather than g5cedb2d), so it is certainly not identical, but hopefully good enough for our migration needs.

I've installed this on both TIPCC and C4.
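
To make this kind of banner comparison repeatable, here is a minimal sketch that reduces a GATK 1.x help banner to its version and compile date. The helper name `gatk_banner_id` is hypothetical, and it assumes the banner format shown above. Matching banners would not by themselves prove that two jars are identical.

```shell
#!/usr/bin/env bash
# Sketch: extract "<version> <compile date>" from a GATK 1.x help banner on stdin.
# `gatk_banner_id` is a hypothetical helper name.
gatk_banner_id() {
  sed -n -E 's|^The Genome Analysis Toolkit \(GATK\) v([^,]+), Compiled ([0-9/]+) .*$|\1 \2|p' | head -n 1
}

# Intended use (paths are placeholders):
#   a="$(/path/to/muTect-1.0.27783.jar --help 2>&1 | gatk_banner_id)"
#   b="$(java -jar "${GATK_HOME}/GenomeAnalysisTK.jar" --help 2>&1 | gatk_banner_id)"
#   [ "$a" = "$b" ] || echo "banners differ: '$a' vs '$b'"
```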

HenrikBengtsson commented 2 years ago

I've created annovar-2011-10-02.tar.gz from TIPCC:/home/jocostello/shared/LG3_Pipeline_HIDE/AnnoVar/ and installed it as modules on TIPCC and C4. I've also installed /home/jocostello/shared/LG3_Pipeline_HIDE/Annovar_2015Jun17/annovar.latest.tar.gz, and the latest official version (which is the only one you can download after registration yadayadayada). So, now we have:

$ module avail annovar

------------------------------------------------- /home/shared/cbc/apps/modulefiles/CBC --------------------------------------------------
   annovar/2011-10-02    annovar/2015-06-17    annovar/2020-06-07 (L,D)

This was a hack, but I think that covers all the software tools needed by the pipeline.

I'll next try to run through the pipeline using the software tools available from the CBI module stack. If all works well, we should be able to scratch most of ${LG3_HOME}/tools/. Closing this issue.
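
Before such a run-through, a quick sanity check that the expected commands actually resolve on the PATH could save some time. A minimal sketch; the helper name `missing_tools` is hypothetical and the tool names in the usage line are only examples.

```shell
#!/usr/bin/env bash
# Sketch: list which of the given commands cannot be found on the PATH.
# `missing_tools` is a hypothetical helper, not part of the pipeline.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  # Non-zero if at least one command was not found.
  return "$status"
}

# Intended use, after loading the relevant CBI modules (tool names are examples):
#   missing_tools samtools bwa cutadapt pindel java
```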

HenrikBengtsson commented 2 years ago

Argh... I might have been too quick about muTect-1.0.27783.jar (https://github.com/UCSF-Costello-Lab/LG3_Pipeline/issues/146#issuecomment-938183908). Although it presents itself as GATK, it's not GATK :(

HenrikBengtsson commented 2 years ago

Copied muTect-1.0.27783.jar from TIPCC:/home/jocostello/shared/LG3_Pipeline_HIDE/tools and installed it as a module on TIPCC and C4, i.e. module load mutect/1.0.27783. Good enough for now; hopefully the Broad/GATK folks will tell us where we can get the official version.