SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
88 stars 29 forks source link

Error after MCL clustering step #74

Closed alexweisberg closed 2 years ago

alexweisberg commented 2 years ago

Hi, I've set up an initial run with ~800 genomes, using the diamond mode to speed things up. I used the following command:

/home/weisberga/Software/PIRATE-1.0.4/bin/PIRATE -i input/ -o /data/weisbeal/pirateout/ -t 32 --pan-opt '--diamond'

and it produced the output:


-------------------------------

PIRATE input options:

 - Input Directory = /foldercontainingtherun/pirate/input
 - Output directory = /data/weisbeal/pirateout
 - PIRATE will run using 32 cores
 - 879 files in input directory.
 - PIRATE will be run on 50,60,70,80,90,95,98 amino acid % identity thresholds.
 - PIRATE will be run on features annotated as CDS
 - Pangenome contruction will use diamond instead of BLAST

-------------------------------

Standardising and checking input files:

 - 874 gff files passed QC and will be analysed by PIRATE - completed in: 145s

-------------------------------

 - creating co-ordinate files - completed in: 22s
 - creating genome loci list: - completed in: 58s

-------------------------------

Extracting pangenome sequences:

 - completed in: 185s

-------------------------------

Constructing pangenome sequences:

Options:

 - Creating pangenome on amino acid % identity using DIAMOND.
 - Input directory: /data/weisbeal/pirateout
 - Output directory:    /data/weisbeal/pirateout/pangenome_iterations
 - Number of input files: 1
 - Threshold(s): 50 60 70 80 90 95 98
 - MCL inflation value: 1.5
 - Homology test cutoff: 0.001
 - Loci file contains 5117774 loci from 874 genomes.
 - Extracting core loci during cdhit clustering
 - Opening pan_sequences
 - /data/weisbeal/pirateout/pan_sequences.fasta contains 4950431 sequences.
 - Passing 4950432 loci to cd-hit at 100%
 - command: "cd-hit -i /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.temp.fasta -o /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.100 -aS 0.9 -c 1 -T 32 -g 1 -n 5 -M 8413 -d 256 >> /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.cdhit_log.txt"
 - Passing 4950432 loci to cd-hit at 99.5%
 - command: "cd-hit -i /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.temp.fasta -o /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.99.5 -aS 0.9 -c 0.995 -T 32 -g 1 -n 5 -M 8413 -d 256 >> /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.cdhit_log.txt"
 - Passing 4950432 loci to cd-hit at 99%
 - command: "cd-hit -i /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.temp.fasta -o /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.99 -aS 0.9 -c 0.99 -T 32 -g 1 -n 5 -M 8413 -d 256 >> /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.cdhit_log.txt"
 - Passing 4950432 loci to cd-hit at 98.5%
 - command: "cd-hit -i /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.temp.fasta -o /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.98.5 -aS 0.9 -c 0.985 -T 32 -g 1 -n 5 -M 8413 -d 256 >> /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.cdhit_log.txt"
 - Passing 4950432 loci to cd-hit at 98%
 - command: "cd-hit -i /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.temp.fasta -o /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.98 -aS 0.9 -c 0.98 -T 32 -g 1 -n 5 -M 8413 -d 256 >> /data/weisbeal/pirateout/pangenome_iterations/pan_sequences.cdhit_log.txt"
 - completed in 9709 secs

 - 0 core loci (0%)
 - 4950432 non-core loci (100%)

 - 1374034 representative loci passed to blast.

 - running all-vs-all DIAMOND on pan_sequences
 - completed in 10490 secs

 - running mcl on pan_sequences at 50
 - 56403 clusters at 50 % - completed in 20370 secs
 - running mcl on pan_sequences at 60
 - 85382 clusters at 60 % - completed in 6469 secs
 - running mcl on pan_sequences at 70
 - 123880 clusters at 70 % - completed in 5543 secs
 - running mcl on pan_sequences at 80
 - 191027 clusters at 80 % - completed in 4210 secs
 - running mcl on pan_sequences at 90
 - 339128 clusters at 90 % - completed in 3275 secs
 - running mcl on pan_sequences at 95
 - 513638 clusters at 95 % - completed in 3942 secs
 - running mcl on pan_sequences at 98
 - 804456 clusters at 98 % - completed in 5130 secs

 - reinflating clusters for pan_sequences

The run appears to proceed much of the way through up until the pangenome reinflation step. I get the following error message in the fail_test.txt file:

Reinflated sequences (3539249) does not match input number of sequences (4950432) at 50 threshold in sample pan_sequences.

If you would like, I can send some of the output files or the input genomes if that would help. Thanks!

SionBayliss commented 2 years ago

Hi Alex,

That looks like an issue with missing sequences which can sometimes happen when erroneous characters/headers make it through the gene filtering step. You will notice some files didn't pass QC so something might be up with the other files. Did you annotate them with prokka?

I am happy to have a look at a subset of files if we can't find a solution.

All the best, Sion

alexweisberg commented 2 years ago

Hi Sion,

Thanks for looking into it. Roughly half of the files were annotated by prokka (we sequenced them), and the other half were converted from NCBI gbk files.

I tried a few small subsets run with a few of our prokka-annotated genomes and a few NCBI genomes, and they completed successfully. So there may be a small subset of genomes that are having some kind of specific issue.

I found one of the locus tags that was missing from the expanded pangenome mcl file (A4_00008 from input file A4.gff). Here is what I found when I searched for it in the input file and the modified gff file: pirate_duplicate_id The locus tag is somehow included in the next gene region in the modified_gff file version. When I include this gff file in a small run of only 5 genomes, it runs to completion correctly though, and the modified_gff file has this weird ID too.

I think there may be an issue due to the large size of the dataset and parallel threads on our cluster. I will try removing genomes until it runs to completion.

Best, Alex

SionBayliss commented 2 years ago

Hi Alex,

I don't expect it is a problem with the cluster (but I might be wrong). I expect that this is an issue with a subset of files from the NCBI that have really weird/erroneous annotation. This can happen sometimes as they have not been annotated consistently or using the same pipelines. You might want to reannotate the NCBI files using prokka and see if PIRATE completes.

those new fields in the GFF are created by PIRATE. PIRATE tries to standardise locus tags in order to avoid issues with annotation. One of the early scripts in the pipeline renames all locustag/ID to the "name of the genome""ascending number of the CDS in the file" (e.g. the first CDS is called genomename_0001). The old locus tag/ID is moved to the prev_ID/prev_locustag field in the modified GFF file present in the PIRATE folder. By default it only considers CDS features. This isn't a terribly elegant way to fix the issue but I was encountering many issues similar to yours with inconsistent annotation impacting on the outputs.

I hope that helps.

All the best, Sion

alexweisberg commented 2 years ago

Hi,

I re-annotated my genomes with prokka, and I was able to run an analysis of >1000 genomes successfully with 32 CPU cores in 16 hours. Thanks for helping me get it set up!

I noticed in the manual that the section on converting the output to a binary format ("Convert to binary presence-absence or count") likely has a typo. The command referring to generating a paralog presence/absence table should probably refer to the "paralogs_to_Rtab.pl" script rather than the "PIRATE_to_Rtab.pl" script. Currently both commands are identical.

Thanks, Alex

SionBayliss commented 2 years ago

Hi Alex,

I am glad I could help.

All the best, Sion