Closed by alexweisberg 2 years ago
Hi Alex,
That looks like an issue with missing sequences, which can sometimes happen when erroneous characters/headers make it through the gene filtering step. You will notice some files didn't pass QC, so something might be up with the other files. Did you annotate them with prokka?
I am happy to have a look at a subset of files if we can't find a solution.
All the best, Sion
Hi Sion,
Thanks for looking into it. Roughly half of the files were annotated by prokka (we sequenced them), and the other half were converted from NCBI gbk files.
I tried a few small subset runs, each with a few of our prokka-annotated genomes and a few NCBI genomes, and they completed successfully. So there may be a small subset of genomes that are having some kind of specific issue.
I found one of the locus tags that was missing from the expanded pangenome mcl file (A4_00008 from input file A4.gff). Here is what I found when I searched for it in the input file and the modified gff file:
The locus tag is somehow included in the next gene region in the modified_gff file version. When I include this gff file in a small run of only 5 genomes, it runs to completion correctly though, and the modified_gff file has this weird ID too.
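A lookup like the one described above can be scripted so that it also reports which attribute a tag ends up in. This is only a minimal sketch: `find_tag` is a hypothetical helper, not part of PIRATE, and it assumes standard nine-column, tab-separated GFF3 lines with `key=value` attributes separated by semicolons.

```python
def find_tag(gff_lines, tag):
    """Return (line number, feature type, attribute key) for every
    attribute whose value is exactly `tag`."""
    hits = []
    for lineno, raw in enumerate(gff_lines, start=1):
        line = raw.rstrip("\n")
        fields = line.split("\t")
        # Skip comments, directives and anything that is not a 9-column feature line
        # (this also skips an embedded ##FASTA section).
        if line.startswith("#") or len(fields) != 9:
            continue
        for pair in fields[8].split(";"):
            key, _, value = pair.partition("=")
            if value == tag:
                hits.append((lineno, fields[2], key))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python find_tag.py A4.gff A4_00008
    path, wanted = sys.argv[1], sys.argv[2]
    with open(path) as handle:
        for lineno, feature, key in find_tag(handle, wanted):
            print(f"{path}:{lineno}\t{feature}\t{key}")
```

Running it on both the input GFF and the modified GFF shows whether the tag is still an `ID`/`locus_tag` or has been moved into another attribute.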
I think there may be an issue due to the large size of the dataset and parallel threads on our cluster. I will try removing genomes until it runs to completion.
Best, Alex
Hi Alex,
I don't expect it is a problem with the cluster (but I might be wrong). I expect that this is an issue with a subset of files from the NCBI that have really weird/erroneous annotation. This can happen sometimes as they have not been annotated consistently or using the same pipelines. You might want to reannotate the NCBI files using prokka and see if PIRATE completes.
Those new fields in the GFF are created by PIRATE. PIRATE tries to standardise locus tags in order to avoid issues with annotation. One of the early scripts in the pipeline renames every locus tag/ID to "name of the genome"_"ascending number of the CDS in the file" (e.g. the first CDS is called genomename_0001). The old locus tag/ID is moved to the prev_ID/prev_locustag field in the modified GFF file present in the PIRATE folder. By default it only considers CDS features. This isn't a terribly elegant way to fix the issue, but I was encountering many issues similar to yours, with inconsistent annotation affecting the outputs.
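To make the renaming scheme concrete, here is a rough Python sketch of the idea. It is not PIRATE's actual script (PIRATE has its own implementation): the function name `standardise_cds_ids`, the `locus_tag` attribute key, and the zero-padding `width` are assumptions for illustration. The default width of 4 follows the genomename_0001 example; the real padding may differ.

```python
def standardise_cds_ids(gff_lines, genome_name, width=4):
    """Rename ID/locus_tag of each CDS to <genome_name>_<counter> and keep
    the old values in prev_ID/prev_locustag. Other lines pass through unchanged."""
    renamed = []
    counter = 0
    in_fasta = False
    for raw in gff_lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            in_fasta = True  # everything after this directive is sequence data
        fields = line.split("\t")
        is_cds = len(fields) == 9 and fields[2] == "CDS"
        if in_fasta or line.startswith("#") or not is_cds:
            renamed.append(line)
            continue
        counter += 1
        new_id = f"{genome_name}_{counter:0{width}d}"
        attrs = []
        for pair in fields[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            if key == "ID":
                attrs += [f"ID={new_id}", f"prev_ID={value}"]
            elif key == "locus_tag":  # assumed attribute name
                attrs += [f"locus_tag={new_id}", f"prev_locustag={value}"]
            else:
                attrs.append(pair)
        fields[8] = ";".join(attrs)
        renamed.append("\t".join(fields))
    return renamed
```

Because only CDS features are counted, the new number reflects the order of CDS lines in the file, not the numbering used by the original annotation.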
I hope that helps.
All the best, Sion
Hi,
I re-annotated my genomes with prokka, and I was able to run an analysis of >1000 genomes successfully with 32 CPU cores in 16 hours. Thanks for helping me get it set up!
I noticed in the manual that the section on converting the output to a binary format ("Convert to binary presence-absence or count") likely has a typo. The command referring to generating a paralog presence/absence table should probably refer to the "paralogs_to_Rtab.pl" script rather than the "PIRATE_to_Rtab.pl" script. Currently both commands are identical.
Thanks, Alex
Hi Alex,
I am glad I could help.
All the best, Sion
Hi, I've set up an initial run with ~800 genomes, using the diamond mode to speed things up. I used the following command:
/home/weisberga/Software/PIRATE-1.0.4/bin/PIRATE -i input/ -o /data/weisbeal/pirateout/ -t 32 --pan-opt '--diamond'
and it produced the output:
The run appears to proceed most of the way, up until the pangenome reinflation step. I get the following error message in the fail_test.txt file:
Reinflated sequences (3539249) does not match input number of sequences (4950432) at 50 threshold in sample pan_sequences.
If you would like, I can send some of the output files or the input genomes if that would help. Thanks!
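To put the size of the mismatch in that fail_test.txt message into numbers, a small sketch like the following could be used. `parse_reinflation_error` is a hypothetical helper, and the regular expression assumes the message has exactly the wording quoted above.

```python
import re

# Pattern assumes the exact wording of the message quoted in this thread.
_PATTERN = re.compile(
    r"Reinflated sequences \((\d+)\) does not match input number of "
    r"sequences \((\d+)\) at (\d+) threshold"
)


def parse_reinflation_error(message):
    """Extract the counts from the error message and compute the shortfall."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    reinflated, expected, threshold = (int(g) for g in match.groups())
    missing = expected - reinflated
    return {
        "reinflated": reinflated,
        "expected": expected,
        "threshold": threshold,
        "missing": missing,
        "missing_fraction": missing / expected,
    }
```

For the message above, this gives 1,411,183 missing sequences, roughly 28.5% of the input.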