SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
91 stars 29 forks source link

README.md options #63

Open haruosuz opened 3 years ago

haruosuz commented 3 years ago

I have questions about options in https://github.com/SionBayliss/PIRATE/blob/master/README.md

Usage

Basic examples

Running PIRATE with the options -a -r generated the core_alignment.fasta and pangenome_alignment.fasta files, and PIRATE_plots.pdf file in which Tree in page 8 (Number of gene families per sample) and page 9 (Pangenome cluster presence/absence) is drawn using the binary_presence_absence.nwk file. Is it possible to generate PIRATE_plots.pdf in which Tree is drawn using a Newick file obtained by performing fasttree on core_alignment.fasta?

The core_alignment.fasta file contained many "N" (e.g. "TNNNNNNNNNA") although input nucleotide sequence data contained only alphabet of "ACGT".

Is there any difference between the options Global: -s|--steps and Clustering options: -s|--steps?

 Global:
 -s|--steps     % identity thresholds to use for pangenome construction
                [default: 50,60,70,80,90,95,98]

 -n|--nucl      CDS are not translated to AA sequence [default: off]
    Clustering options:

    -s|--steps      multiple % id thresholds to use for pangenome 
                    construction, comma seperated 
                    [default: 50,60,70,80,90,95,98]
    -n|--nucl       create pangenome on nucleotide sequence 
                    [default: amino acid]

Advanced examples

In this paper (https://academic.oup.com/gigascience/article/8/10/giz119/5584409), "A default MCL inflation value of 2" was used for intra-species clustering (Figure 3. complete Staphylococcus aureus genomes), while "an MCL inflation value of 6" was used for intra-species clustering (Figure 4. Pseudomonas complete genomes) and inter-species comparisons (Figure 5. Prochlorococcus marinus draft genomes). The MCL inflation value of 2 or 6 was chosen based on the previous studies for these bacterial taxa?

I presume a default MCL inflation value was changed from 2 to 1.5.

 MCL options:
    -f|--flat       mcl inflation value [default: 1.5]
PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 --e 1E-12 --hsp-len 0.5"

should be changed to the following?

PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 -e 1E-12 --hsp-len 0.5"
PIRATE -i /path/to/gff/files/ -s "95,96,97,98,99,100" -k "--cd-low 100 --evalue 1E-12 --hsp-len 0.5"

https://github.com/SionBayliss/PIRATE#piratetsv-file-format

PIRATE.*.tsv file format

21-22/ synteny_cluster/synteny_cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.tsv.

should be changed to the following?

21-22/ cluster/cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.ordered.tsv.

Support Scripts

Subset Outputs

Subsample PIRATE.gene_families.ordered.tsv file and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.gene_families.ordered.tsv

should be changed to the following?

Subsample PIRATE.*.tsv files and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.*.tsv

PIRATE.gene_families.tsv
PIRATE.gene_families.ordered.tsv
PIRATE.unique_alleles.tsv
PIRATE.genomes_per_allele.tsv

Subset alignments

PIRATE/tools/subsetting/subset_alignments.pl --help
Usage:
     subset_alignments.pl -i /path/to/PIRATE.gene_families.tab -f /path/to/sequence/alignments/

# should be changed to the following?

     subset_alignments.pl -i /path/to/PIRATE.gene_families.tsv -f /path/to/sequence/alignments/
subset_alignments.pl -i /path/to/PIRATE.gene_families.tab[PIRATE.unique_alleles.tsv] -f /path/to/PIRATE/feature_sequences/ -o ./path/to/output_directory/

# should be changed to the following?

subset_alignments.pl -i /path/to/PIRATE.unique_alleles.tsv -f /path/to/PIRATE/feature_sequences/ -o ./path/to/output_directory/

Running the command printed the following messages and generated empty output_directory.

 - 0 of 0 genes to be processed
 - 100 % clusters added to output

identify representative sequences for gene families/alleles

Identify the representative sequence for each cluster in a PIRATE.*.tsv file. The file can be found at PIRATE/scripts/representative_sequences.pl.

PIRATE/scripts/representative_sequences.pl should be the following file?

PIRATE/tools/subsetting/select_representative
PIRATE/scripts/select_representative.pl

Unique gene sequences

unique_sequences.pl -i /path/to/gene.fas -p /path/to/PIRATE.gene_families.tab -o /path/to/output_dir/

should be the following?

unique_sequences.pl -i /path/to/PIRATE/feature_sequences/g000000002.aa.fasta -p /path/to/PIRATE.gene_families.tsv -o /path/to/output_dir/
unique_sequences.pl -i /path/to/PIRATE/feature_sequences/g000000002.nucleotide.fasta -p /path/to/PIRATE.gene_families.tsv -o /path/to/output_dir/

Convert to roary file

PIRATE_to_roary.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.csv

should be the following?

PIRATE_to_roary.pl -i /path/to/PIRATE.*.tsv -o /path/to/output_file.csv

Convert to binary presence-absence or count

Convert PIRATE.*.tsv to "gene/allele presence-absence" and "paralog presence-absence" tsv files can be done with the same command?

# gene/allele presence-absence
PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.tsv

# paralog presence-absence
PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv/ -o /path/to/output_file.tsv

should be the following?

PIRATE_to_Rtab.pl -i /path/to/PIRATE.*.tsv -o /path/to/output_file.tsv
SionBayliss commented 3 years ago

Thanks for the feedback on the README.

Subset alignments functionality changed during development and will not function on unique_alleles, I will update the README accordingly.

All the best, Sion

haruosuz commented 3 years ago
haruosuz commented 3 years ago

I would be grateful if you could provide examples for Subset alignments.

Running the command printed the following messages and generated empty output_directory.

input=${DIRECTORY}/PIRATE.gene_families.tsv
fasta=${DIRECTORY}/feature_sequences/
output=output_directory
PIRATE/tools/subsetting/subset_alignments.pl -i "${input}" -f "${fasta}" -o "${output}"

 - 0 of 0 genes to be processed
 - 100 % clusters added to output
SionBayliss commented 3 years ago

This looks like a bug. I will try and fix it when I get a spare minute or two.

Thanks, Sion

haruosuz commented 3 years ago

https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files

  • [optional -a] feature_sequences directory - a directory containing all amino acid and nucleotide sequences for each gene family (aligned using MAFFT).

Running PIRATE with -a|--align option generates a feature_sequences/ directory missing amino acid and nucleotide sequences for some gene families.

In the following case, the feature_sequences/ directory contained amino acid and nucleotide sequences for only 104 of the 139 gene families; i.e., amino acid and nucleotide sequences for 35 gene families (g0001, g0002, g0003, g0004, g0006, g0010, ...) are missing in the feature_sequences/ directory.

$tail -n +2 PIRATE.gene_families.tsv | wc -l
     139

$ls feature_sequences/*.aa.fasta | wc -l
     104

$cat PIRATE.gene_families.tsv | cut -f2 | sort | head
g0001
g0002
g0003
g0004
g0005
g0006
g0007
g0008
g0009
g0010

$ls feature_sequences/*.aa.fasta | head
feature_sequences/g0005.aa.fasta
feature_sequences/g0007.aa.fasta
feature_sequences/g0008.aa.fasta
feature_sequences/g0009.aa.fasta
feature_sequences/g0011.aa.fasta
feature_sequences/g0014.aa.fasta
feature_sequences/g0015.aa.fasta
feature_sequences/g0016.aa.fasta
feature_sequences/g0017.aa.fasta
feature_sequences/g0018.aa.fasta

5931919.zip

In another case, running the following command generated the feature_sequences/ directory which contained amino acid and nucleotide sequences for only 189 of the 202 gene families.

PIRATE --input ${dir_gff} --output ${dir_pirate} --steps "0,10,20,30,40,50,60,70,80,90,95,98" --pan-opt "--cd-low 98 --evalue 1E-6 --hsp-len 0 --flat 1.5" --para-off --align --rplots --threads $(getconf _NPROCESSORS_ONLN) -z 2
SionBayliss commented 3 years ago

That is most likely due to PIRATE excluding genes with high copy numbers. If genes are highly fragmented then alignment will likely be problematic. This can be adjusted accordingly using the --dosage option for align_feature_sequences.pl and create_pangenome_alignment.pl. The default is 1.25.

You can raise the threshold, but I would question the utility of including these genes in an alignment for certain applications (e.g. core genome trees).

haruosuz commented 2 years ago

I was wondering if descriptions for output files such as genome2loci.tab and loci_list.tab could be added below? https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files

The number of loci (CDS) can be obtained from the output files?

https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC6785682/

The pangenome comprised 2,858,820 loci clustered into 102,425 gene clusters of which 1,841 (1.8%) were considered core (present in >95% of isolates) (Fig. 4A). The pangenome comprised 91,593 loci clustered into 8,325 gene clusters of which 867 (10.41%) were considered core (present in >95% of isolates) (Fig. 5A).