Open haruosuz opened 3 years ago
Thanks for the feedback on the README.
"Is it possible to generate PIRATE_plots.pdf in which Tree is drawn using a Newick file obtained by performing fasttree on core_alignment.fasta?" This can be done by passing a tree, in this case the tree generated by fasttree, to plot_PIRATE.R as an optional second argument e.g.
Rscript plot_PIRATE.R path/to/PIRATE_results_dir/ path/to/example_tree.nwk
"The core_alignment.fasta file contained many "N" (e.g. "TNNNNNNNNNA") although input nucleotide sequence data contained only alphabet of "ACGT"." Ns are added to represent sequence/genes missing in individual isolates e.g. if genomeA does not have a copy of gene1 then the length of the alignment for gene1 will be represented by Ns.
"Is there any difference between the options Global: -s|--steps and Clustering options: -s|--steps? " No
Subset alignments functionality changed during development and will not function on unique_alleles, I will update the README accordingly.
All the best, Sion
plot_PIRATE.R
should be plot_summary.R
?
$git clone https://github.com/SionBayliss/PIRATE
$find PIRATE -name "plot_PIRATE.R"
$find PIRATE -name "*.R"
PIRATE/scripts/plot_summary.R
I would be grateful if you could provide examples for Subset alignments.
Running the command printed the following messages and generated empty output_directory.
input=${DIRECTORY}/PIRATE.gene_families.tsv
fasta=${DIRECTORY}/feature_sequences/
output=output_directory
PIRATE/tools/subsetting/subset_alignments.pl -i "${input}" -f "${fasta}" -o "${output}"
- 0 of 0 genes to be processed
- 100 % clusters added to output
This looks like a bug. I will try and fix it when I get a spare minute or two.
Thanks, Sion
https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files
- [optional -a] feature_sequences directory - a directory containing all amino acid and nucleotide sequences for each gene family (aligned using MAFFT).
Running PIRATE with -a|--align
option generates a feature_sequences/ directory missing amino acid and nucleotide sequences for some gene families.
In the following case, the feature_sequences/ directory contained amino acid and nucleotide sequences for only 104 of the 139 gene families; i.e., amino acid and nucleotide sequences for 35 gene families (g0001, g0002, g0003, g0004, g0006, g0010, ...) are missing in the feature_sequences/ directory.
$tail -n +2 PIRATE.gene_families.tsv | wc -l
139
$ls feature_sequences/*.aa.fasta | wc -l
104
$cat PIRATE.gene_families.tsv | cut -f2 | sort | head
g0001
g0002
g0003
g0004
g0005
g0006
g0007
g0008
g0009
g0010
$ls feature_sequences/*.aa.fasta | head
feature_sequences/g0005.aa.fasta
feature_sequences/g0007.aa.fasta
feature_sequences/g0008.aa.fasta
feature_sequences/g0009.aa.fasta
feature_sequences/g0011.aa.fasta
feature_sequences/g0014.aa.fasta
feature_sequences/g0015.aa.fasta
feature_sequences/g0016.aa.fasta
feature_sequences/g0017.aa.fasta
feature_sequences/g0018.aa.fasta
In another case, running the following command generated the feature_sequences/ directory which contained amino acid and nucleotide sequences for only 189 of the 202 gene families.
PIRATE --input ${dir_gff} --output ${dir_pirate} --steps "0,10,20,30,40,50,60,70,80,90,95,98" --pan-opt "--cd-low 98 --evalue 1E-6 --hsp-len 0 --flat 1.5" --para-off --align --rplots --threads $(getconf _NPROCESSORS_ONLN) -z 2
That is most likely due to PIRATE excluding genes with high copy numbers. If genes are highly fragmented then alignment will likely be problematic. This can be adjusted accordingly using the --dosage option for align_feature_sequences.pl and create_pangenome_alignment.pl. The default is 1.25.
You can raise the threshold, but I would question the utility of including these genes in an alignment for certain applications (e.g. core genome trees).
I was wondering if descriptions for output files such as genome2loci.tab and loci_list.tab could be added below? https://github.com/SionBayliss/PIRATE/blob/master/README.md#output-files
The number of loci (CDS) can be obtained from the output files?
https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC6785682/
The pangenome comprised 2,858,820 loci clustered into 102,425 gene clusters of which 1,841 (1.8%) were considered core (present in >95% of isolates) (Fig. 4A). The pangenome comprised 91,593 loci clustered into 8,325 gene clusters of which 867 (10.41%) were considered core (present in >95% of isolates) (Fig. 5A).
I have questions about options in https://github.com/SionBayliss/PIRATE/blob/master/README.md
Usage
Basic examples
-a|--align
-r|--rplots
Running PIRATE with the options
-a -r
generated the core_alignment.fasta and pangenome_alignment.fasta files, and PIRATE_plots.pdf file in which Tree in page 8 (Number of gene families per sample) and page 9 (Pangenome cluster presence/absence) is drawn using the binary_presence_absence.nwk file. Is it possible to generate PIRATE_plots.pdf in which Tree is drawn using a Newick file obtained by performingfasttree
on core_alignment.fasta?The core_alignment.fasta file contained many "N" (e.g. "TNNNNNNNNNA") although input nucleotide sequence data contained only alphabet of "ACGT".
--pan-opt
should be changed to the following?
-s|--steps
-n|--nucl
Is there any difference between the options
Global: -s|--steps
andClustering options: -s|--steps
?Advanced examples
-f|--flat
In this paper (https://academic.oup.com/gigascience/article/8/10/giz119/5584409), "A default MCL inflation value of 2" was used for intra-species clustering (Figure 3. complete Staphylococcus aureus genomes), while "an MCL inflation value of 6" was used for intra-species clustering (Figure 4. Pseudomonas complete genomes) and inter-species comparisons (Figure 5. Prochlorococcus marinus draft genomes). The MCL inflation value of 2 or 6 was chosen based on the previous studies for these bacterial taxa?
I presume a default MCL inflation value was changed from 2 to 1.5.
-e|--evalue
should be changed to the following?
https://github.com/SionBayliss/PIRATE#piratetsv-file-format
PIRATE.*.tsv file format
21-22/ synteny_cluster/synteny_cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.tsv.
should be changed to the following?
21-22/ cluster/cluster_order - The syntenic cluster the gene_family has been assigned to and the corresponding order within the cluster. NOTE: these columns are only present in PIRATE.gene_families.ordered.tsv.
Support Scripts
Subset Outputs
Subsample PIRATE.gene_families.ordered.tsv file and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.gene_families.ordered.tsv
should be changed to the following?
Subsample PIRATE.*.tsv files and rename loci in output. Allows for recalculation of number of genomes gene_families are present in PIRATE.*.tsv
Subset alignments
Running the command printed the following messages and generated empty output_directory.
identify representative sequences for gene families/alleles
Identify the representative sequence for each cluster in a PIRATE.*.tsv file. The file can be found at PIRATE/scripts/representative_sequences.pl.
PIRATE/scripts/representative_sequences.pl should be the following file?
Unique gene sequences
should be the following?
Convert to roary file
should be the following?
Convert to binary presence-absence or count
Convert PIRATE.*.tsv to "gene/allele presence-absence" and "paralog presence-absence" tsv files can be done with the same command?
should be the following?