SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

questions on output files #45

Closed limin321 closed 4 years ago

limin321 commented 4 years ago

Hi,

After running PIRATE, I have an output file called core_alignment.fasta. How many genes are included in this alignment, and which genes are they? Where can I find this information? I would like to know the individual genes that make up core_alignment.fasta. I would also like an amino acid alignment of the core genes. Can PIRATE create it for me, or do I have to translate it from core_alignment.fasta?

Thanks a lot. Best, LC

SionBayliss commented 4 years ago

The PIRATE.gene_families.tsv file has more information on the particular genes included in the alignment. Genes are considered to be core when they are present in >95% of all isolates. See the README for more details on the output files.

For an amino acid alignment, you can either concatenate the aligned amino acid sequences in the feature_sequences/ directory (*.aa.fasta) or translate the core alignment file directly. This file retains the triplet (codon) structure of the underlying amino acid alignment, so it should be suitable for translation by other software without any preprocessing.
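A minimal Python sketch of the second option, translating a codon-aligned nucleotide alignment while keeping the gaps in frame. It assumes the standard genetic code, that `---` marks a whole-codon gap, and that any other incomplete or ambiguous codon should become `X`. The helper names `translate_aligned` and `read_fasta` and the example sequences are made up for illustration and are not part of PIRATE.

```python
import io
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(nucleotides):
    """Translate a codon-aligned sequence; '---' becomes '-', unknown codons 'X'."""
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    residues = []
    for start in range(0, usable, 3):
        codon = seq[start:start + 3]
        residues.append("-" if codon == "---" else CODON_TABLE.get(codon, "X"))
    return "".join(residues)


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-sequence alignment standing in for core_alignment.fasta.
example = io.StringIO(">isolate_A\nATGGCT---TAA\n>isolate_B\nATGN-TGGGTAA\n")
translated = {name: translate_aligned(seq) for name, seq in read_fasta(example)}
print(translated)
```

To run it on real output, replace the `io.StringIO` example with `open("core_alignment.fasta")`. Whether stop codons and partially gapped codons should be kept, masked, or trimmed depends on the downstream analysis.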

S

limin321 commented 4 years ago

Hi S,

That's a very helpful suggestion. I think I can get all the core gene names from PIRATE.gene_families.tsv and core_alignment.fasta. However, there seems to be no direct way to filter the table for genes that are present in >95% of all isolates. Do you have any suggestions for how to do that? The only way I can think of right now is to BLAST the core gene alignment against the whole genome sequences to find out which genes are core.

I would like to extract the amino acid sequences of the core genes and concatenate them. However, when I open PIRATE.gene_families.tsv in Excel, the first two columns are allele_name (e.g. g031830_000006) and gene_family (e.g. g031830), whereas the files in the feature_sequences folder are named like g031830.aa.fasta.

Also, there is a g000001.aa.fasta file in the feature_sequences folder, but I don't see g000001 anywhere in PIRATE.gene_families.tsv.

[Image: Screen Shot 2020-07-27 at 2 24 38 PM]

In the picture, you can see that the smallest gene_family name in the table is g000091, yet there are plenty of *.aa.fasta files starting from g000001.aa.fasta. Why don't the gene_family names in the table match those in the feature_sequences folder?

That is why I am confused: after filtering the core genes in PIRATE.gene_families.tsv, which name should I use to match the *.aa.fasta files?

I am sorry for asking so many questions.

Best, LC

SionBayliss commented 4 years ago

Hi LC,

You just convert the 'number of genomes' column to a percentage using the number of input samples.
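A rough Python sketch of that conversion. The column names `number_genomes` and `gene_family` and the helper name `core_gene_families` are assumptions for illustration, so check them against the header of your own table. The strict `>` comparison follows the >95% definition given earlier in this thread, and the small table is made up.

```python
import csv
import io


def core_gene_families(tsv_handle, n_samples, min_fraction=0.95,
                       count_column="number_genomes", id_column="gene_family"):
    """Return gene family IDs found in more than min_fraction of the samples."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    core = []
    for row in reader:
        # Convert the genome count to a fraction of the input samples.
        fraction = int(row[count_column]) / n_samples
        if fraction > min_fraction:
            core.append(row[id_column])
    return core


# Made-up three-row table standing in for PIRATE.gene_families.tsv.
example = io.StringIO(
    "allele_name\tgene_family\tnumber_genomes\n"
    "g000001_1\tg000001\t100\n"
    "g000002_1\tg000002\t96\n"
    "g000003_1\tg000003\t40\n"
)
core = core_gene_families(example, n_samples=100)
print(core)
```

For a real run, pass `open("PIRATE.gene_families.tsv", newline="")` and the number of genomes given to PIRATE as `n_samples`. Reading the file with a script rather than a spreadsheet also avoids any row or display limits.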

The gene family column should start with g000001. Are you sure you haven't modified it in some way? The presence of alignments for g000001 in the feature sequences directory indicates that there should be an entry for it in the PIRATE.gene_families.tsv file (the script that creates the alignments uses the gene_families file as input). Alternatively, have you run PIRATE multiple times into the same output folder with different settings?
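One way to check this without opening the table in a spreadsheet is to compare the IDs in the table with the file names directly. The sketch below assumes a `gene_family` column and files named `<gene_family>.aa.fasta`, as described in this thread. The function name `compare_table_to_alignments` and the demo files are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def compare_table_to_alignments(table_path, feature_dir,
                                id_column="gene_family", suffix=".aa.fasta"):
    """Return (IDs in the table without a file, IDs with a file but no table row)."""
    with open(table_path, newline="") as handle:
        table_ids = {row[id_column] for row in csv.DictReader(handle, delimiter="\t")}
    # Strip the suffix from each alignment file name to recover the family ID.
    file_ids = {p.name[:-len(suffix)] for p in Path(feature_dir).glob("*" + suffix)}
    return sorted(table_ids - file_ids), sorted(file_ids - table_ids)


# Demo on a throwaway directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    table = tmp / "PIRATE.gene_families.tsv"
    table.write_text(
        "allele_name\tgene_family\n"
        "g000001_1\tg000001\n"
        "g000002_1\tg000002\n"
    )
    features = tmp / "feature_sequences"
    features.mkdir()
    (features / "g000001.aa.fasta").write_text(">a\nM\n")
    (features / "g000003.aa.fasta").write_text(">a\nM\n")
    missing_files, missing_rows = compare_table_to_alignments(table, features)

print("in table, no alignment:", missing_files)
print("alignment, not in table:", missing_rows)
```

If the second list is empty on the real output, the table and the alignments agree and any apparent mismatch comes from how the table is being viewed.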

I recommend that you read the README thoroughly. The answers to most of your questions are there.

S

limin321 commented 4 years ago

Hi S,

Thank you so much for the suggestion. I double-checked: Excel ran into some issues and was unable to display all of the data. Sorry for the inconvenience.

Best,