SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
88 stars 29 forks source link

need suggestions on my PIRATE commend. #50

Closed limin321 closed 3 years ago

limin321 commented 3 years ago

Hello I run PIRATE on my around 300 bacterium genomes, using the following command:

PIRATE -i ./311gff/ -o ./311Agro_panOut -a -r -k "-f 6" -t 40   

Which is, I used default identity thresholds. The ~300 strains ANI values ranges from 77.556 to 100 %. There is one gene called tolB which are highly identical > 98%, however, in the PIRATE_gene_presence_absence.ordered.tsv table, these three genes were classified into three gene_families. e.g. tolB gene_family g000845, the cutoff PIRATE used is 50%; tolB in g032043_000006 gene family, the cutoff is 98%; tolB in g044923_000006 gene family, the cutoff is 98%.

However, the interesting is: g032043_000006 is only found in one species(containing 29 strains), and g044923_000006 gene family is also only found in one specific species (only three strains), though g000845, g032043, g044923 are highly similar > 98.8%. Feels like the default setting somehow is able to put genes from the same species in one gene family, at least for gene tolB in these two species. I also checked tolB in other species, then I got the 50% cutoff gene family g000845. and then how to interpret why the three gene family are not cluster as one gene family, and one even was clustered with other genes with 50% cutoff. I am really confused on how to interpret this data. I am wondering if you can give some suggestions on how to modify my PIRATE codes. I would like to try different things to compare based on your experience. Actually I sent you email though never get your reply. Now I am posting here to ask for help again. I am sorry if my question is too basic to ask. Really appreciate your help.

Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

It has to do with how PIRATE splits gene families containing paralogous genes. The largest gene family is likely to be present at high frequency (g000845). You will most likely find that the gene families g032043_000006 and g044923_000006 are present in genomes which contain a copy of g000845. PIRATE has identified them as multi-copies of g000845 and split them from the main gene family accordingly. You can disable this behaviour with the '--para-off' flag but no paralogs will be identified.

Sorry I didn't get your email but support on here is preferred as it can be used to help future users.

All the best, Sion

limin321 commented 3 years ago

Hi Limin,

It has to do with how PIRATE splits gene families containing paralogous genes. The largest gene family is likely to be present at high frequency (g000845). You will most likely find that the gene families g032043_000006 and g044923_000006 are present in genomes which contain a copy of g000845. PIRATE has identified them as multi-copies of g000845 and split them from the main gene family accordingly. You can disable this behaviour with the '--para-off' flag but no paralogs will be identified.

Sorry I didn't get your email but support on here is preferred as it can be used to help future users.

All the best, Sion

Hi Sion, You are right, gene family g000845 can be found in all strains. Then I did more research, here is what I found. g044923, g000845, g008718 are three different gene families. GenomeA contains g044923_000006 with gene called Sta004_03864, and g000845 with gene called Sta004_03739; GenomeB contains g008718(I copied it wrong as g032043) with gene called Kin002_04268 and g000845 with gene called Kin002_04147.

why does PIRATE put Sta004_03739, Kin002_04147 and CC001_03341 (all three in the same gene family g000845) in the same gene family though the cutoff is only 50%, instead of putting Sta004_03864, Kin002_04268 and CC001_03341 (each in its own gene family, but AA sequence similarity greater than 98 %)

You said " It has to do with how PIRATE splits gene families containing paralogous genes. " do you mean in genome A, Sta004_03864 and Sta004_03739 are paralogs? In genome B, Kin002_04268 and Kin002_04147 are paralogs? I put the AA sequences of all 5 genes below.

In the PIRATE github instruction, you explained that in gene_family column, _1, _2 indicating they are paralogs. However, in my genome A, Sta004_03864 and Sta004_03739 are not labeled as _1, _2. so as in genome B, indicating they are not paralogs. If they are not paralogs, then how to explain why the most similar genes (Sta004_03864, Kin002_04268 and CC001_03341) are not grouped together as one gene family?

by the way, in gene_family column, _1, _2 etc indicates paralogs. then in allele_name column, like g044923_000006, what is the meaning of "_000006". There are many allele_names have this type of dash tail numbers.

###########

Sta004_03864_tolB_g044923_000006_29/311_98cutoff_cluster_7494 MKIFSPIRLVLAVAALMSVFSAPAFARVEININKGNVEPLPIAI TDFISGDGIGAQVSAVIAADLQRSGLFAPVNKSAFIEKIANPDQQPRFEDWKVINAQA LVTGRVTRESDGRLRAEFRLWDTFAGQQMTGQQFFTQPENWRRVAHIIADAIYERVTG EKGYFDTRVVFVSESGPKTARKRQLSIMDQDGFNVRNLTNSNDIVLTPRFSPNRQEVT YMSFEGQQPRVYLLQLETGQREVVGNFPGMTFSPRFSPDGQKVIMSLQQDGNANIYTM DLRSRTTTRLTNTSAIDTAPSFSPDGQRVAFESDRGGRQQIYVMNADGSGQQRISFGD GSYSTPVWSPRGDLIAFTKQSGGKFSIGVMKTDGSGERILTSGFHNEGPTWAPNGRVL MFFRQNAGAGGPQLYSIDLTGYNEQLIKTPTFASDPAWSPLLD

Kin002_04268_tolB_g008718_000001__g008718_14/311_60cutoff_cluster_9900 MKIFSPIRLVLAVAALMSVFSAPAFARVEININKGNVEPLPIAI TDFMSSDGIGAQVSAVIAADLQRSGLFAPVNKSAFIEKISNPDQQPRFEDWKVINAQA LVTGRVTRESDGRLRAEFRLWDTFANQQMTGQQFFTQPENWRRVAHIIADAIYERITG EKGYFDTRVVFVSESGPKTARKRQLSIMDQDGFNVRNLTNSNDIVLTPRFSPNRQEVT YMSFEGQQPRVYLLQLETGQREVVGNFPGMTFSPRFSPDGQKVIMSLQQDGNANIYTM DLRSRTTTRLTNTSAIDTAPSYSPDGQRVAFESDRGGRQQIYVMNADGSGQQRVSFGD GSYSTPVWSPRGDLIAFTKQSGGKFSIGVMKTDGSGERILTSGFHNEGPTWAPNGRVL MFFRQNAGAGGPQLYSIDLTGYNEQLIKTPTFASDPAWSPLLD

CC001_03341_tolB_g000845__g000845_core_gene_311/311_50cutoff_cluster_273 MKIFSPIRLVLAVAALMSVFSAPAFARVEININKGNVEPLPIAI TDFMSSDGIGAQVSAVIAADLQRSGLFAPVNRNAFIEKISNPDQQPRFEDWKVINAQA LVTGRVTRESDGRLRAEFRLWDTFANQQMTGQQFFTQPENWRRVAHIIADAIYERVTG EKGYFDTRVVFVSESGPKTARKRQLSIMDQDGFNVRNLTNSNDIVLTPRFSPSRQEVT YMSFEGQQPRVYLLQLETGQREVVGNFPGMTFSPRFSPDGQKVIMSLQQDGNANIYTM DLRSRTTTRLTNTSAIDTAPSYSPDGQRVAFESDRGGRQQIYVMNADGSGQQRISFGD GSYSTPVWSPRGDLIAFTKQSGGKFSIGVMKTDGSGERILTSGFHNEGPTWAPNGRVL MFFRQNAGAGGPQLYSIDLTGYNEQLIKTPTFASDPAWSPLLD

Kin002_04147_g000845 MKIAIDPHMHRFTSLDALPAKVKELGFDWIELSPRADFLEWFKA PRVFPERIRSFKKALKDADVGIAALLPMYRWASNDETERQAAVKHWKRAIEIAIELEV DTMNSEFGRGPHPDKGSCYCCHTGSMIEACEDAWWRSMEELVPIFERENINLHVEPHP EDWCETLQPALDIIRTVNSKNVKFLYCAPHTFYFGDDTKAMLREAKDVLAHVHVGDTF NHKASSGLRYILNPPGTQARVHQH

Sta004_03739_g000845 MSNIIGITMGDPCGVGPEITVRALAEMSAPDREATRIYGNLATL EAARGALGIDIDLTRNVVDLPVEGAPLPWGTLSPVAGDAAFRFIEKAVRDAEAGAIGC IVTAPINKEALNLAGHHYDGHTGMLRSLTGSSAAYMLLASERLKVIHVSTHVSLQEAI QRATTERVLATIRAGNAHLKRIGYERPRIAVAGINPHCGENGLFGTEDDDQIAPAVAA ARAEGIEVHGPISADTVFHRAYSGTFDLVVAQYHDQGHIPIKLVAFDTAVNVSVDLPI DRTSVDHGTAFDIAGKGIANHGNMNEAIAYARKLVAGKGARG ##############

I also run PIRATE again, with --para-off flags. It generate 49354 gene families. (with --para-on, it generate 52233 gene families.) PIRATE -i ./311gff/ -o ./311Agro_panOut --para-off -a -r -k "-f 6" -t 40 The three very similar genes: Sta004_03864, Kin002_04268 and CC001_03341 are still not in the same gene family , with Sta004_03864 in g048671_000006, Kin002_04268 in g008859_000001, with CC001_03341 in g000845. In my understand, if we run blast, these three genes should be grouped together because they have so high sequence identity.

I also run PIRATE with self-defined percent, here is the code, this generates 67276 gene families. PIRATE -i ./311gff/ -o ./311Agro_panOut -s "65,70,75,80,85,90,95" -a -r -k "-f 6" -t 40

Now I am a little lost on how to adjust my PIRATE command to run the pangenome.

I am sorry for the long statements, could you please help me clarify my questions? is my understanding on PIRATE wrong?? I really need your suggestion on how to improve my PIRATE codes for this pan-genome analysis. Thank you so much. Really appreciate you being so patient with me.

Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

Could you attach your three genomes of interest and the PIRATE.gene_families.tsv file please.

The gene_families denotes an identifier for the cluster which including alleles at all AA/MCL thresholds. The allele column denotes a specific cluster of loci at a single AA threshold. In the PIRATE.gene_families.tsv file this is functionally identical information but the distinction is important for interpretation of the PIRATE.unique_alleles.tsv file.

All the best, Sion

limin321 commented 3 years ago

Hi Limin,

Could you attach your three genomes of interest and the PIRATE.gene_families.tsv file please.

The gene_families denotes an identifier for the cluster which including alleles at all AA/MCL thresholds. The allele column denotes a specific cluster of loci at a single AA threshold. In the PIRATE.gene_families.tsv file this is functionally identical information but the distinction is important for interpretation of the PIRATE.unique_alleles.tsv file.

All the best, Sion Hi Sion

Thank you for trying to help. considering my data is unpublished yet, can I email you? s.bayliss@bath.ac.uk is this email address working. this is the one I sent last time that you seems not able to get it. I just want to confirm it.

Best, Limin

limin321 commented 3 years ago

Hi Limin, Could you attach your three genomes of interest and the PIRATE.gene_families.tsv file please. The gene_families denotes an identifier for the cluster which including alleles at all AA/MCL thresholds. The allele column denotes a specific cluster of loci at a single AA threshold. In the PIRATE.gene_families.tsv file this is functionally identical information but the distinction is important for interpretation of the PIRATE.unique_alleles.tsv file. All the best, Sion Hi Sion

Hi Sion

I just emailed you my data, please let me know if you do not get the email.

Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

I haven't received it. Could you send them to s_bayliss(at)hotmail.co.uk. It will probably be next week before I can look at this.

All the best, Sion

limin321 commented 3 years ago

s_bayliss(at)hotmail.co.uk

Hi Sion

I just sent again to the new email. Please let me know if you don't get it.

Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

I think I have found the source of your confusion. PIRATE renames loci to ensure consistent parsable nomenclature (see README). When working with raw PIRATE outputs you should refer to the ./modified_gff/*gff for the position and names of features. If you want to convert the PIRATE outputs to have the original locus_tags/IDs you can use the command:

 ~/Desktop/github/PIRATE/tools/subsetting/subsample_outputs.pl -i ./PIRATE.gene_families.tsv -g ./modified_gffs/ -o PIRATE.gene_families.renamed.tsv --field "prev_ID"

This workflow is not ideal but it means that PIRATE is compatible with as many datasets as possible (I have seen some extremely non-standard locus_tags). I ran your dataset and after converting to original locus tags CC001_03341 - Kin002_04269 - Sta004_03864 clustered together.

All the best, Sion

limin321 commented 3 years ago

Hi Limin,

I think I have found the source of your confusion. PIRATE renames loci to ensure consistent parsable nomenclature (see README). When working with raw PIRATE outputs you should refer to the ./modified_gff/*gff for the position and names of features. If you want to convert the PIRATE outputs to have the original locus_tags/IDs you can use the command:

 ~/Desktop/github/PIRATE/tools/subsetting/subsample_outputs.pl -i ./PIRATE.gene_families.tsv -g ./modified_gffs/ -o PIRATE.gene_families.renamed.tsv --field "prev_ID"

This workflow is not ideal but it means that PIRATE is compatible with as many datasets as possible (I have seen some extremely non-standard locus_tags). I ran your dataset and after converting to original locus tags CC001_03341 - Kin002_04269 - Sta004_03864 clustered together.

All the best, Sion

Thank you so much, I just got time working on this again, could you please do not close this issue for now, until I finished my analysis. Thanks a lot. Best, Limin

limin321 commented 3 years ago

Hi Limin,

I think I have found the source of your confusion. PIRATE renames loci to ensure consistent parsable nomenclature (see README). When working with raw PIRATE outputs you should refer to the ./modified_gff/*gff for the position and names of features. If you want to convert the PIRATE outputs to have the original locus_tags/IDs you can use the command:

 ~/Desktop/github/PIRATE/tools/subsetting/subsample_outputs.pl -i ./PIRATE.gene_families.tsv -g ./modified_gffs/ -o PIRATE.gene_families.renamed.tsv --field "prev_ID"

This workflow is not ideal but it means that PIRATE is compatible with as many datasets as possible (I have seen some extremely non-standard locus_tags). I ran your dataset and after converting to original locus tags CC001_03341 - Kin002_04269 - Sta004_03864 clustered together.

All the best, Sion

Hi Sion,

Thank you so much. After converting the original table, everything works well so far. I have one more question, one of the pirate output file called pangenome.gfa, representing all unique connections between gene families. Could you please explain a little bit more, what are those unique connections? Because I used the gene_presence_absence.csv table to run coinfinder, which run association analysis between gene families. Then I was thinking what is the difference of the unique connections from PIRATE pangenome.gfa and associate groups from coinfinder?

Thanks a lot. Best, Limin

SionBayliss commented 3 years ago

Hi Limin,

The gfa file specification can be found here - https://github.com/GFA-spec/GFA-spec. You can load the file into Bandage (https://rrwick.github.io/Bandage/) to visualise a graph of your pangenome. I have found it useful. I will close this issue as it has been resolved.

Cheers, Sion