SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0

Feature request: Option to include original IDs and annotations in fasta headers for align_features_sequences script #76

Closed alexweisberg closed 2 years ago

alexweisberg commented 2 years ago

It would be very useful to have an option to produce fasta files for each cluster that include the original locus tag and the input file name/strain it came from (minus the .fna/.fasta extension), and potentially annotation information such as gene name and product. These could be separated by tabs or some other delimiter. For example, one header could look something like:

>AS1B4_0005 prev_locus:AS1B4_0004 prev_ID:AS1B4_0004 strain:AS1B4 annotation:gyrB, DNA gyrase B

or something like that. That way it would be easy to work with these files and make alignments that contain the original locus tag for that gene so that I can link it back to other information for that genome.

Given that there is a significant overlap between the newly generated locus tags and the original ones, I can't just use a search and replace for the locus tags in all of the files to change them back to the original IDs.

Thanks!

SionBayliss commented 2 years ago

Hi Alex,

You can rename locus tags/IDs to their original values using the following script:

/tools/subsample/subsample_outputs.pl -i PIRATE.gene_families.ordered.tsv -o PIRATE.gene_families.ordered.renamed.tsv --field "prev_locus"

It won't transfer annotation to the alignments, but it gives you an easy way to look up or convert your previous locus tags, and it could form the basis of a script to do so.
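As a rough illustration of what such a script might look like, here is a minimal Python sketch that relabels FASTA headers from a lookup table. The `id_map` dictionary and the function name are hypothetical; in practice the mapping would have to be built from PIRATE's outputs, whose exact layout is not assumed here. Each header is rewritten once from its original text, which avoids the double-replacement problem that a global search and replace runs into when new and old locus tags overlap.

```python
def relabel_fasta(fasta_text, id_map):
    """Replace the ID at the start of each FASTA header using id_map.

    Every header is looked up exactly once against its original ID, so an ID
    that is both a new tag and another gene's old tag is never swapped twice.
    IDs missing from id_map are left unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            rest = parts[1] if len(parts) > 1 else ""
            new_id = id_map.get(seq_id, seq_id)
            line = ">" + new_id + (" " + rest if rest else "")
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical mapping: PIRATE-assigned ID -> original locus tag.
# Note the overlap: AS1B4_0004 is both a new ID and an old one.
id_map = {"AS1B4_0005": "AS1B4_0004", "AS1B4_0004": "AS1B4_0003"}
fasta = ">AS1B4_0005\nATGC\n>AS1B4_0004\nGGCC\n"
relabelled = relabel_fasta(fasta, id_map)
print(relabelled)
```

Because the lookup is keyed on the header's original ID, the order of entries in the mapping does not matter.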

All the best, Sion

alexweisberg commented 2 years ago

Dear Sion,

Thank you. I have modified the align_feature_sequences.pl script to include an option for printing additional information to the headers of the fasta alignments. When using the "--full-annot" option, the output looks something like this:

>Rhizobium_jaguaris_CCGE525_00216 prev_ID:CCGE525_35950 prev_locus:CCGE525_35950 genome:Rhizobium_jaguaris_CCGE525 gene:ehuB protein_id:AYG64154.1 product:ectoine/hydroxyectoine ABC transporter substrate-binding protein EhuB
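For anyone consuming headers in this shape downstream, a minimal Python sketch of a parser follows. It is not part of PIRATE or the modified script; the key list is copied from the example header above, and the function name is hypothetical. Since values such as `product` contain spaces, the sketch splits on the known `key:` prefixes rather than on whitespace.

```python
import re

# Keys taken from the example header; any other key would need adding here.
KEYS = ("prev_ID", "prev_locus", "genome", "gene", "protein_id", "product")
KEY_RE = re.compile(r"(?:^|\s)(" + "|".join(KEYS) + r"):")


def parse_header(header):
    """Split a '--full-annot' style header into (sequence_id, fields)."""
    text = header.lstrip(">").strip()
    matches = list(KEY_RE.finditer(text))
    if not matches:
        return text, {}
    seq_id = text[:matches[0].start()].strip()
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its key to the start of the next key.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip()
    return seq_id, fields


header = (">Rhizobium_jaguaris_CCGE525_00216 prev_ID:CCGE525_35950 "
          "prev_locus:CCGE525_35950 genome:Rhizobium_jaguaris_CCGE525 "
          "gene:ehuB protein_id:AYG64154.1 product:ectoine/hydroxyectoine "
          "ABC transporter substrate-binding protein EhuB")
seq_id, fields = parse_header(header)
print(seq_id, fields["prev_locus"], fields["product"])
```

One caveat of this approach: a product description that itself contains a known key followed by a colon would be split incorrectly, so a tab-delimited header would be more robust if that is a concern.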

I've attached that modified script to this message: align_feature_sequences_mod.pl.gz

As an aside, I have found that NCBI GFF files largely work correctly in PIRATE if I first run them through the AGAT (https://github.com/NBISweden/AGAT) script "agat_convert_sp_gxf2gxf.pl" and then manually add the ##FASTA block to the end of the file. This way I can include the original locus tags and annotation from NCBI without reannotating the genomes with Prokka.
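The manual step of adding the ##FASTA block can be scripted. The following is a minimal Python sketch, assuming the AGAT-converted GFF and the matching genome FASTA are already available as text; the function name and the tiny example records are hypothetical. It relies only on the GFF3 convention that a `##FASTA` line marks the start of embedded sequences.

```python
def append_fasta_block(gff_text, fasta_text):
    """Return GFF3 text with the genome sequence embedded after a ##FASTA line."""
    if any(line.strip() == "##FASTA" for line in gff_text.splitlines()):
        return gff_text  # already has an embedded sequence section
    return gff_text.rstrip("\n") + "\n##FASTA\n" + fasta_text.rstrip("\n") + "\n"


# Hypothetical minimal inputs; real use would read the two files from disk.
gff = "##gff-version 3\nchr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n"
fasta = ">chr1\nATGC\n"
combined = append_fasta_block(gff, fasta)
print(combined)
```

The sequence IDs in the FASTA block need to match the first column of the GFF records, which the sketch does not check.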

Best, Alex