epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0

--add-relatives outputs unexpected sequence ID #99

Open glajoie1 opened 3 years ago

glajoie1 commented 3 years ago

Hello,

I have been using SINA on a FASTA file of 16S sequences with the following command line to obtain an alignment that includes neighbour sequences, as in the online ACT implementation for small sequence sets. The reference database was downloaded from SILVA.

sina -i ~/asv_ps20.fa -r ~/SILVA_138.1_SSURef_NR99_12_06_20_opt.arb -o aligned.fasta.gz -o aligned.csv --add-relatives=15

In the output alignment file, I was expecting the 'relatives' sequences to correspond to the reference sequences identified in the align_filter_slv column of the output (e.g. JF769553.1, KJ855315.1), but instead I am getting sequence IDs that cannot be found in the SILVA reference database (e.g. GYJUndar, UncCy339). The same thing happens when I add the '--search' flag.

Is there a way to get the sequences identified in the align_filter_slv column in the alignment file with the query sequence? (Or get information on name matching if this is a formatting issue?)

Thank you very much for your software!

epruesse commented 3 years ago

Those are the "ARB names". Each sequence in ARB has a couple of meta-data fields: "acc" holds the accession number and "name" holds the name that you are seeing. It's an ID generated from the sequence description ("UncCy339" will be something uncultured) such that it's unique for accession + start position (to account for genomes with multiple 16S).

In theory, you should be able to export the accession into the CSV using -f acc. In practice that doesn't seem to be working. I'll mark this as a bug. Also, the accession should always be listed in the CSV, I think.

glajoie1 commented 3 years ago

Ok - thank you for the information. The accession was not listed in the CSV, so I generated a mapping file of the ARB names to the SILVA accession numbers and taxonomy through the ARB software, using the SILVA_138.1_SSURef_NR99_12_06_20_opt.arb database.
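
For anyone who needs the same workaround, here is a minimal sketch of how such a mapping could be applied to the aligned FASTA. It assumes the mapping was exported as tab-separated text with a header row containing `name` and `acc` columns; that layout, the function names and the IDs in the usage note are all made up for illustration and are not something SINA or ARB produces by default.

```python
import csv
import io


def load_mapping(tsv_text):
    """Parse tab-separated text with a header row containing 'name' and 'acc'.

    The column names are an assumption about how the mapping was exported.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["name"]: row["acc"] for row in reader}


def relabel_fasta(fasta_text, mapping):
    """Replace ARB names in FASTA headers with accessions where a mapping exists.

    Headers without a mapping entry (e.g. the query sequences) are left as-is.
    The original ARB name is kept in the header as 'arb_name=...'.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            rest = parts[1] if len(parts) > 1 else ""
            acc = mapping.get(seq_id)
            if acc is not None:
                line = f">{acc} arb_name={seq_id}" + (f" {rest}" if rest else "")
        out.append(line)
    return "\n".join(out) + "\n"
```

With a mapping of `ArbName1` to `ACC00001.1` (both placeholders), a header `>ArbName1` becomes `>ACC00001.1 arb_name=ArbName1`, while unmapped headers stay unchanged.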

epruesse commented 3 years ago

Just be aware you might get dups on the acc alone. In SILVA, acc + start uniquely identify an SSU/LSU sequence, with start being the position of the sequence's first base within the sequence of its accession number.
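
A minimal sketch of what that means for a mapping file: key on acc + start rather than acc alone, and check which accessions occur more than once. It assumes the mapping is available as (ARB name, acc, start) tuples; the function names, the `:` separator and the records in the usage note are made up for illustration.

```python
from collections import Counter


def duplicated_accessions(records):
    """Return the accessions that occur more than once in the records.

    records: iterable of (arb_name, acc, start) tuples (assumed layout).
    """
    counts = Counter(acc for _, acc, _ in records)
    return sorted(acc for acc, n in counts.items() if n > 1)


def unique_labels(records):
    """Map each ARB name to an 'acc:start' label.

    The ':' separator is an arbitrary choice for this sketch, not a SILVA format.
    """
    return {name: f"{acc}:{start}" for name, acc, start in records}
```

For example, with the placeholder records `("NameA", "ACC1", 1)`, `("NameB", "ACC1", 5000)` and `("NameC", "ACC2", 1)`, `duplicated_accessions` reports `ACC1`, while `unique_labels` still gives each ARB name a distinct label.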