almeidasilvaf / syntenet

An R package to infer and analyze synteny networks from protein sequences
https://almeidasilvaf.github.io/syntenet/

unique species id in `process_input()` #15

Closed iaindhay closed 1 year ago

iaindhay commented 1 year ago

Is there any way to customize the renaming step of process_input(), where it adds a unique species identifier to sequence names? For example, could it use a genome accession (e.g. XX_000000) as the unique species ID?

almeidasilvaf commented 1 year ago

Hi, @iaindhay

The IDs are created from the list names, and they must be 3-5 characters long. Under the hood, the function that creates the IDs (create_species_id_table(), see its documentation) takes the list names and extracts the first 3 characters. If that produces repeated IDs, it tries the first 4 characters; if IDs are still repeated, it tries the first 5; and if even 5 characters yield repeated IDs, it uses 4 characters plus a number.

For example, suppose your list names are:

> names(seq)
[1] "Arabidopsis_thaliana" "Arabidopsis_lyrata" "Brassica_rapa"

In this case, even if we try to use the first 5 characters, there would be repeated IDs ("Arabi" twice). Then, the function adds numbers to distinguish IDs. The unique IDs in this case would be c("Arabi", "Arab2", "Brass").
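To make the fallback order concrete, here is a minimal sketch of the logic described above. It is not syntenet's source code: sketch_species_ids() is a hypothetical name, and the numbering rule in the last step (keep the first occurrence of a 5-character prefix, give later occurrences 4 characters plus a running number) is an assumption chosen only because it reproduces the example output in this reply.

```r
# Hypothetical re-implementation for illustration; NOT syntenet's own code.
sketch_species_ids <- function(list_names) {
    # Try prefixes of 3, 4, then 5 characters; return the first unique set
    for (n in 3:5) {
        ids <- substr(list_names, 1, n)
        if (!anyDuplicated(ids)) {
            return(ids)
        }
    }
    # Assumed fallback: `ids` now holds the (non-unique) 5-character prefixes.
    # Number each occurrence within its prefix group (1, 2, ...).
    occurrence <- ave(seq_along(ids), ids, FUN = seq_along)
    repeated <- occurrence > 1
    # Later occurrences get 4 characters + their occurrence number
    ids[repeated] <- paste0(
        substr(list_names[repeated], 1, 4), occurrence[repeated]
    )
    ids
}

sketch_species_ids(
    c("Arabidopsis_thaliana", "Arabidopsis_lyrata", "Brassica_rapa")
)
# [1] "Arabi" "Arab2" "Brass"
```

The sketch does not check whether a numbered ID collides with another species' 5-character prefix, so treat it as a reading aid rather than a replacement for create_species_id_table().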

That said, if you want to use custom IDs, you can set them as the list names (which create_species_id_table() will use to create the IDs), but bear in mind that only the first 3-5 characters will be used. If you want to use genome accessions with more than 5 characters, you will not be able to use the entire accession as the ID.
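One possible workaround, sketched below under stated assumptions, is to rename the lists with short codes of your own and keep a lookup table that maps each code back to its accession. The accessions, the short codes, and the placeholder list contents are all made up for illustration; in a real analysis, seq and annotation would be the sequence and annotation lists passed to process_input().

```r
# Placeholder lists standing in for the real sequence and annotation lists.
# The accession-style names are hypothetical.
seq <- list(GCA_000001.1 = "seqs_A", GCA_000002.1 = "seqs_B")
annotation <- list(GCA_000001.1 = "ranges_A", GCA_000002.1 = "ranges_B")

# Accessions that share a prefix cannot be told apart by their
# first 5 characters, so they would not survive as IDs unchanged
anyDuplicated(substr(names(seq), 1, 5)) > 0
# [1] TRUE

# User-chosen short codes (3-5 characters), with a lookup to the accessions
id_lookup <- data.frame(
    short_id = c("spA", "spB"),
    accession = names(seq),
    stringsAsFactors = FALSE
)

# Rename both lists consistently so their names still match each other
names(seq) <- id_lookup$short_id
names(annotation) <- id_lookup$short_id[
    match(names(annotation), id_lookup$accession)
]

names(seq)
# [1] "spA" "spB"
```

After this, the short codes are already unique at 3 characters, and id_lookup can be used to translate IDs back to the full accessions in downstream results.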

Best, Fabricio