Closed iaindhay closed 1 year ago
Hi, @iaindhay
The IDs are created from list names, and they must be 3 to 5 characters long. Under the hood, the function that creates IDs (create_species_id_table(), see documentation here) takes the list names and extracts the first 3 characters; if that produces repeated IDs, it tries 4 characters, then 5; if there are still repeated IDs even with 5 characters, it uses 4 characters + numbers.
For example, suppose your list names are:
> names(seq)
[1] "Arabidopsis_thaliana" "Arabidopsis_lyrata" "Brassica_rapa"
In this case, even the first 5 characters would give repeated IDs ("Arabi" twice), so the function adds numbers to distinguish them. The unique IDs in this case would be c("Arabi", "Arab2", "Brass").
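To make the fallback order concrete, here is a minimal base-R sketch of the logic described above. It is not the package's actual code: `sketch_species_ids()` is a hypothetical helper, and the final step (keep the 5-character prefix for the first occurrence, then use 4 characters + a counter for repeats) is an assumption chosen only because it reproduces the example output.

```r
# Hypothetical helper that mimics the behaviour described in this thread.
# It is NOT create_species_id_table() itself.
sketch_species_ids <- function(list_names) {
    # Try prefixes of 3, then 4, then 5 characters; stop at the first
    # length that yields unique IDs.
    for (n in 3:5) {
        ids <- substr(list_names, 1, n)
        if (!anyDuplicated(ids)) {
            return(ids)
        }
    }

    # Still ambiguous at 5 characters (assumed fallback): the first
    # occurrence keeps its 5-character prefix, repeats get 4 characters
    # plus their occurrence number.
    ids5 <- substr(list_names, 1, 5)
    ids4 <- substr(list_names, 1, 4)
    occurrence <- ave(seq_along(ids5), ids5, FUN = seq_along)
    ifelse(occurrence == 1, ids5, paste0(ids4, occurrence))
}

sketch_species_ids(
    c("Arabidopsis_thaliana", "Arabidopsis_lyrata", "Brassica_rapa")
)
```

With the three names from the example, the prefixes of length 3, 4, and 5 all collide for the two Arabidopsis entries, so the sketch falls through to the numbered form. This simplified version does not check whether a numbered ID happens to collide with another name.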
That said, if you want to use custom IDs, you can use them as list names (which will be used by create_species_id_table() to create the IDs), but bear in mind that only the first 3-5 characters will be used. If you want to use genome accessions with more than 5 characters, you will not be able to use the entire accession as IDs.
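As a rough illustration of the workaround, the list can be renamed with short custom names before it is passed on. The sketch below uses only base R; the list contents, the custom names, and the accession strings are hypothetical placeholders, not real data.

```r
# Stand-in for the real list of sequences (placeholder contents).
seq <- list(
    Arabidopsis_thaliana = c("gene1", "gene2"),
    Arabidopsis_lyrata   = c("gene1", "gene2"),
    Brassica_rapa        = c("gene1", "gene2")
)

# Custom names that are already unique within their first 3 characters,
# so, per the behaviour described above, they would be kept unchanged.
names(seq) <- c("Ath", "Aly", "Bra")
custom_ids <- substr(names(seq), 1, 3)

# Hypothetical accessions longer than 5 characters: their 5-character
# prefixes are identical, so they cannot serve as full IDs.
accessions <- c("XX_000001", "XX_000002")
accession_prefixes <- substr(accessions, 1, 5)
```

Here `custom_ids` equals the names that were assigned, while both entries of `accession_prefixes` are "XX_00", which is why long accessions cannot be used in full.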
Best, Fabricio
Is there any way to customize the renaming step of process_input(), where it adds a unique species identifier to sequence names? For example, could it use a genome accession (e.g. XX_000000) as the unique species ID?