Russel88 / CRISPRCasTyper

CCTyper: Automatic detection and subtyping of CRISPR-Cas operons
https://typer.crispr.dk
MIT License
93 stars, 17 forks

Source for repeats.fa #54

Closed (qbilius closed 2 months ago)

qbilius commented 3 months ago

Hi,

Thanks for your great work!

I've been struggling to identify how the repeats.fa file was created. Say I wanted to identify the source assembly for

>V-A_862
TCTACAATAGTAGAAATTTAATATATCTGTTAGAC

But running a blastn search online with default parameters fails to return any exact matches.

The article states that the sources for repeats.fa are Makarova et al. (2020) and Pinilla-Redondo et al. (2019). Since the latter focuses on Type IV systems, I looked up Makarova's data sources, which seem to be solely from NCBI. blastn should therefore find matching repeats, but it doesn't.

Could you perhaps clarify where these repeat sequences came from? Perhaps there is some index file showing which organism / assembly, say, V-A_862 came from?

Thanks for your kind help, Jonas

Russel88 commented 2 months ago

Hi Jonas

No, it is not possible to link a repeat from repeats.fa to an assembly. The curated repeatTyper model was built from Makarova et al. (2020) and Pinilla-Redondo et al. (2019). But the model has been continuously updated, and repeats.fa has also been updated. It has been supplemented with repeats from GTDB genomes, Ensembl genomes, and data uploaded to the cctyper webserver, as stated in the paper. Data uploaded to the webserver is not saved (only the repeat, the subtype, and some information to avoid duplicates are saved), so the repeat could be from a non-public source.

/ Jakob
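
For readers who have a candidate assembly on disk and want to check it directly, here is a minimal sketch of an exact-match search on both strands. It is not part of CCTyper and not how repeats.fa was built; the function names (`parse_fasta`, `reverse_complement`, `find_exact`) and the toy contig are assumptions made for illustration. Checking the reverse complement matters because a repeat may be stored in either orientation.

```python
# Hypothetical helpers for illustration; not part of CCTyper.

def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # record id is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_exact(repeat, genome):
    """Return sorted (position, strand) pairs for exact matches of `repeat`.

    Positions are 0-based start coordinates on the forward strand of `genome`.
    A palindromic repeat is reported once per strand.
    """
    genome = genome.upper()
    hits = []
    for strand, query in (("+", repeat.upper()), ("-", reverse_complement(repeat))):
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)  # allow overlapping matches
    return sorted(hits)


# The repeat quoted in the question, searched in a made-up contig.
repeats = parse_fasta(">V-A_862\nTCTACAATAGTAGAAATTTAATATATCTGTTAGAC\n")
toy_contig = "AAAA" + repeats["V-A_862"] + "CCCC"  # assumption: toy data, not a real assembly
print(find_exact(repeats["V-A_862"], toy_contig))
```

An empty result for a given assembly only shows that this assembly does not contain the repeat verbatim. As noted above, the repeat may come from data that is not publicly available.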