RNAcentral / rnacentral-sequence-search

RNAcentral sequence search cloud infrastructure
https://rnacentral.org/sequence-search
Apache License 2.0

Generate a representative subset of tRNAs #104

Open AntonPetrov opened 4 years ago

AntonPetrov commented 4 years ago

Some tRNA sequences get a very large number of hits, which causes problems for sequence search. For example, this sequence currently crashes it: GCGGAAGUAGUUCAGUGGUAGAACACCACCUUGCCAAGGUGGGGGUCGCGGGUUCGAAUCCCGUCUUCCGCUCCA.

This is not surprising given that we have >4 million tRNA sequences.

Let's try the same strategy that we used for rRNAs. For example, the following query whitelists ~110,000 tRNAs and excludes the millions of sequences found only in ENA or Rfam:

https://rnacentral.org/search?q=rna_type:%22tRNA%22%20and%20(expert_db:%22gtRNAdb%22%20or%20expert_db:%22refseq%22%20or%20expert_db:%22ensembl%22%20or%20expert_db:%22hgnc%22%20or%20expert_db:%22flybase%22%20or%20expert_db:%22wormbase%22%20or%20expert_db:%22pombase%22%20or%20expert_db:%22TAIR%22%20or%20expert_db:%22SGD%22%20or%20expert_db:%22MGI%22%20or%20expert_db:%22dictybase%22%20or%20expert_db:%22PDBe%22)
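For reference, here is a minimal Python sketch of how the query above could be assembled from the list of expert databases, so that the list is easy to adjust. The names `EXPERT_DBS` and `build_whitelist_url` are illustrative only and are not part of the RNAcentral codebase.

```python
from urllib.parse import quote

# Expert databases taken from the query above.
EXPERT_DBS = [
    "gtRNAdb", "refseq", "ensembl", "hgnc", "flybase", "wormbase",
    "pombase", "TAIR", "SGD", "MGI", "dictybase", "PDBe",
]


def build_whitelist_url(rna_type, expert_dbs):
    """Build an RNAcentral text search URL restricted to the given expert databases."""
    db_clause = " or ".join(f'expert_db:"{db}"' for db in expert_dbs)
    query = f'rna_type:"{rna_type}" and ({db_clause})'
    # Keep ':' and parentheses unescaped so the result matches the URL style used above.
    return "https://rnacentral.org/search?q=" + quote(query, safe=":()")


url = build_whitelist_url("tRNA", EXPERT_DBS)
print(url)
```

Adding or removing a database then only requires editing the list.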

@blakesweeney - would it be possible to create a set of, say, 5 whitelist-trna files and generate all-except-rrna-trna files instead of the all-except-rrna files?
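To illustrate the idea of splitting the whitelist into several files, here is a rough Python sketch that distributes records round-robin into a fixed number of chunks. This is an assumption about one possible approach, not a description of how the existing rRNA whitelist files are produced. `split_round_robin` and the example IDs are hypothetical.

```python
def split_round_robin(records, n_chunks):
    """Distribute records into n_chunks lists of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        # Record 0 goes to chunk 0, record 1 to chunk 1, and so on, wrapping around.
        chunks[index % n_chunks].append(record)
    return chunks


# Placeholder IDs for illustration only; real input would be the whitelisted tRNA entries.
example_ids = [f"trna_{i}" for i in range(12)]
chunks = split_round_robin(example_ids, 5)
for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number}: {len(chunk)} records")
```

Each chunk could then be written to its own whitelist-trna file, so that no single database file holds all ~110,000 sequences.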