Rfam / rfam-3d-seed-alignments

A pipeline for adding RNA 3D structures to Rfam seed alignments
Apache License 2.0

Prevent duplicates #2

Closed by AntonPetrov 2 years ago

AntonPetrov commented 2 years ago

For example, RF00013 (6S/SsrS RNA) contains URS0000CBFF5C_562/1-144 twice, but the two aligned rows are not identical.

AntonPetrov commented 2 years ago

Also check RF00003 where the same ID (URS000032B6B6_9606) occurs 4 times:

URS000032B6B6_9606/1-164                AUACUUACCUGGCAGG.GGAGA.UACC...AUGAUCAC.GAAG.GUGGU.UUUCCCA..GGG.CGAGGCUUAUCCAUU.....GCACUC.C....GG-AU.GUGCUGAC.........CCCUGCGAUUUCCCCAAA.uGUGG..GAA-ACUCG.ACUGCAUAAUUUGUGGUAG..UGGGGG.-ACUGCGUU..CGCGCUUUCCCCUG
#=GR URS000032B6B6_9606       3PGW_N_SS ...............................................................................................................................................................................................................
URS000032B6B6_9606/1-164                AUACUUACCUGGCAGG.GGAGA.UACC...AUGAUCAC.GAAG.GUGGU.UUUCCCA..GGG.CGAGGCUUAUCCAUU.....GCACUC.C....GG-AU.GUGCUGAC.........CCCUGCGAUUUCCCCAAA.uGUGG..GAA-ACUCG.ACUGCAUAAUUUGUGGUAG..UGGGGG.-ACUGCGUU..CGCGCUUUCCCCUG
#=GR URS000032B6B6_9606       3PGW_R_SS ...............................................................................................................................................................................................................
URS000032B6B6_9606/1-164                AUACUUACCUGGCAGG.GGAGA.UACC...AUGAUCAC.GAAG.GUGGU.UUUCCCA..GGG.CGAGGCUUAUCCAUU.....GCACUC.C....GG-AU.GUGCUGAC.........CCCUGCGAUUUCCCCAAA.uGUGG..GAA-ACUCG.ACUGCAUAAUUUGUGGUAG..UGGGGG.-ACUGCGUU..CGCGCUUUCCCCUG
#=GR URS000032B6B6_9606       6QX9_1_SS ............(((...((((.(.((...((............))))).))))..(..(((....(((.(((((....................)).)).).)))............))))..(.((((((........))..))).)..)...)))..................(((((..(.((((......)))).)))))).
URS000032B6B6_9606/1-164                AUACUUACCUGGCAGG.GGAGA.UACC...AUGAUCAC.GAAG.GUGGU.UUUCCCA..GGG.CGAGGCUUAUCCAUU.....GCACUC.C....GG-AU.GUGCUGAC.........CCCUGCGAUUUCCCCAAA.uGUGG..GAA-ACUCG.ACUGCAUAAUUUGUGGUAG..UGGGGG.-ACUGCGUU..CGCGCUUUCCCCUG
#=GR URS000032B6B6_9606       7B0Y_a_SS ............(((....(((...((....(............).))..)))......(((....(((.(((((....................)).)).).)))............))).....(((((((......)))..))).)......)))..................(((((..(.((((......)))).)))))).
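
A check for the repeated rows shown above could look roughly like the following. This is only a sketch, not the pipeline's actual code: the function name `find_duplicate_ids` is hypothetical, and it assumes a single-block (non-interleaved) Stockholm file, since interleaved files legitimately repeat IDs across blocks. The sequences in the example are shortened for readability and `SEQ_B/1-4` is a placeholder ID.

```python
from collections import Counter


def find_duplicate_ids(stockholm_text):
    """Return {sequence_id: count} for IDs that occur on more than one sequence line.

    Assumes a single-block (non-interleaved) Stockholm alignment.
    """
    counts = Counter()
    for line in stockholm_text.splitlines():
        line = line.strip()
        # Skip blank lines, markup lines (#=GF, #=GR, ...) and the // terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        seq_id = line.split()[0]
        counts[seq_id] += 1
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Shortened illustration based on the RF00003 excerpt; SEQ_B is a placeholder.
example = """# STOCKHOLM 1.0
URS000032B6B6_9606/1-164  AUAC
#=GR URS000032B6B6_9606 3PGW_N_SS ....
URS000032B6B6_9606/1-164  AUAC
#=GR URS000032B6B6_9606 3PGW_R_SS ....
SEQ_B/1-4                 GGCC
//
"""

print(find_duplicate_ids(example))  # {'URS000032B6B6_9606/1-164': 2}
```

Counting by the full name (including the /start-end range) also flags rows whose aligned text differs, as in the RF00013 case.
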

AntonPetrov commented 2 years ago

In file RF00015 check ID URS00001143F5.

AntonPetrov commented 2 years ago

There are 2 problems in the RF01763 alignment:

1. #=GR URS0000CBFF2B_2021 should be #=GR URS0000CBFF2B_2021/1-41
2. There are duplicate IDs URS0000CBFF2B_2021/1-41 because all the different PDBs have the same sequence.
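
The mismatched #=GR name could be caught by comparing every #=GR sequence name against the names on the sequence lines. This is a rough sketch under assumptions: `find_unmatched_gr_names` is a hypothetical helper, not part of the pipeline, and the feature name `XXXX_A_SS` and the sequence text in the example are placeholders.

```python
def find_unmatched_gr_names(stockholm_text):
    """Return sorted #=GR sequence names that match no sequence line in the alignment."""
    seq_names = set()
    gr_names = []
    for line in stockholm_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "//":
            continue
        if fields[0] == "#=GR" and len(fields) >= 2:
            # Layout of a #=GR line: "#=GR <seqname> <feature> <annotation>"
            gr_names.append(fields[1])
        elif not fields[0].startswith("#"):
            # Anything that is not markup is treated as a sequence line.
            seq_names.add(fields[0])
    return sorted({name for name in gr_names if name not in seq_names})


# Placeholder data: only the sequence ID comes from the RF01763 report.
example = """# STOCKHOLM 1.0
URS0000CBFF2B_2021/1-41  ACGU
#=GR URS0000CBFF2B_2021 XXXX_A_SS ....
//
"""

print(find_unmatched_gr_names(example))  # ['URS0000CBFF2B_2021']
```

Once the #=GR name carries the /1-41 suffix, the function returns an empty list for this input.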