Rfam / rfam-family-pipeline

Backend for the Rfam family building pipeline

Family inconsistency between rfam_live and SVN #58

Closed kalvari closed 3 years ago

kalvari commented 3 years ago

There is some inconsistency between the families existing in rfam_live database and the SVN repo, which would need to be resolved.

The corresponding Rfam family accessions are listed below:

These accessions belong to miRNA families, which were recently updated from miRBase.

For example, RF03551 represents miRNA mir-506, along with accessions RF03529 and RF01910, the latter of which is the old/initial accession assigned to this family. Attempting to re-commit a family with an ID that already exists in Rfam creates a new entry in rfam_live, but the SVN commit fails, resulting in "ghost families".

All associated entries can be found by executing the following query:

Select * from family where rfam_id='mir-506'

This query will return the following 3 accessions for mir-506: RF01910, RF03529 and RF03551.

Solution:

  1. Find all associated miRNA entries from rfam_live
  2. Checkout oldest families from the SVN using rfco.pl (e.g. RF01910)
  3. Replace old SEED with the updated one from miRBase
  4. Rerun rfsearch.pl followed by rfmake.pl (for thresholds see the relevant report)
  5. Update DESC with miRBase latest literature ref using add_ref.pl
  6. Ensure family passes QC using rqc-all.pl
  7. Recommit family back to Rfam using rfci.pl
  8. Delete redundant entries from rfam_live (e.g. RF03529, RF03551)

Note: rfkill.pl does not work in this case because there are no entries in the SVN repository for accessions RF03529 and RF03551. Hence the term "ghost families".

kalvari commented 3 years ago

@AntonPetrov This should be scheduled for release 14.6

kalvari commented 3 years ago

The following rfam_id values correspond to the families to fix, along with the number of associated accessions:

kalvari commented 3 years ago

Query to fetch miRNA families with >1 accessions:

select rfam_id, count(rfam_acc) as test from family
where type like '%miRNA%'
group by rfam_id
having test > 1;
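
To try the query above without access to rfam_live, here is a minimal sketch that runs it against an in-memory SQLite table. The three-column `family` table, the `type` strings and the `RF99999` row are assumptions made for illustration; the real database is not SQLite and has more columns. The function names are hypothetical.

```python
import sqlite3


def duplicated_mirna_ids(conn):
    """Return (rfam_id, accession count) for miRNA ids with more than one accession."""
    query = """
        SELECT rfam_id, COUNT(rfam_acc) AS n_acc
        FROM family
        WHERE type LIKE '%miRNA%'
        GROUP BY rfam_id
        HAVING COUNT(rfam_acc) > 1
        ORDER BY rfam_id
    """
    return conn.execute(query).fetchall()


def demo_connection():
    """Build a toy stand-in for the family table (assumed schema, toy rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE family (rfam_acc TEXT PRIMARY KEY, rfam_id TEXT, type TEXT)"
    )
    conn.executemany(
        "INSERT INTO family VALUES (?, ?, ?)",
        [
            ("RF01910", "mir-506", "Gene; miRNA;"),
            ("RF03529", "mir-506", "Gene; miRNA;"),
            ("RF03551", "mir-506", "Gene; miRNA;"),
            ("RF99999", "example-rna", "Gene; rRNA;"),  # hypothetical control row
        ],
    )
    return conn


if __name__ == "__main__":
    for rfam_id, n_acc in duplicated_mirna_ids(demo_connection()):
        print(rfam_id, n_acc)
```

With the toy rows above, only mir-506 is reported, with a count of 3.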

AntonPetrov commented 3 years ago

Note an abnormal subfolder mir-278/ in the following SVN directory: https://xfamsvn.ebi.ac.uk/svn/data_repos/trunk/Families/RF00729/

AntonPetrov commented 3 years ago

As indicated by Ioanna, this problem happens when a new family is committed with an existing ID. The pipeline adds the family to RfamLive but crashes while committing to the SVN, so the family is only in the database (and on the website) but not in the SVN (so there is no CM). This is how we get 👻 families.

I identified all the affected families using select rfam_id from family group by rfam_id having count(*) > 1; and went through them one by one.

I deleted all families that were only in database and not in the SVN using a new jiffy script kill_family_in_db_no_svn.pl which is essentially rfkill.pl that does not use any of the SVN perl classes. These IDs are in the dead_family table now.
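
The "in the database but not in the SVN" test can be sketched as a set difference. This is only an illustration of the idea, not the logic of kill_family_in_db_no_svn.pl. The function name is hypothetical, and the two input lists are assumed to have been collected beforehand (for example from the family table and from a listing of the SVN Families directory).

```python
def find_ghost_families(db_accessions, svn_accessions):
    """Return accessions that exist in the database but have no SVN entry."""
    return sorted(set(db_accessions) - set(svn_accessions))


if __name__ == "__main__":
    # Accessions taken from the mir-506 example in this issue.
    in_database = ["RF01910", "RF03529", "RF03551"]
    in_svn = ["RF01910"]
    print(find_ghost_families(in_database, in_svn))
```

For the mir-506 example this leaves RF03529 and RF03551 as the ghost entries to delete.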

I re-created the families using unique IDs where necessary and made sure that they were added both to the SVN and the database.

Once @nawrockie adds a QC that ensures that rfnew.pl will reject DESC files if an ID is already in the database, this problem should not happen again. 🤞
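
A rough sketch of what such a check could look like, assuming the DESC file carries the family ID on a line starting with "ID" followed by whitespace. This is not the QC that was added to the pipeline. The function names are hypothetical, and `existing_ids` is assumed to hold the rfam_id values already present in the database.

```python
import re


def desc_id(desc_text):
    """Return the value of the first 'ID' line in a DESC file, or None."""
    for line in desc_text.splitlines():
        match = re.match(r"^ID\s+(\S+)", line)
        if match:
            return match.group(1)
    return None


def check_id_is_new(desc_text, existing_ids):
    """Raise ValueError if the DESC ID is missing or already in use."""
    family_id = desc_id(desc_text)
    if family_id is None:
        raise ValueError("DESC has no ID line")
    if family_id in existing_ids:
        raise ValueError(f"ID '{family_id}' already exists in the database")
    return family_id


if __name__ == "__main__":
    existing = {"mir-506"}
    try:
        check_id_is_new("ID   mir-506\nDE   toy description\n", existing)
    except ValueError as err:
        print("rejected:", err)
```

Running the check before anything is written to the database means a clashing ID is rejected up front, rather than after the database insert has succeeded and the SVN commit has failed.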

AntonPetrov commented 3 years ago

A new QC has been added.