RNAcentral / rnacentral-import-pipeline

RNAcentral data import pipeline
Apache License 2.0

Delete all entries in rnc_secondary_structure #148

Open blakesweeney opened 2 years ago

blakesweeney commented 2 years ago

This table stores the metadata about results (2D pairs, score, model, coordinates, etc.) and it should be emptied out. R2DT has changed, and we may (and likely will) have some sequences without hits, so the old rows for those sequences would not be overwritten by a rescan. The only way to ensure we don't have mixed data is to remove all the old data.

afg1 commented 2 years ago

There are two tables that look like they do the same(ish) thing: rnc_secondary_structure contains the structure, md5 and accession, but rnc_secondary_structure_layout contains everything to do with the model hits etc. I'm guessing we want to empty the ..layout table? The structures don't appear to match between the two tables either?

blakesweeney commented 2 years ago

So this is a good chance to fix an issue with our database naming. The rnc_secondary_structure table is the result of getting 2D structures when parsing. I think only 1 or 2 databases (GtRNAdb, CRW) provide it. This table is also likely not to be updated now that we have R2DT. I'd have to check the pipeline to confirm though. This one does not need to be deleted for this task.

The layout table is the one that is a result of R2DT and is the one to be emptied. It would probably be worthwhile to rename the tables to reflect the differences between them. Maybe some name prefixed with r2dt? I'll leave it up to you and @carlosribas to decide on naming.
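
For reference, a rename along these lines could be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for the production database, the new name r2dt_results and the columns urs and model_name are hypothetical placeholders (no name has been decided), and the helper functions are invented for this example.

```python
import sqlite3


def rename_table(conn, old_name, new_name):
    """Rename a table in place; its rows stay attached to the new name."""
    # Identifiers cannot be bound as query parameters, so pass trusted names only.
    conn.execute(f"ALTER TABLE {old_name} RENAME TO {new_name}")
    conn.commit()


def table_names(conn):
    """List the tables in this SQLite database (SQLite-specific catalog query)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


# Stand-in database; the columns are placeholders, not the real schema.
rename_conn = sqlite3.connect(":memory:")
rename_conn.execute(
    "CREATE TABLE rnc_secondary_structure_layout (urs TEXT, model_name TEXT)"
)
rename_conn.execute(
    "INSERT INTO rnc_secondary_structure_layout VALUES ('URS_EXAMPLE', 'model_x')"
)

# 'r2dt_results' is a hypothetical name chosen only for this sketch.
rename_table(rename_conn, "rnc_secondary_structure_layout", "r2dt_results")
renamed_tables = table_names(rename_conn)
kept_rows = rename_conn.execute("SELECT COUNT(*) FROM r2dt_results").fetchone()[0]
```

Any pipeline code or queries that refer to the old table name would need updating at the same time as a rename like this.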

afg1 commented 2 years ago

rnc_secondary_structure_layout has been backed up in rnc_secondary_structure_layout_backup and truncated.
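
A minimal sketch of that backup-then-empty step, for anyone repeating it. This is not the exact commands that were run: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for the production database, the columns urs and model_name are placeholders rather than the real schema, and the helper backup_and_empty is hypothetical. SQLite has no TRUNCATE statement, so an unqualified DELETE stands in for it here.

```python
import sqlite3


def backup_and_empty(conn, table, backup):
    """Copy every row of `table` into a new `backup` table, then empty `table`.

    Returns the number of rows that were backed up.
    """
    # Identifiers cannot be bound as query parameters, so pass trusted names only.
    conn.execute(f"CREATE TABLE {backup} AS SELECT * FROM {table}")

    original = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    copied = conn.execute(f"SELECT COUNT(*) FROM {backup}").fetchone()[0]
    if copied != original:
        # Refuse to delete anything if the backup is incomplete.
        raise RuntimeError(f"backup has {copied} rows, expected {original}")

    # Stand-in for TRUNCATE: remove all rows but keep the table definition.
    conn.execute(f"DELETE FROM {table}")
    conn.commit()
    return copied


# Stand-in database with two placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rnc_secondary_structure_layout (urs TEXT, model_name TEXT)")
conn.executemany(
    "INSERT INTO rnc_secondary_structure_layout VALUES (?, ?)",
    [("URS_EXAMPLE_1", "model_x"), ("URS_EXAMPLE_2", "model_y")],
)

rows_backed_up = backup_and_empty(
    conn,
    "rnc_secondary_structure_layout",
    "rnc_secondary_structure_layout_backup",
)
rows_left = conn.execute(
    "SELECT COUNT(*) FROM rnc_secondary_structure_layout"
).fetchone()[0]
rows_in_backup = conn.execute(
    "SELECT COUNT(*) FROM rnc_secondary_structure_layout_backup"
).fetchone()[0]
```

The row-count check before the delete is the point of the sketch: the old data is only removed once the backup is known to hold the same number of rows.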

I'll leave the issue open so we can figure out how to rename things when the rescan is done.