RNAcentral / rnacentral-webcode

RNAcentral website source code
https://rnacentral.org
Apache License 2.0
31 stars 8 forks source link

Postgress database query question #634

Open genomicsguy opened 1 month ago

genomicsguy commented 1 month ago

Could you provide an example query for how you create the GFF or GTF download for the human annotations.

Please could you also confirm that https://rnacentral.org/static/img/rnacentral_latest_schema.png is the latest schema. For example, the schema shows "RnaPrecomputed" as the main table, yet the table available through the postgress interface is called "rnc_rna_precomputed".

Thank you!

genomicsguy commented 1 month ago

Here another example. You have two tables "rnc_sequence_regions_active_mapped" and "rnc_sequence_regions_active_provided" within the rnacen schema. I am assuming one or both of these is related to "SequenceRegionActive" in the online schema, but it is not clear which, and the tables contain a column not listed online, "providing_databases."

blakesweeney commented 1 month ago

Hi, the query we use is:

SELECT
  json_build_object(
      'assembly_id', :'assembly_id',
      'region_id', max(regions.region_name),
      'rna_id', max(pre.id),
      'description', max(pre.short_description),
      'rna_type',  max(pre.rna_type),
      'databases', regexp_split_to_array(max(pre."databases"), ','),
      'providing_databases', array_agg(ac.database),
      'chromosome', max(regions.chromosome),
      'strand', max(regions.strand),
      'identity', max(regions.identity),
      'was_mapped', bool_or(regions.was_mapped),
      'exons', array_agg(distinct exons.*)
  )
FROM rnc_rna_precomputed pre
JOIN rnc_sequence_regions_active regions
ON
  regions.urs_taxid = pre.id
JOIN rnc_sequence_exons exons
ON
  exons.region_id = regions.id
join rnc_accession_sequence_region sra
    on sra.region_id = regions.id
join rnc_accessions ac
    on sra.accession = ac.accession
WHERE
  pre.is_active = true
  AND regions.assembly_id = :'assembly_id'
GROUP BY regions.id
ORDER BY max(regions.chromosome), max(regions.region_start), regions.id

The result of this query is turned into GFF/BED as needed. In the database we have a table of provided coordinates, and of mapped coordinates. We combine these to produce the final coordinates. Note that the table name in the schema diagram follows python naming conventions, while the table names do not. We probably should update the diagram to indicate table names, but they are generally a simple modification of the table names.