biocodellc / biocode-fims-commons

Biocode Field Information Management System
3 stars 0 forks source link

Investigate RecordService.get query when includeChildren = true #46

Open rodney757 opened 6 years ago

rodney757 commented 6 years ago

I've looked into this a bit, and this query may be problematic/slow when there are many records.

The current query looks something like:

SELECT
  Sample.data                                 AS "Sample_data",
  Sample_entity_identifiers.identifier        AS "Sample_root_identifier",
  fastaSequence.data                          AS "fastaSequence_data",
  fastaSequence_entity_identifiers.identifier AS "fastaSequence_root_identifier",
  fastqMetadata.data                          AS "fastqMetadata_data",
  fastqMetadata_entity_identifiers.identifier AS "fastqMetadata_root_identifier",
  sample_photos.data                          AS "sample_photos_data",
  sample_photos_entity_identifiers.identifier AS "sample_photos_root_identifier"
FROM project_1.Sample AS Sample
  JOIN expeditions ON expeditions.id = Sample.expedition_id
  LEFT JOIN project_1.fastaSequence AS fastaSequence
    ON fastaSequence.parent_identifier = Sample.local_identifier AND fastaSequence.expedition_id = Sample.expedition_id
  LEFT JOIN project_1.fastqMetadata AS fastqMetadata
    ON fastqMetadata.parent_identifier = Sample.local_identifier AND fastqMetadata.expedition_id = Sample.expedition_id
  LEFT JOIN project_1.sample_photos AS sample_photos
    ON sample_photos.parent_identifier = Sample.local_identifier AND sample_photos.expedition_id = Sample.expedition_id
  LEFT JOIN entity_identifiers AS Sample_entity_identifiers
    ON Sample_entity_identifiers.expedition_id = Sample.expedition_id AND
       Sample_entity_identifiers.concept_alias = 'Sample'
  LEFT JOIN entity_identifiers AS fastaSequence_entity_identifiers
    ON fastaSequence_entity_identifiers.expedition_id = fastaSequence.expedition_id AND
       fastaSequence_entity_identifiers.concept_alias = 'fastaSequence'
  LEFT JOIN entity_identifiers AS fastqMetadata_entity_identifiers
    ON fastqMetadata_entity_identifiers.expedition_id = fastqMetadata.expedition_id AND
       fastqMetadata_entity_identifiers.concept_alias = 'fastqMetadata'
  LEFT JOIN entity_identifiers AS sample_photos_entity_identifiers
    ON sample_photos_entity_identifiers.expedition_id = sample_photos.expedition_id AND
       sample_photos_entity_identifiers.concept_alias = 'sample_photos'
WHERE expeditions.expedition_code = 'AGINA' AND Sample.local_identifier = 'Aleu45' 

The issue I see when running explain analyze is that some nested loops are 1 time for each row returned (whichever child has the most rows).

This may be an issue if there are say 1000 tissues for a single sample.

Another way this query could be written is using a UNION...

SELECT
  Sample.data                          AS "data",
  Sample_entity_identifiers.identifier AS "root_identifier",
  'Sample'                             AS "entity"
FROM project_1.Sample AS Sample
  JOIN expeditions ON expeditions.id = Sample.expedition_id
  LEFT JOIN entity_identifiers AS Sample_entity_identifiers
    ON Sample_entity_identifiers.expedition_id = Sample.expedition_id AND
       Sample_entity_identifiers.concept_alias = 'Sample'
WHERE expeditions.expedition_code = 'AGINA' AND Sample.local_identifier = 'Aleu45'
UNION SELECT
        fastaSequence.data                          AS "data",
        fastaSequence_entity_identifiers.identifier AS "root_identifier",
        'fastaSequence'                             AS "entity"
      FROM project_1.fastaSequence AS fastaSequence
        JOIN expeditions ON expeditions.id = fastaSequence.expedition_id
        LEFT JOIN entity_identifiers AS fastaSequence_entity_identifiers
          ON fastaSequence_entity_identifiers.expedition_id = fastaSequence.expedition_id AND
             fastaSequence_entity_identifiers.concept_alias = 'fastaSequence'
      WHERE expeditions.expedition_code = 'AGINA' AND fastaSequence.parent_identifier = 'Aleu45'
UNION SELECT
        sample_photos.data                          AS "data",
        sample_photos_entity_identifiers.identifier AS "root_identifier",
        'sample_photos'                             AS "entity"
      FROM project_1.sample_photos AS sample_photos
        JOIN expeditions ON expeditions.id = sample_photos.expedition_id
        LEFT JOIN entity_identifiers AS sample_photos_entity_identifiers
          ON sample_photos_entity_identifiers.expedition_id = sample_photos.expedition_id AND
             sample_photos_entity_identifiers.concept_alias = 'sample_photos'
      WHERE expeditions.expedition_code = 'AGINA' AND sample_photos.parent_identifier = 'Aleu45'
UNION SELECT
        fastqMetadata.data                          AS "data",
        fastqMetadata_entity_identifiers.identifier AS "root_identifier",
        'fastqMetadata'                             AS "entity"
      FROM project_1.fastqMetadata AS fastqMetadata
        JOIN expeditions ON expeditions.id = fastqMetadata.expedition_id
        LEFT JOIN entity_identifiers AS fastqMetadata_entity_identifiers
          ON fastqMetadata_entity_identifiers.expedition_id = fastqMetadata.expedition_id AND
             fastqMetadata_entity_identifiers.concept_alias = 'fastqMetadata'
      WHERE expeditions.expedition_code = 'AGINA' AND fastqMetadata.parent_identifier = 'Aleu45';

This query doesn't seem to loop as many times, but isn't as performant for smaller datasets due to the unions instead of the joins.

This issue is to further investigate this when we have a larger dataset to experiment with.