I've looked into this a bit, and this query may be problematic/slow when there are many records.
The current query looks something like:
SELECT
Sample.data AS "Sample_data",
Sample_entity_identifiers.identifier AS "Sample_root_identifier",
fastaSequence.data AS "fastaSequence_data",
fastaSequence_entity_identifiers.identifier AS "fastaSequence_root_identifier",
fastqMetadata.data AS "fastqMetadata_data",
fastqMetadata_entity_identifiers.identifier AS "fastqMetadata_root_identifier",
sample_photos.data AS "sample_photos_data",
sample_photos_entity_identifiers.identifier AS "sample_photos_root_identifier"
FROM project_1.Sample AS Sample
JOIN expeditions ON expeditions.id = Sample.expedition_id
LEFT JOIN project_1.fastaSequence AS fastaSequence
ON fastaSequence.parent_identifier = Sample.local_identifier AND fastaSequence.expedition_id = Sample.expedition_id
LEFT JOIN project_1.fastqMetadata AS fastqMetadata
ON fastqMetadata.parent_identifier = Sample.local_identifier AND fastqMetadata.expedition_id = Sample.expedition_id
LEFT JOIN project_1.sample_photos AS sample_photos
ON sample_photos.parent_identifier = Sample.local_identifier AND sample_photos.expedition_id = Sample.expedition_id
LEFT JOIN entity_identifiers AS Sample_entity_identifiers
ON Sample_entity_identifiers.expedition_id = Sample.expedition_id AND
Sample_entity_identifiers.concept_alias = 'Sample'
LEFT JOIN entity_identifiers AS fastaSequence_entity_identifiers
ON fastaSequence_entity_identifiers.expedition_id = fastaSequence.expedition_id AND
fastaSequence_entity_identifiers.concept_alias = 'fastaSequence'
LEFT JOIN entity_identifiers AS fastqMetadata_entity_identifiers
ON fastqMetadata_entity_identifiers.expedition_id = fastqMetadata.expedition_id AND
fastqMetadata_entity_identifiers.concept_alias = 'fastqMetadata'
LEFT JOIN entity_identifiers AS sample_photos_entity_identifiers
ON sample_photos_entity_identifiers.expedition_id = sample_photos.expedition_id AND
sample_photos_entity_identifiers.concept_alias = 'sample_photos'
WHERE expeditions.expedition_code = 'AGINA' AND Sample.local_identifier = 'Aleu45'
The issue I see when running explain analyze is that some nested loops are 1 time for each row returned (whichever child has the most rows).
This may be an issue if there are say 1000 tissues for a single sample.
Another way this query could be written is using a UNION...
SELECT
Sample.data AS "data",
Sample_entity_identifiers.identifier AS "root_identifier",
'Sample' AS "entity"
FROM project_1.Sample AS Sample
JOIN expeditions ON expeditions.id = Sample.expedition_id
LEFT JOIN entity_identifiers AS Sample_entity_identifiers
ON Sample_entity_identifiers.expedition_id = Sample.expedition_id AND
Sample_entity_identifiers.concept_alias = 'Sample'
WHERE expeditions.expedition_code = 'AGINA' AND Sample.local_identifier = 'Aleu45'
UNION SELECT
fastaSequence.data AS "data",
fastaSequence_entity_identifiers.identifier AS "root_identifier",
'fastaSequence' AS "entity"
FROM project_1.fastaSequence AS fastaSequence
JOIN expeditions ON expeditions.id = fastaSequence.expedition_id
LEFT JOIN entity_identifiers AS fastaSequence_entity_identifiers
ON fastaSequence_entity_identifiers.expedition_id = fastaSequence.expedition_id AND
fastaSequence_entity_identifiers.concept_alias = 'fastaSequence'
WHERE expeditions.expedition_code = 'AGINA' AND fastaSequence.parent_identifier = 'Aleu45'
UNION SELECT
sample_photos.data AS "data",
sample_photos_entity_identifiers.identifier AS "root_identifier",
'sample_photos' AS "entity"
FROM project_1.sample_photos AS sample_photos
JOIN expeditions ON expeditions.id = sample_photos.expedition_id
LEFT JOIN entity_identifiers AS sample_photos_entity_identifiers
ON sample_photos_entity_identifiers.expedition_id = sample_photos.expedition_id AND
sample_photos_entity_identifiers.concept_alias = 'sample_photos'
WHERE expeditions.expedition_code = 'AGINA' AND sample_photos.parent_identifier = 'Aleu45'
UNION SELECT
fastqMetadata.data AS "data",
fastqMetadata_entity_identifiers.identifier AS "root_identifier",
'fastqMetadata' AS "entity"
FROM project_1.fastqMetadata AS fastqMetadata
JOIN expeditions ON expeditions.id = fastqMetadata.expedition_id
LEFT JOIN entity_identifiers AS fastqMetadata_entity_identifiers
ON fastqMetadata_entity_identifiers.expedition_id = fastqMetadata.expedition_id AND
fastqMetadata_entity_identifiers.concept_alias = 'fastqMetadata'
WHERE expeditions.expedition_code = 'AGINA' AND fastqMetadata.parent_identifier = 'Aleu45';
This query doesn't seem to loop as many times, but isn't as performant for smaller datasets due to the unions instead of the joins.
This issue is to further investigate this when we have a larger dataset to experiment with.
I've looked into this a bit, and this query may be problematic/slow when there are many records.
The current query looks something like:
The issue I see when running
explain analyze
is that some nested loops are 1 time for each row returned (whichever child has the most rows).This may be an issue if there are say 1000 tissues for a single sample.
Another way this query could be written is using a
UNION
...This query doesn't seem to loop as many times, but isn't as performant for smaller datasets due to the unions instead of the joins.
This issue is to further investigate this when we have a larger dataset to experiment with.