SACGF / variantgrid

VariantGrid public repo
Other
23 stars 2 forks source link

Partition locus/variant/annotation tables by contig #300

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

Working in branch partition_by_contig


Then add contig to every query, this would presumably speed up eg gene and variant queries (as we'd have much smaller table to scan)

Remember to look into the new Postgres partitions, if AWS supports it.

Variant - partition by alt = "=" (reference) may be useful

Otherwise you have to partition off a field in that table (ie no relations) so you can't eg partition variant by contig, unless you add contig to the table (duplicating 1/3 of Locus table)

Do at the same time as #12

davmlaw commented 3 years ago

AnalysisNodes could provide a set of contigs that they retrieve data from, then you can always pass that into querysets - so the optimizer can restrict to just those partitions

davmlaw commented 3 years ago

Would also be worth a test to see whether it's worth actually keeping Locus around, as maybe it's just easier to do a query with position / ref etc for the other alts

Django library to help:

https://django-postgres-extra.readthedocs.io/en/master/table_partitioning.html

davmlaw commented 3 years ago

Just going to do the absolute minimum to try these partitions out - will work out migration strategies later.

Had issue, which was resolved by upgrading to django-postgres-extra==2.0.2

Configured settings as per https://django-postgres-extra.readthedocs.io/en/master/installation.html OK

class Variant2(PostgresPartitionedModel):
    """ Plan is to create new version with same PK as old Variant """
    contig = models.ForeignKey(Contig, on_delete=CASCADE)
    position = models.IntegerField(db_index=True)
    ref = models.ForeignKey(Sequence, on_delete=CASCADE, related_name="+")
    alt = models.ForeignKey(Sequence, on_delete=CASCADE, related_name="+")

    class Meta:
        unique_together = ("contig", "position", "ref", "alt")

    # @see https://django-postgres-extra.readthedocs.io/en/master/table_partitioning.html
    class PartitioningMeta:
        method = PostgresPartitioningMethod.LIST
        key = ["contig_id"]

then

python3.8 manage.py pgmakemigrations

# Died with:

django.core.exceptions.ImproperlyConfigured: Model 'Variant2' is not properly configured to be partitioned. Field 'contig_id' in partitioning key ['contig_id'] is not a valid field on 'Variant2'.

This was due to contig being declared under partitioning_options - rearranged file to add:

                ('contig', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='snpdb.contig')),

Made some partitions - can do this dynamically doesn't have to be in data migration

def _variant_partitions(apps, schema_editor):
    Contig = apps.get_model("snpdb", "Contig")
    Variant2 = apps.get_model("snpdb", "Variant2")

    ASSEMBLED_MOLECULE = "AM"
    chrom_contigs = list(Contig.objects.filter(role=ASSEMBLED_MOLECULE).values_list("pk", flat=True))
    print(f"Creating partitions for {len(chrom_contigs)} chromosomes")
    for contig_id in chrom_contigs:
        schema_editor.add_list_partition(
            model=Variant2,
            name=f"contig_{contig_id}",
            values=[contig_id],
        )

Then copied variant into the new table:

insert into snpdb_variant2
(id, contig_id, position, ref_id, alt_id)
select snpdb_variant.id, contig_id, position, ref_id, alt_id from snpdb_variant
inner join snpdb_locus on (snpdb_locus.id = snpdb_variant.locus_id);

4,678,687 rows in 4m 6s

# To reset PK auto increment sequence
SELECT setval('snpdb_variant2_id_seq', (SELECT MAX(id) FROM snpdb_variant2));

Benchmark

select count(*) from snpdb_variant
nner join snpdb_locus on (snpdb_locus.id = snpdb_variant.locus_id)
where contig_id = 1;

279,909 in 1.527s, 1.453s, 1.408s

select count(*) from snpdb_variant2
where contig_id = 1;

279,909 in 42ms, 39ms, 37ms

OK, that is not a realistic benchmark but I'm happy!

davmlaw commented 3 years ago

More realistic test - retrieve the variants in a gene:

Old way

select count(*)
from
    "snpdb_variant"
inner join "snpdb_locus" on
    ("snpdb_variant"."locus_id" = "snpdb_locus"."id")
inner join "snpdb_contig" on
    ("snpdb_locus"."contig_id" = "snpdb_contig"."id")
inner join "snpdb_genomebuildcontig" on
    ("snpdb_contig"."id" = "snpdb_genomebuildcontig"."contig_id")
inner join "annotation_variantgeneoverlap_version_11" on
    ("snpdb_variant"."id" = "annotation_variantgeneoverlap_version_11"."variant_id")
left outer join "snpdb_variantallele" T11 on
    ("snpdb_variant"."id" = T11."variant_id")
where
    ("snpdb_genomebuildcontig"."genome_build_id" = 'GRCh38'
        and "annotation_variantgeneoverlap_version_11"."gene_id" in ('ENSG00000139618', '675'))

60ms, 66ms, 60ms

New way:

select count(*)
from snpdb_variant2
inner join "snpdb_contig" on
    ("snpdb_variant2"."contig_id" = "snpdb_contig"."id")
inner join "snpdb_genomebuildcontig" on
    ("snpdb_contig"."id" = "snpdb_genomebuildcontig"."contig_id")
inner join "annotation_variantgeneoverlap_version_11" on
    ("snpdb_variant2"."id" = "annotation_variantgeneoverlap_version_11"."variant_id")
where
    ("snpdb_genomebuildcontig"."genome_build_id" = 'GRCh38'
        and "annotation_variantgeneoverlap_version_11"."gene_id" in ('ENSG00000139618', '675')
        and snpdb_variant2.contig_id = 107)

14ms, 15ms 19ms

davmlaw commented 3 years ago

AnalysisNode

3 genes (STAG2, MPL, ZRSR2) on 2 contigs (1, X)

variant 178ms, 183ms, 176ms variant2 without contig joins - 1.479s, 1.462s, 1.484s variant2 with contig joins - 1.57s, 1.72s, 1.87s

Running query against snpdb_variant2 is much slower than snpdb_variant without changing query - I suspect this is due to (lack of) indexes - but that won't matter after we've changed and re-built them.

The gene based queries do almost all of their work joining to annotation_variantgeneoverlap - so to get a speed up we should also partition that by contig.

    from collections import defaultdict
    from annotation.models import ContigVariantAnnotationVersion, VariantGeneOverlap, VariantGeneOverlap2, \
        VariantAnnotationVersion
    version_contig = defaultdict(dict)
    for cvav in ContigVariantAnnotationVersion.objects.all():
        version_contig[cvav.version][cvav.contig_id] = cvav

    for vav in VariantAnnotationVersion.objects.all():
        vgo2_list = []
        contigs = version_contig[vav]
        vgo_qs = VariantGeneOverlap.objects.filter(version=vav)
        for annotation_run_id, variant_id, gene_id, contig_id in vgo_qs.values_list("annotation_run_id", "variant_id",
                                                                                    "gene_id",
                                                                                    "variant__locus__contig_id"):
            vgo2 = VariantGeneOverlap2(version=contigs[contig_id], annotation_run_id=annotation_run_id,
                                       variant_id=variant_id, gene_id=gene_id)
            vgo2_list.append(vgo2)
        print(f"Inserting for {len(vgo2_list)} {vav}")
        VariantGeneOverlap2.objects.bulk_create(vgo2_list, batch_size=2000)
davmlaw commented 3 years ago

You can't make a FK to an inherited table, tried and got:

IntegrityError: insert or update on table "analysis_allvariantsnode" violates foreign key constraint "analysis_allvariants_max_variant_id_d791bed5_fk_snpdb_var"
DETAIL:  Key (max_variant_id)=(6864990) is not present in table "snpdb_variant".

Looks like you just can’t do this in Postgres: https://stackoverflow.com/questions/24360312/foreign-keys-table-inheritance-in-postgresql

So, much easier to only use partitioning on "leaf" tables - ie that nobody references.

In PG12 you can make a FK to the new tables - from the postgres docs:

Limitations

Unique constraints (and hence primary keys) on partitioned tables must include all the partition key columns

So, any table has a composite primary key (2 columns), and thus any FK to it must also be a composite key of 2 columns. Django does not currently support this - and likely won't for a long time (has been open issue for a long time with no plans on even what to do)

There is a library, but it doesn't seem to work with Django 3+ but could try and patch it to work:

https://pypi.org/project/django-composite-foreignkey/

davmlaw commented 3 years ago

Tried partitioning VariantGeneOverlap by contig/version (made new model) but didn't seem to make any difference in performance (after adding filter to just the 2 I needed)

However, I did find that if you filter contig in locus, you get a big speed up, even without partitions, so probably best to do that 1st - see #451 - Filter by contig - we'll need to do that anyway to take advantage of partitions and it seems to work now

davmlaw commented 3 years ago

Looks like I can get Django Composite ForeignKey to work:

git clone https://github.com/onysos/django-composite-foreignkey
cd django-composite-foreignkey/
git fetch origin pull/18/head:django-latest
sudo python3.8 -m pip install .

Copied all 4M rows from variant into variant2

Then 1M rows into testsinglerelation(variant/score) and testrelation (2 FKs and score)

-- single relation - 489k rows in 1.8s -- composite - 489k rows in 3.449s

Seems like it spends time making a big query then going through all the partitions. If you include the contigs in the WHERE clause - it drops the search space massively, so it becomes comparable

-- single relation - 1.3s -- composite - 1.7s

The other thing is, those FKs are integers so don't have an index. I could try making that model managed=False then change integers to FKs and make an index

insert into snpdb_variant2 
(contig_id, position, ref_id, alt_id)
select sl.contig_id, sl.position, sl.ref_id, sv.alt_id
from snpdb_variant sv 
inner join snpdb_locus sl on (sl.id = sv.locus_id);

insert into snpdb_testrelated
(variant_pk, contig_pk, score)
select id, contig_id, floor(RANDOM()*100) from snpdb_variant2
order by id
limit 1000000;

insert into snpdb_testsinglerelated
(variant_id, score)
select id, floor(RANDOM()*100) from snpdb_variant
order by id
limit 1000000;
davmlaw commented 3 years ago

Not sure if django-composite-foreignkey actually make a proper Postgres FK - try doing it via postgres - https://stackoverflow.com/a/13133733/295724

Can we then make an index on that?

FOREIGN KEY (variant_pk, contig_id) REFERENCES variant2(id, contig_id)

Might be worth just adding it all everywhere and running benchmarks

Need to do data migration to add those contigs onto the table