SACGF / variantgrid

VariantGrid public repo

Sync - check for any Classifications that have historically matched #1142

Open davmlaw opened 3 weeks ago

davmlaw commented 3 weeks ago

If a Classification is changed such that its latest ClassificationModification is no longer caught by the Syncrunner, it won't send an update. An example for the somatic Shariant upload: a record is changed to Somatic (and gets uploaded), then changed to not-somatic, so the later change is never sent.

I think if a sync destination has ever sent a classification, it should be responsible for it for all time. So we need to add that to the filters

in sync.shariant.variant_grid_upload.VariantGridUploadSyncer.records_to_sync

Instead of:

    qs = qs.filter(q)

You go:

    already_sync_q = Q(classification__classificationmodification__classificationmodificationsyncrecord__run__destination=self.sync_destination)
    qs = qs.filter(q | already_sync_q)
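
To show the intent without the Django models, here is a minimal pure-Python sketch of the rule: a destination keeps syncing anything it has ever sent, plus whatever matches its filter right now. The `Record` class, its fields and the `vc1`..`vc3` ids are hypothetical stand-ins, not VariantGrid code. The de-duplication step is there because a filter that joins through sync records can return the same row more than once, so the real queryset may also want `.distinct()`.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a ClassificationModification."""
    lab_record_id: str
    matches_filter: bool  # does the latest version pass the destination's filter?
    sent_to: set = field(default_factory=set)  # destinations that have ever sent it


def records_to_sync(records, destination):
    """Return records that match the filter now OR were ever sent by this destination."""
    seen = set()
    result = []
    for r in records:
        wanted = r.matches_filter or destination in r.sent_to
        if wanted and r.lab_record_id not in seen:
            seen.add(r.lab_record_id)  # de-duplicate, like .distinct()
            result.append(r)
    return result


records = [
    Record("vc1", matches_filter=True),
    # Was somatic when uploaded, has since been changed to not-somatic
    Record("vc2", matches_filter=False, sent_to={"shariant_upload_somatic"}),
    Record("vc3", matches_filter=False),
]
to_sync = [r.lab_record_id for r in records_to_sync(records, "shariant_upload_somatic")]
print(to_sync)
```

In this sketch `vc2` is still picked up even though it no longer matches the filter, which is the behaviour the change above is after.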

Dave to go back to SA Path and check if this has happened to any existing records and report back here

davmlaw commented 3 weeks ago

@TheMadBug has the new germline uploader become much more strict??

    from classification.models import ClassificationModification, ShareLevel
    from django.db.models import Q

    from sync.models import SyncDestination
    from sync.shariant.shariant_upload import ClassificationUploader, QueryJsonFilter

    for sd in SyncDestination.objects.filter(config__direction='upload', enabled=True):
        # Any modification of the classification has ever been synced by this destination
        already_sync_q = Q(classification__classificationmodification__classificationmodificationsyncrecord__run__destination=sd)
        uploader = ClassificationUploader(sd)
        # Base query, before any destination-specific filters are applied
        qs = ClassificationModification.objects.filter(
            is_last_published=True,
            share_level__in=ShareLevel.DISCORDANT_LEVEL_KEYS,
            classification__lab__group_name__in=uploader.lab_mappings.keys(),
        )
        # Destination-specific filter
        q = QueryJsonFilter.classification_value_filter().convert_to_q(uploader.filters)
        # Previously synced, but no longer matching the destination filter
        prev_not_current_sync_qs = qs.filter(already_sync_q).exclude(q).distinct()
        current_sync_qs = qs.filter(q).distinct()
        print(f"{sd} - current sync: {current_sync_qs.count()}, historical not current: {prev_not_current_sync_qs.count()}")

        lab_records = ','.join(cm.classification.lab_record_id for cm in prev_not_current_sync_qs)
        print(f"Lab records: {lab_records}")

Output:

    shariant_upload - current sync: 157, historical not current: 20
    Lab records: vc15228,vc13072,vc13075,vc13113,vc15801,vc32032,vc36255,vc36511,vc29980,vc23033,vc15750,vc15833,vc23136,vc15230,vc50244,vc50339,vc51761,vc51760,vc51556,vc30255
    shariant_upload_somatic - current sync: 7, historical not current: 2
    Lab records: vc15911,vc15731

The two shariant_upload_somatic ones are the wrong ones that got withdrawn.

TheMadBug commented 3 weeks ago

Yes, the somatic/germline filter was updated to exclude records that haven't provided an allele origin at all (there's talk to make it mandatory within SA Path but we still need to organise an official convo with the lab heads about that).
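
As a rough illustration of that stricter behaviour (the function name, the allele origin values and the mapping to destinations are assumptions for the sketch, not the actual filter config): a record with no allele origin now matches neither upload.

```python
from typing import Optional


def upload_destination(allele_origin: Optional[str]) -> Optional[str]:
    """Hypothetical routing: which upload (if any) would pick up a record."""
    if not allele_origin:
        # Missing or blank allele origin: excluded by both filters
        return None
    if allele_origin == "somatic":
        return "shariant_upload_somatic"
    if allele_origin == "germline":
        return "shariant_upload"
    # Other values are not handled in this sketch
    return None


print(upload_destination(None))
print(upload_destination("somatic"))
```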

re shariant_upload - current_sync_count=156, including historical sync would add: 2391

Are 2391 records for SA Path without an allele origin?

davmlaw commented 3 weeks ago

@TheMadBug - I updated the counts. I think I messed up the queries - they are much lower now

Given that the current upload sync total is 157 + 7 = 164, it looks like we're missing a lot of historical records that we have uploaded.

This looks to be part of the original query (before any sync dest filters are applied) - I'm not sure if it's due to labs changing or whatever, but it's possible those records are being updated and the updates wouldn't be sent.