clingen-data-model / clinvar-streams

1 stars 0 forks source link

Load truncated clinvar-raw stream #78

Closed theferrit32 closed 1 year ago

theferrit32 commented 1 year ago

@larrybabb has created a directory in cloud storage that contains created files for the 2023-02-08 release of ClinVar. It matches the directory structure of the dsp diff directories, so a release messages can be created for it.

One issue may be that the bigquery export to these files created a very large number of relatively small files, which could be an issue for creating the release message in kafka. The file listing is in the messages and I think there is a maximum message size. If this is a problem, then we will need to either combine these files into a smaller number, or change the release message format to enable just passing the root of the release's directory tree, and have the client application do the file listing.

I created some python code to do this but we can do it with the java libraries too https://github.com/clingen-data-model/clinvar-streams/blob/86896a6ceebc3ea30a2d0467f433efbedfd10c43/stream-repair/make-release-notification.py#L55-L76

The clinvar-raw stream reader code will need to be updated. Either add a flag to the notification message to say that it is a directory, not a file listing, or look at the file listing and see that it just has a directory in it. Probably adding a flag like directory_only: true, or file_listing_type: directory would be better. If not present, it would interpret files: [] as the file listing. Or maybe removing files: [<>… ] and replacing it with directory: "<>" would be better, and switch on the presence of these fields.

theferrit32 commented 1 year ago

Proposed message format:

{
    "release_date": "2023-02-08",
    "bucket": "clinvar-releases",
    "release_directory": "2023_03_08"
}

Current message format:

{
    "release_date": "2022-02-05",
    "bucket": "broad-dsp-monster-clingen-prod-ingest-results",
    "files": [
      "20220207T010000/clinical_assertion/created/000000000000",
      "20220207T010000/clinical_assertion/deleted/000000000000",
      "20220207T010000/clinical_assertion/updated/000000000000",
      "20220207T010000/clinical_assertion_observation/created/000000000000",
      "20220207T010000/clinical_assertion_observation/deleted/000000000000",
      "20220207T010000/clinical_assertion_observation/updated/000000000000",
      "20220207T010000/clinical_assertion_trait/created/000000000000",
      "20220207T010000/clinical_assertion_trait/deleted/000000000000",
      "20220207T010000/clinical_assertion_trait/updated/000000000000",
      "20220207T010000/clinical_assertion_trait_set/created/000000000000",
      "20220207T010000/clinical_assertion_trait_set/deleted/000000000000",
      "20220207T010000/clinical_assertion_trait_set/updated/000000000000",
      "20220207T010000/clinical_assertion_variation/created/000000000000",
      "20220207T010000/clinical_assertion_variation/deleted/000000000000",
      "20220207T010000/clinical_assertion_variation/updated/000000000000",
      "20220207T010000/gene/created/000000000000",
      "20220207T010000/gene/deleted/000000000000",
      "20220207T010000/gene/updated/000000000000",
      "20220207T010000/gene_association/created/000000000000",
      "20220207T010000/gene_association/deleted/000000000000",
      "20220207T010000/gene_association/updated/000000000000",
      "20220207T010000/rcv_accession/created/000000000000",
      "20220207T010000/rcv_accession/deleted/000000000000",
      "20220207T010000/rcv_accession/updated/000000000000",
      "20220207T010000/release_date.txt",
      "20220207T010000/submission/created/000000000000",
      "20220207T010000/submission/deleted/000000000000",
      "20220207T010000/submission/updated/000000000000",
      "20220207T010000/submitter/created/000000000000",
      "20220207T010000/submitter/deleted/000000000000",
      "20220207T010000/submitter/updated/000000000000",
      "20220207T010000/trait/created/000000000000",
      "20220207T010000/trait/deleted/000000000000",
      "20220207T010000/trait/updated/000000000000",
      "20220207T010000/trait_mapping/created/000000000000",
      "20220207T010000/trait_mapping/deleted/000000000000",
      "20220207T010000/trait_mapping/updated/000000000000",
      "20220207T010000/trait_set/created/000000000000",
      "20220207T010000/trait_set/deleted/000000000000",
      "20220207T010000/trait_set/updated/000000000000",
      "20220207T010000/variation/created/000000000000",
      "20220207T010000/variation/deleted/000000000000",
      "20220207T010000/variation/updated/000000000000",
      "20220207T010000/variation_archive/created/000000000000",
      "20220207T010000/variation_archive/deleted/000000000000",
      "20220207T010000/variation_archive/updated/000000000000"
    ]
  }

We can also then add diff releases DSP has already processed after:

{
    "release_date": "2022-02-01",
    "bucket": "clinvar-releases",
    "release_directory": "2022-02-01"
}
{
    "release_date": "2022-02-05",
    "bucket": "broad-dsp-monster-clingen-prod-ingest-results",
    "release_directory": "20220207T010000"
}
…
theferrit32 commented 1 year ago

We will need to grant storage read permission for the service account our dev and prod GKE clusters use to access the clinvar-releases bucket in the clingen-stage project.