Closed theferrit32 closed 1 year ago
Proposed message format:
{
"release_date": "2023-02-08",
"bucket": "clinvar-releases",
"release_directory": "2023_03_08"
}
Current message format:
{
"release_date": "2022-02-05",
"bucket": "broad-dsp-monster-clingen-prod-ingest-results",
"files": [
"20220207T010000/clinical_assertion/created/000000000000",
"20220207T010000/clinical_assertion/deleted/000000000000",
"20220207T010000/clinical_assertion/updated/000000000000",
"20220207T010000/clinical_assertion_observation/created/000000000000",
"20220207T010000/clinical_assertion_observation/deleted/000000000000",
"20220207T010000/clinical_assertion_observation/updated/000000000000",
"20220207T010000/clinical_assertion_trait/created/000000000000",
"20220207T010000/clinical_assertion_trait/deleted/000000000000",
"20220207T010000/clinical_assertion_trait/updated/000000000000",
"20220207T010000/clinical_assertion_trait_set/created/000000000000",
"20220207T010000/clinical_assertion_trait_set/deleted/000000000000",
"20220207T010000/clinical_assertion_trait_set/updated/000000000000",
"20220207T010000/clinical_assertion_variation/created/000000000000",
"20220207T010000/clinical_assertion_variation/deleted/000000000000",
"20220207T010000/clinical_assertion_variation/updated/000000000000",
"20220207T010000/gene/created/000000000000",
"20220207T010000/gene/deleted/000000000000",
"20220207T010000/gene/updated/000000000000",
"20220207T010000/gene_association/created/000000000000",
"20220207T010000/gene_association/deleted/000000000000",
"20220207T010000/gene_association/updated/000000000000",
"20220207T010000/rcv_accession/created/000000000000",
"20220207T010000/rcv_accession/deleted/000000000000",
"20220207T010000/rcv_accession/updated/000000000000",
"20220207T010000/release_date.txt",
"20220207T010000/submission/created/000000000000",
"20220207T010000/submission/deleted/000000000000",
"20220207T010000/submission/updated/000000000000",
"20220207T010000/submitter/created/000000000000",
"20220207T010000/submitter/deleted/000000000000",
"20220207T010000/submitter/updated/000000000000",
"20220207T010000/trait/created/000000000000",
"20220207T010000/trait/deleted/000000000000",
"20220207T010000/trait/updated/000000000000",
"20220207T010000/trait_mapping/created/000000000000",
"20220207T010000/trait_mapping/deleted/000000000000",
"20220207T010000/trait_mapping/updated/000000000000",
"20220207T010000/trait_set/created/000000000000",
"20220207T010000/trait_set/deleted/000000000000",
"20220207T010000/trait_set/updated/000000000000",
"20220207T010000/variation/created/000000000000",
"20220207T010000/variation/deleted/000000000000",
"20220207T010000/variation/updated/000000000000",
"20220207T010000/variation_archive/created/000000000000",
"20220207T010000/variation_archive/deleted/000000000000",
"20220207T010000/variation_archive/updated/000000000000"
]
}
We can also then add diff releases DSP has already processed after:
{
"release_date": "2022-02-01",
"bucket": "clinvar-releases",
"release_directory": "2022-02-01"
}
{
"release_date": "2022-02-05",
"bucket": "broad-dsp-monster-clingen-prod-ingest-results",
"release_directory": "20220207T010000"
}
…
We will need to grant storage read permission for the service account our dev and prod GKE clusters use to access the clinvar-releases
bucket in the clingen-stage project.
@larrybabb has created a directory in cloud storage that contains
created
files for the2023-02-08
release of ClinVar. It matches the directory structure of the dsp diff directories, so a release messages can be created for it.One issue may be that the bigquery export to these files created a very large number of relatively small files, which could be an issue for creating the release message in kafka. The file listing is in the messages and I think there is a maximum message size. If this is a problem, then we will need to either combine these files into a smaller number, or change the release message format to enable just passing the root of the release's directory tree, and have the client application do the file listing.
I created some python code to do this but we can do it with the java libraries too https://github.com/clingen-data-model/clinvar-streams/blob/86896a6ceebc3ea30a2d0467f433efbedfd10c43/stream-repair/make-release-notification.py#L55-L76
The clinvar-raw stream reader code will need to be updated. Either add a flag to the notification message to say that it is a directory, not a file listing, or look at the file listing and see that it just has a directory in it. Probably adding a flag like
directory_only: true
, orfile_listing_type: directory
would be better. If not present, it would interpretfiles: []
as the file listing. Or maybe removingfiles: [<>… ]
and replacing it withdirectory: "<>"
would be better, and switch on the presence of these fields.