Flesh out the implementation behind the facade for writes such that the wmg pipeline will write an s3 path that contains a data schema version

chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets to cellxgene.

MIT License

Change the implementation behind the facade such that:

Include a data_schema_version in the schema package's init file.
Write data to the s3://cellxgene-wmg-prod/snapshots/<data_schema_version>/<snapshot_creation_timestamp>/ folder.
Write the file s3://cellxgene-wmg-prod/snapshots/<data_schema_version>/latest_snapshot_identifier with the value <snapshot_creation_timestamp>.
Write the file s3://cellxgene-wmg-prod/latest_snapshot_run. Unlike latest_snapshot_identifier, it contains the path to the folder holding the most recently generated data, whether or not that data passed validation, so its value can differ from latest_snapshot_identifier (for example, when data validation fails).
Ensure that the snapshot removal that happens after every successful run of the WMG pipeline is done inside the <data_schema_version> folder in S3.
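
The layout above can be sketched as a few pure path helpers. This is only an illustration under stated assumptions, not the pipeline's actual code: the placeholder version "v1", every function name, the choice to store a key prefix (rather than a full s3:// URI) in latest_snapshot_run, and the keep-list retention rule are all hypothetical. Only the bucket name and the path structure come from the list above.

```python
from typing import Dict, List

# Assumption: the real value would live in the schema package's init file;
# "v1" is a placeholder, not the project's actual version string.
DATA_SCHEMA_VERSION = "v1"
WMG_BUCKET = "cellxgene-wmg-prod"  # bucket name taken from the paths in this ticket


def snapshot_root(schema_version: str = DATA_SCHEMA_VERSION) -> str:
    """Key prefix that holds every snapshot written for one schema version."""
    return f"snapshots/{schema_version}"


def snapshot_prefix(snapshot_id: str, schema_version: str = DATA_SCHEMA_VERSION) -> str:
    """Key prefix of a single snapshot folder."""
    return f"{snapshot_root(schema_version)}/{snapshot_id}"


def pointer_files(
    snapshot_id: str,
    validation_passed: bool,
    schema_version: str = DATA_SCHEMA_VERSION,
) -> Dict[str, str]:
    """Return {object key: file contents} for the pointer files to write after a run."""
    # latest_snapshot_run is always updated, even when validation fails.
    files = {"latest_snapshot_run": snapshot_prefix(snapshot_id, schema_version)}
    # latest_snapshot_identifier only advances when the data passed validation.
    if validation_passed:
        files[f"{snapshot_root(schema_version)}/latest_snapshot_identifier"] = snapshot_id
    return files


def snapshots_to_remove(
    keys: List[str],
    keep: List[str],
    schema_version: str = DATA_SCHEMA_VERSION,
) -> List[str]:
    """Select keys inside this schema version's folder whose snapshot id is not kept."""
    root = snapshot_root(schema_version) + "/"
    to_remove = []
    for key in keys:
        if not key.startswith(root):
            continue  # keys of other schema versions are never touched
        snapshot_id, sep, _ = key[len(root):].partition("/")
        # sep is empty for files directly under the root, such as the pointer file
        if sep and snapshot_id not in keep:
            to_remove.append(key)
    return to_remove
```

Keeping these helpers free of S3 calls means the scoping rule (cleanup never leaves the <data_schema_version> folder) can be checked without touching the bucket; the actual upload and delete calls would wrap them.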

One thing to keep in mind is that the construction of the marker genes cube uses load_snapshot().

This is important because the WMG API (the reader) also uses load_snapshot() to read data. The load_snapshot interface has already been changed in anticipation of the new location from which snapshots should be read, but it does not yet read from that location because no data is being produced there. The interface was modified in the same way that a facade was introduced for the write path - see this ticket

Completing this ticket therefore also means fleshing out load_snapshot. We need to introduce a flag or default parameter, say read_from_new_loc, such that data is read from the new location when read_from_new_loc is true and from the old location otherwise. When constructing the marker genes cube, data must be read from the new location.
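
A rough sketch of how such a flag could gate the read path is shown below. It does not reproduce the real load_snapshot signature: the function name resolve_snapshot_prefix, the read_text callable, and the old_snapshot_root parameter are hypothetical, and the sketch assumes the old layout also keeps a latest_snapshot_identifier file under its root, which this ticket does not state.

```python
from typing import Callable


def resolve_snapshot_prefix(
    read_text: Callable[[str], str],
    schema_version: str,
    old_snapshot_root: str,
    read_from_new_loc: bool = False,
) -> str:
    """Return the key prefix of the snapshot that should be loaded.

    read_text is any callable that returns the contents of the object at a key
    (hypothetical; in practice it would wrap an S3 read).
    """
    # New layout: snapshots/<data_schema_version>/...
    # Old layout: whatever root the reader uses today, passed in by the caller.
    root = f"snapshots/{schema_version}" if read_from_new_loc else old_snapshot_root
    snapshot_id = read_text(f"{root}/latest_snapshot_identifier").strip()
    return f"{root}/{snapshot_id}"
```

With the default of False, the API keeps reading from the old location, while the marker genes cube construction would pass read_from_new_loc=True. Once the writer is verified, the parameter and the old-location branch can be deleted together.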

After the writer has been deployed and verified to write to the new location, the API can read from the new location as well. At that point, the read_from_new_loc flag and the logic gated by it can be removed entirely.

Flesh out the implementation behind the facade for writes such that the wmg pipeline will write an s3 path that contains a data schema version #5166