DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0
7 stars 2 forks source link

Invalid columns in compact manifest for AnVIL #6110

Closed hannes-ucsc closed 4 days ago

hannes-ucsc commented 7 months ago

species for example, but there could be others. There may also be missing columns or extra columns. We didn't put a lot of effort into curating the manifest config for AnVIL. The default approach should be to include all fields as columns in a compact manifest and to derive the manifest config from the field mapping for the /index/… endpoints.

The HCA manifest is a hand-curated set of fields but we've learned that no matter what set of fields we pick, there will always be people requesting yet another field to be included, so maintaining that set became a whack-a-mole. There will always be people who request that fields be removed because they don't see a use case for those fields, which seems short-sighted/pedantic to me. The only valid reasons not to include a field should be technical limitations, e.g., size, cardinality.

The current manifest config for AnVIL also uses a naming convention that is inconsistent with the naming of fields in the /index/… endpoint responses. The Data Browser doesn't currently expose the compact manifest so we should be able to rename columns using the same dotted convention used for the /index/… endpoints which itself is based on the naming of fields in the AnVIL schema.

[edit] Quick link to important comment detailing the implementation plan.

hannes-ucsc commented 7 months ago

For demo, show compact manifest on anvilprod. Review column headers. Show that species column is gone and that organism_type is there. Compare manifest with /index/files response. Download a file using a URL from the manifest.

hannes-ucsc commented 6 months ago

Discovered during demo that the manifest contains additional columns compared to the /index/files endpoint, for example, activities.reference_assembly.

hannes-ucsc commented 6 months ago

Additionally, the service response contains the files.uuid field which the manifest omits. It seems to have the same value as files.document_id. It could be that the service response field is redundant.

hannes-ucsc commented 6 months ago

The manifest has files.file_size. The service response has files.size. Same for .file_name vs .name. And .file_url and .url

hannes-ucsc commented 6 months ago

files.accessible is missing from manifest.

achave11-ucsc commented 5 months ago

Assignee to consider next steps.

hannes-ucsc commented 2 months ago

Spike in anvilprod to systematically compare the set of columns in the manifest with that of the inner entity properties in the /index/files endpoint. The set comparison should yield three sets, that of columns only in the manifest, that of properties only the endpoint and that of fields with other inconsistencies, e.g. a type mismatch.

dsotirho-ucsc commented 2 months ago

Only in the manifest

diagnoses.document_id
diagnoses.source_datarepo_row_ids
diagnoses.diagnosis_id

Only in the /index/{entity} response

activities.accessible
biosamples.accessible
bundles.accessible
datasets.description
datasets.accessible
donors.accessible
files.version
files.uuid
files.size (duplicate of `files.file_size`)
files.name (duplicate of `files.file_name`)
files.accessible

Mismatches

manifest          | /index/{entity} repsonse
------------------------------------------
bundle_uuid       | bundles.bundle_uuid
bundle_version    | bundles.bundle_version
files.file_url    | files.url
source_id         | sources.source_id
source_spec       | sources.source_spec
dsotirho-ucsc commented 2 months ago

Continue spike to determine which of the duplicate endpoint response fields we can remove without breaking the Data Browser.

dsotirho-ucsc commented 2 months ago

It has been confirmed that the AnVIL Data Browser uses the file entity fields file_name and file_size, not name and size.

hannes-ucsc commented 2 months ago

There is no activities.accessible property in the /index/files endpoint response, so I'm afraid what @dsotirho-ucsc posted is still not correct. Assignee to read my instructions again and post a new comment with the corrected results. DO NOT EDIT THE PREVIOUS, INACCURATE COMMENT.

dsotirho-ucsc commented 2 months ago

Only in the /index/files response

files.accessible
files.name  (duplicate of `files.file_name`)
files.size  (duplicate of `files.file_size`)
files.uuid
files.version

Only in the manifest

activities.activity_id
activities.activity_table
activities.document_id
activities.reference_assembly
activities.source_datarepo_row_ids
biosamples.apriori_cell_type
biosamples.biosample_id
biosamples.document_id
biosamples.source_datarepo_row_ids
datasets.consent_group
datasets.data_modality
datasets.data_use_permission
datasets.document_id
datasets.owner
datasets.principal_investigator
datasets.registered_identifier
datasets.source_datarepo_row_ids
diagnoses.diagnosis_age
diagnoses.diagnosis_id
diagnoses.document_id
diagnoses.onset_age
diagnoses.source_datarepo_row_ids
donors.document_id
donors.donor_id
donors.source_datarepo_row_ids

Mismatches

index/files response    | manifest
------------------------|----------------------
bundles.bundle_uuid     | bundle_uuid    
bundles.bundle_version  | bundle_version
files.url               | files.file_url
sources.source_id       | source_id
sources.source_spec     | source_spec
hannes-ucsc commented 2 months ago

I've compiled the above spike results into a spreadsheet and decided on a path forward. There are three milestones: a PR against this issue to be filed now, with fixes to the manifest, a milestone that adds new names for more /index/… endpoint properties without breaking the browser, and a final milestone that removes old properties.

https://docs.google.com/spreadsheets/d/1xbKUdxkv4-Hv6rRALPSod6wxVJCckcC7v2e_ccmvFag/edit?usp=sharing

image
achave11-ucsc commented 2 months ago

Assignee to file PR with the changes specified in the "immediate remedy" column of the above spreadsheet.

hannes-ucsc commented 1 month ago

For demo, repeat the spike (before demo) and review differences (during demo). I would like to be present at that demo.