Closed hannes-ucsc closed 4 days ago
For demo, show compact manifest on anvilprod
. Review column headers. Show that species
column is gone and that organism_type
is there. Compare manifest with /index/files
response. Download a file using a URL from the manifest.
Discovered during demo that the manifest contains additional columns compared to the /index/files
endpoint, for example, activities.reference_assembly
.
Additionally, the service response contains the files.uuid
field which the manifest omits. It seems to have the same value as files.document_id
. It could be that the service response field is redundant.
The manifest has files.file_size
. The service response has files.size
. Same for .file_name
vs .name
. And .file_url
and .url
files.accessible
is missing from manifest.
Assignee to consider next steps.
Spike in anvilprod
to systematically compare the set of columns in the manifest with that of the inner entity properties in the /index/files
endpoint. The set comparison should yield three sets, that of columns only in the manifest, that of properties only the endpoint and that of fields with other inconsistencies, e.g. a type mismatch.
diagnoses.document_id
diagnoses.source_datarepo_row_ids
diagnoses.diagnosis_id
activities.accessible
biosamples.accessible
bundles.accessible
datasets.description
datasets.accessible
donors.accessible
files.version
files.uuid
files.size (duplicate of `files.file_size`)
files.name (duplicate of `files.file_name`)
files.accessible
manifest | /index/{entity} repsonse
------------------------------------------
bundle_uuid | bundles.bundle_uuid
bundle_version | bundles.bundle_version
files.file_url | files.url
source_id | sources.source_id
source_spec | sources.source_spec
Continue spike to determine which of the duplicate endpoint response fields we can remove without breaking the Data Browser.
It has been confirmed that the AnVIL Data Browser uses the file
entity fields file_name
and file_size
, not name
and size
.
There is no activities.accessible
property in the /index/files
endpoint response, so I'm afraid what @dsotirho-ucsc posted is still not correct. Assignee to read my instructions again and post a new comment with the corrected results. DO NOT EDIT THE PREVIOUS, INACCURATE COMMENT.
files.accessible
files.name (duplicate of `files.file_name`)
files.size (duplicate of `files.file_size`)
files.uuid
files.version
activities.activity_id
activities.activity_table
activities.document_id
activities.reference_assembly
activities.source_datarepo_row_ids
biosamples.apriori_cell_type
biosamples.biosample_id
biosamples.document_id
biosamples.source_datarepo_row_ids
datasets.consent_group
datasets.data_modality
datasets.data_use_permission
datasets.document_id
datasets.owner
datasets.principal_investigator
datasets.registered_identifier
datasets.source_datarepo_row_ids
diagnoses.diagnosis_age
diagnoses.diagnosis_id
diagnoses.document_id
diagnoses.onset_age
diagnoses.source_datarepo_row_ids
donors.document_id
donors.donor_id
donors.source_datarepo_row_ids
index/files response | manifest
------------------------|----------------------
bundles.bundle_uuid | bundle_uuid
bundles.bundle_version | bundle_version
files.url | files.file_url
sources.source_id | source_id
sources.source_spec | source_spec
I've compiled the above spike results into a spreadsheet and decided on a path forward. There are three milestones: a PR against this issue to be filed now, with fixes to the manifest, a milestone that adds new names for more /index/… endpoint properties without breaking the browser, and a final milestone that removes old properties.
https://docs.google.com/spreadsheets/d/1xbKUdxkv4-Hv6rRALPSod6wxVJCckcC7v2e_ccmvFag/edit?usp=sharing
Assignee to file PR with the changes specified in the "immediate remedy" column of the above spreadsheet.
For demo, repeat the spike (before demo) and review differences (during demo). I would like to be present at that demo.
species
for example, but there could be others. There may also be missing columns or extra columns. We didn't put a lot of effort into curating the manifest config for AnVIL. The default approach should be to include all fields as columns in a compact manifest and to derive the manifest config from the field mapping for the/index/…
endpoints.The HCA manifest is a hand-curated set of fields but we've learned that no matter what set of fields we pick, there will always be people requesting yet another field to be included, so maintaining that set became a whack-a-mole. There will always be people who request that fields be removed because they don't see a use case for those fields, which seems short-sighted/pedantic to me. The only valid reasons not to include a field should be technical limitations, e.g., size, cardinality.
The current manifest config for AnVIL also uses a naming convention that is inconsistent with the naming of fields in the
/index/…
endpoint responses. The Data Browser doesn't currently expose the compact manifest so we should be able to rename columns using the same dotted convention used for the/index/…
endpoints which itself is based on the naming of fields in the AnVIL schema.[edit] Quick link to important comment detailing the implementation plan.