Metadata spreadsheet downloadable from ingest for data consumer to contain the linking between biomaterials

gabsie commented 3 years ago

Note: this ticket has been narrowed to cover only the task about the linking between biomaterials. The original description and epic are here.

As a data consumer, I want to be able to access and download the metadata spreadsheet for a project with as complete metadata as possible, with the ability to trace which file corresponds to which cell suspension, specimen and donor.

Note: for the data consumer version we are not including the empty columns from the default metadata spreadsheet, but only the ones which have values associated with them.

Acceptance Criteria / Definition of Done

Given a metadata spreadsheet per project is available to download from ingest
When I download the data consumer version, then I should see the full metadata objects and their relationships, and be able to see which files relate to the cell, specimen and donor information for that project.

gabsie commented 2 years ago

Have a separate design discussion to identify the tasks under this user value ticket.

aaclan-ebi commented 2 years ago

The ordering of columns should be the same as in the original spreadsheet
The ordering of worksheets should be the same as in the original spreadsheet
The user-friendly name of each metadata property should be in the spreadsheet
The description of each property should be in the spreadsheet

@gabsie @ESapenaVentura anything else?

gabsie commented 2 years ago

Hi, as per our meeting today I have edited the ticket description above. An important update: we're not considering the wrangler use case (I want to update a project with this spreadsheet), only the data consumer case.

idazucchi commented 2 years ago

@aaclan-ebi and @gabsie to have a meeting about this today

MightyAx commented 2 years ago

Notes from meeting with @aaclan-ebi and @ESapenaVentura:

Download Links

Make the downloadable spreadsheet contain the linking between the biomaterials and the processes, so that requirement 2 above can be fulfilled (include the columns from the default metadata template that make the connections visible)

We can meet this requirement, and remove the extraneous process tab, with the following approach:

For data that links to one or more processes
Check for links to:
- protocols
- inputBiomaterials
- inputFiles
For each link use the describedBy field to determine the type of content being linked. This determines the column name in which to add the linked ID. (protocol_id, biomaterial_id, file_id)
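
The lookup described above could be sketched in Python roughly as follows. This is only an illustration, not the ingest implementation: the shape of the `process` dictionary, the helper names `column_for` and `link_columns`, the `id` key on each linked object, and the assumption that the `describedBy` URL contains a path segment equal to `protocol`, `biomaterial` or `file` are all hypothetical.

```python
from collections import defaultdict

# Assumed mapping from a describedBy path segment to the column holding the linked ID.
COLUMN_BY_TYPE = {
    "protocol": "protocol_id",
    "biomaterial": "biomaterial_id",
    "file": "file_id",
}

# The three kinds of link checked on each process.
LINK_FIELDS = ("protocols", "inputBiomaterials", "inputFiles")


def column_for(described_by: str) -> str:
    """Pick the column name from the type segment of a describedBy URL."""
    segments = described_by.split("/")
    for type_name, column in COLUMN_BY_TYPE.items():
        if type_name in segments:
            return column
    raise ValueError(f"Unrecognised describedBy: {described_by}")


def link_columns(process: dict) -> dict:
    """Group the IDs linked from one process by the column they belong in."""
    columns = defaultdict(list)
    for field in LINK_FIELDS:
        for linked in process.get(field, []):
            columns[column_for(linked["describedBy"])].append(linked["id"])
    return dict(columns)
```

Raising on an unrecognised `describedBy` is a design choice for the sketch: a link of unknown type is surfaced rather than silently dropped from the spreadsheet.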

Upload Links

It may not be mentioned as a requirement above, but changing links via upload is explicitly out of scope for this user story.

Formatting Changes

Make this spreadsheet include rows 1-5 from the original metadata template, which contain the user-friendly name, the description, examples, the system name and the separator line

These details can be pulled from the metadata and inserted with the correct font and formatting, but that should be a separate ticket.

Ordering Tabs & Columns

Try to keep the ordering of columns as per the original spreadsheet
Try to keep the ordering of spreadsheet tabs as per the original spreadsheet

There might be a misunderstanding of how our Excel files work. There is no single format that works (@ESapenaVentura mentioned he routinely moves the columns around so they make more sense to him). Columns and tabs can be added, moved and removed to some extent without any issues. Storing the column/tab ordering against every import and reusing this ordering upon download is a much bigger task than it may seem and is certainly out of scope for this ticket. What might be more achievable is to define a common ordering that will better serve our users, and to use this same ordering when generating blank files as well as when re-generating files from existing projects.

Either way, these changes should be tracked on another ticket.

Endpoint

Make the downloadable metadata available as an endpoint

We're a little uncertain what is required here as we believe this is already available.

Do you mean spreadsheet download is available at a specific URL?
Do you mean metadata is available at an endpoint?

Project Catalogue Integration

Make the downloadable metadata available for download in the catalogue?

@aaclan-ebi is worried that this might get us in trouble with the DCP.

MightyAx commented 2 years ago

From conversation with @gabsie

The focus for the data consumer version is to provide only the columns that are populated, not blank columns.
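
As a rough illustration of that rule, the sketch below keeps only the columns that have a value in at least one row. The function names and the representation of a worksheet as a header list plus a list of row dictionaries are assumptions made for the example, not how the ingest exporter stores its data.

```python
def has_value(value) -> bool:
    """Treat None, empty strings and whitespace-only strings as blank."""
    return value is not None and str(value).strip() != ""


def populated_columns(header: list, rows: list) -> list:
    """Return the columns with at least one non-blank value, in header order."""
    return [col for col in header if any(has_value(row.get(col)) for row in rows)]


def drop_empty_columns(header: list, rows: list) -> tuple:
    """Return the reduced header and the rows restricted to those columns."""
    keep = populated_columns(header, rows)
    return keep, [{col: row.get(col) for col in keep} for row in rows]
```

Note that a value such as `0` counts as populated here; only missing or whitespace-only cells are treated as blank.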

ofanobilbao commented 2 years ago

This will not be taken in this sprint. It will probably be high priority next sprint

MightyAx commented 2 years ago

From conversation with @amnonkhen, regarding the Downloading Links functionality detailed above.

The spreadsheet is currently generated using granular API calls per entity, link, etc. This takes time and is much more difficult to implement because we join the entities ourselves, with all the overhead that entails (dev time, quality, execution time). Instead, the work should be done in ingest-core (a new endpoint is a no-brainer), maybe even using more advanced MongoDB queries to extract the links.

FAO @aaclan-ebi:

aaclan-ebi commented 2 years ago

@MightyAx and I will pair today to start working on this.

aaclan-ebi commented 2 years ago

@MightyAx and I brainstormed possible options for implementing this: https://miro.com/app/board/o9J_li2IEts=/

The current plan is:

Implement an Ingest Core project endpoint that returns a list of all objects and their relationships
Implement Spreadsheet generation

gabsie commented 2 years ago

@MightyAx @aaclan-ebi will hopefully work on this today.

MightyAx commented 2 years ago

We've successfully prototyped a submission "census", which is just the IDs and relationship mappings of all objects: https://github.com/ebi-ait/ingest-core/pull/97
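
To show how a mapping of IDs and relationships would let a consumer trace a file back to its cell suspension, specimen and donor, here is a small sketch. The census shape used below (a dictionary keyed by ID, each entry listing the IDs it was derived from under a hypothetical `inputs` key) is an assumption for illustration only; the actual response format is defined in the PR above.

```python
from collections import deque


def trace_ancestors(census: dict, start_id: str) -> list:
    """Follow 'inputs' links upward from start_id, breadth-first.

    Returns the ancestor IDs in the order they are reached,
    visiting each ID at most once.
    """
    seen = {start_id}
    order = []
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for parent in census.get(current, {}).get("inputs", []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical census for one file derived from a single donor.
example_census = {
    "file_1": {"type": "file", "inputs": ["cell_suspension_1"]},
    "cell_suspension_1": {"type": "biomaterial", "inputs": ["specimen_1"]},
    "specimen_1": {"type": "biomaterial", "inputs": ["donor_1"]},
    "donor_1": {"type": "biomaterial", "inputs": []},
}
```

With a structure like this, the whole graph is fetched once and walked in memory, rather than being resolved with one API call per link.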

aaclan-ebi commented 2 years ago

WIP changes in the importer: https://github.com/ebi-ait/ingest-client/pull/32

aaclan-ebi commented 2 years ago

Please note ebi-ait/dcp-ingest-central#491 may block testing of this feature.

jacobwindsor commented 2 years ago

In PR review.

We might need to improve how linking information is retrieved from core so that large spreadsheets download faster.

aaclan-ebi commented 2 years ago

Hi @ami-day, the changes should already be in staging. Please verify.

It would be nice to upload a real spreadsheet from a dataset in prod to staging, make some linking updates via the ingest UI (by expanding a process row and making some changes), and then download the spreadsheet to check that it shows the correct linking.

yusra-haider commented 2 years ago

Ticket is on wrangling test; to be tested by @ami-day

ami-day commented 2 years ago

Hi @aaclan-ebi, I was able to download a real spreadsheet from ingest prod, and it looks the same as it did initially. However, I am getting an error when trying to re-upload it to staging: https://api.ingest.staging.archive.data.humancellatlas.org/submissionEnvelopes/61a0d0f54fe10b74b9ae5a27/submissionErrors

Maybe we could discuss on Monday when you're back.

amnonkhen commented 2 years ago

@aaclan-ebi to look into this today.

idazucchi commented 2 years ago

PR with the fixes is ready

ofanobilbao commented 2 years ago

@ami-day to review this today

ami-day commented 2 years ago

I will test this today

ami-day commented 2 years ago

Hi @aaclan-ebi, sorry for the delay in testing, but it works :) Here is the submission: https://staging.contribute.data.humancellatlas.org/submissions/detail?uuid=01c167fa-0cb3-44bc-a392-1e7fa8d156ca

All of the specimens were enriched by FACS and size selection. As a test, I deleted the size selection enrichment protocol for the specimen with ID SKN8090540 and output cell suspension ERX5053663. I then downloaded the spreadsheet, and I can see that the size selection protocol is missing from that cell suspension only. Let me know if you need any more testing for this ticket.

gabsie commented 2 years ago

Thanks, @ami-day. Alegria, @aaclan-ebi - can we now put this on prod so that we can demo it tomorrow at the DCP demo? Thank you!

ami-day commented 2 years ago

@aaclan-ebi is monitoring deployment to production

aaclan-ebi commented 2 years ago

Deployed to prod today. Screenshot 2021-12-07 at 11.21.28.png