Closed mshadbolt closed 4 years ago
@rolando-ebi @ESapenaVentura what projects do we currently have uploaded into dev with real files? do any of them fulfill the criteria we need to test for above?
In dev we currently only have one uploaded project (the MS/unaffected brain project).
I don't think it fills any of the purposes, 100 GB of files is not really a big dataset.
We might want to test Peng He's dataset (had pretty big files) or one of the GEO datasets ray mentioned that had big files as well.
For the weird graph, I don't think we have any dataset (at least that I remember) that presents a weird graph structure. Maybe one of the GEO datasets that @ami-day has been wrangling? Usually weird graph structures arise because specimens are pooled or cell lines are involved, but we've been wrangling mostly primary tissue data, which is usually pretty straightforward (donor -> specimen -> cell suspension -> files).
Ok so the MS/unaffected brain project fulfills uploading a 'new' non-dcp1 project
I will work on doing a submission of the integration test that is modified to include a supplementary file
I think even though we want to focus on new projects it would be good to test the cell line, cell line, cell line project (not sure which one this is) and perhaps the organoids project which has cell suspensions from pooled organoids.
@ESapenaVentura do you remember for Rasa's project whether the individual files were large, or was it just that there were a lot of them that made it 'big'?
@rolando-ebi i have now submitted this project which is the 10X integration test with a linked supp file
https://ui.ingest.dev.archive.data.humancellatlas.org/submissions/detail?id=5ede1d56edbbef1c8f204f65
@mshadbolt Thanks! It looks good to me. The exporter ended up creating a folder for the supplementary_file:
$ gsutil ls gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/cell_suspension/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/collection_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/dissociation_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/donor_organism/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/enrichment_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/library_preparation_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/process/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/project/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/sequence_file/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/sequencing_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/specimen_from_organism/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/ <-----
There is one .json file inside:
$ gsutil ls gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/be57003a-f3dc-40a0-a313-84f7a51ba974_2020-06-08T11:13:30.546Z.json <---
...with contents:
{
"describedBy": "https://schema.dev.data.humancellatlas.org/type/file/2.2.0/supplementary_file",
"schema_type": "file",
"file_core": {
"file_name": "54654cc2-5168-4855-861f-c422d42f0ebd_2020-06-08T11:13:30.546Z_my_cool_protocol.pdf",
"format": "pdf",
"content_description": [
{
"text": "enrichment protocol",
"ontology": "data:2531",
"ontology_label": "Protocol"
}
],
"sha256": "f0815e087863fea7b1597553156f63532318f6838bad231ce807919288a9869b",
"crc32c": "15bdc970",
"sha1": "dae0861cb87a45c8bd524669f142d16b449ead45",
"s3_etag": "aaf1c01126ac9210ef5ba3365d0a3874",
"size": 18666,
"content_type": "application/pdf"
},
"file_description": "Single cell T cell sorting enrichment protocol"
}
And for links.json, the supplementary file link appears as the last entry in "links":
{
"links": [
{
"link_type": "process_link",
"process_id": "0277c692-e0ad-4202-9a1b-adacc590896c",
"process_type": "process",
"inputs": [
{
"input_type": "cell_suspension",
"input_id": "aac9f35e-e033-4a1a-b5db-0de0658f6bdb"
}
],
"outputs": [
{
"output_type": "sequence_file",
"output_id": "57f0e1a5-6fe8-4901-b93d-6ebdeed7bb4b"
},
{
"output_type": "sequence_file",
"output_id": "2083c0ec-38a5-4fc0-94ac-f2dec5c37250"
},
{
"output_type": "sequence_file",
"output_id": "4e412383-6223-4871-8451-d232ada374fd"
}
],
"protocols": [
{
"protocol_type": "library_preparation_protocol",
"protocol_id": "eb117fae-c530-459d-9425-1ff91574c269"
},
{
"protocol_type": "sequencing_protocol",
"protocol_id": "bcf5106a-1e07-4ca4-8d39-3b9663083877"
}
]
},
{
"link_type": "process_link",
"process_id": "a516624e-3110-4125-94e9-5cedc23fe981",
"process_type": "process",
"inputs": [
{
"input_type": "specimen_from_organism",
"input_id": "db490a04-3f94-470e-baf0-61ae3919a10b"
}
],
"outputs": [
{
"output_type": "cell_suspension",
"output_id": "aac9f35e-e033-4a1a-b5db-0de0658f6bdb"
}
],
"protocols": [
{
"protocol_type": "dissociation_protocol",
"protocol_id": "3954f01c-2720-445f-933d-995e4173ab9e"
},
{
"protocol_type": "enrichment_protocol",
"protocol_id": "02246522-ef92-4505-9630-ea40b07a3c56"
}
]
},
{
"link_type": "process_link",
"process_id": "1b0f28ae-d79b-4f80-8eb2-fd129b3c2beb",
"process_type": "process",
"inputs": [
{
"input_type": "donor_organism",
"input_id": "8d6a8d63-69c3-4934-930f-296ed2d3a4b8"
}
],
"outputs": [
{
"output_type": "specimen_from_organism",
"output_id": "db490a04-3f94-470e-baf0-61ae3919a10b"
}
],
"protocols": [
{
"protocol_type": "collection_protocol",
"protocol_id": "7353bf1e-732c-407d-a0e5-ee238e4adf78"
}
]
},
{ <--------------
"link_type": "supplementary_file_link",
"entity": {
"entity_type": "project",
"entity_id": "91b558e5-b400-4d01-9ef1-6973d872a4d7"
},
"files": [
{
"file_type": "supplementary_file",
"file_id": "be57003a-f3dc-40a0-a313-84f7a51ba974"
}
]
}
],
"describedBy": "https://schema.dev.data.humancellatlas.org/system/2.0.0/links",
"schema_version": "2.0.0",
"schema_type": "links"
}
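To double-check this kind of output without reading the whole file by eye, something like the sketch below could pull out the supplementary file links. The helper name `supplementary_file_links` is made up for illustration and is not part of the exporter; the field names are taken from the links.json shown above, trimmed down to the one relevant entry.

```python
import json


def supplementary_file_links(links_doc):
    """Return (entity_type, entity_id, file_id) for every supplementary_file_link.

    Hypothetical helper, not part of the exporter; it only relies on the
    field names visible in the links.json example above.
    """
    found = []
    for link in links_doc.get("links", []):
        if link.get("link_type") != "supplementary_file_link":
            continue  # skip process_link entries
        entity = link["entity"]
        for file_ref in link.get("files", []):
            found.append(
                (entity["entity_type"], entity["entity_id"], file_ref["file_id"])
            )
    return found


# Trimmed copy of the links.json above: only the supplementary file entry is kept.
example = json.loads("""
{
  "links": [
    {
      "link_type": "supplementary_file_link",
      "entity": {
        "entity_type": "project",
        "entity_id": "91b558e5-b400-4d01-9ef1-6973d872a4d7"
      },
      "files": [
        {
          "file_type": "supplementary_file",
          "file_id": "be57003a-f3dc-40a0-a313-84f7a51ba974"
        }
      ]
    }
  ],
  "schema_type": "links"
}
""")

print(supplementary_file_links(example))
```

The `file_id` it returns should match the UUID prefix of the .json file in the `supplementary_file/` folder listed earlier.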
Nice!
@rolando-ebi I have teed up the James Colon (big data) and Organoids (weird graph) submissions for tomorrow. Organoids is ready to go and Colon has the last few files validating. The Colon submission is about 1.5 TB.
Let me know when would be good to hit submit! (or feel free to do it yourself if easier)
Earlier I submitted the organoids submission and Rolando sent me back the metadata files for review. I checked the contents of one links.json file manually and am posting here what I wrote in Slack so that it is easier to find again later.
I think the linking all looks okay, but one of the weird things about this project is that the cell_suspensions have multiple organoid inputs, yet within a links.json (or what we used to call a 'bundle') there is only metadata for the upstream inputs of one organoid. I don't know if this ends up causing issues down the line for other components. We definitely saw that it messed a bit with how the metadata within the matrix was displayed down the line. So I don't think it is wrong, but when we point out the project to the downstream components maybe it is worth getting feedback on whether this way of organising the metadata is going to work for them.
waiting on go ahead from @rolando-ebi to complete the large James Colon submission, I won't be in tomorrow though so might be up to someone else to hit submit
Ok I am going to try to explain this better. The example I will use is 3864ce3e-280d-4028-b633-bba5f4452c95_2020-06-09T09/13/52.966923Z_c5ff6a9f-723d-4ccf-8944-d89413fd86d0.json
Here is the graph structure in a diagram from the folder that might help visualise (i didnt draw it): https://docs.google.com/drawings/d/1hVdAdf0ED4AkzHC_ryXIv9hnvXxDiPVda_RAMBrE6Es/edit
(I am ignoring protocols for now because they are fine)
In this example we have links that go Donor->specimen->cell line->organoid->cell suspension->seq files
because (i think) we tend to build the experimental graph from left to right, we end up with the following links
input | output |
---|---|
1 donor | 1 specimen |
1 specimen | 1 cell line |
1 cell line | 1 organoid |
4 organoids | 1 cell suspension |
1 cell suspension | 6 sequencing files |
But because the organoids were pooled, if you built the graph from right to left you would end up with
input | output |
---|---|
4 donors | 4 specimens |
4 specimens | 4 cell lines |
4 cell lines | 4 organoids |
4 organoids | 1 cell suspension |
1 cell suspension | 6 sequencing files |
So I don't know if that is an issue or not, do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?
Does this make more sense? Or am I crazy?
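To make the second table concrete, here is a toy sketch of walking the graph from right to left. The entity ids (`donor_1`, `organoid_3`, and so on) are invented for illustration and the shape is only my reading of the diagram, so treat it as an assumption rather than the real project metadata.

```python
from collections import Counter

# Toy version of the pooled organoid graph. All ids are invented.
# derived_from maps each entity to the entities it was derived from.
derived_from = {}
for i in range(1, 5):
    derived_from[f"specimen_{i}"] = [f"donor_{i}"]
    derived_from[f"cell_line_{i}"] = [f"specimen_{i}"]
    derived_from[f"organoid_{i}"] = [f"cell_line_{i}"]
# Four organoids are pooled into one cell suspension.
derived_from["cell_suspension_1"] = [f"organoid_{i}" for i in range(1, 5)]
for j in range(1, 7):
    derived_from[f"sequence_file_{j}"] = ["cell_suspension_1"]


def ancestors(node):
    """Collect every upstream entity of `node` by walking right to left."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def count_by_type(nodes):
    """Count entities per type, where the type is the id minus its numeric suffix."""
    return Counter(name.rsplit("_", 1)[0] for name in nodes)


counts = count_by_type(ancestors("sequence_file_1"))
for entity_type in ("donor", "specimen", "cell_line", "organoid", "cell_suspension"):
    print(entity_type, counts[entity_type])
```

Starting from any one sequence file, the walk reaches all 4 donors, 4 specimens, 4 cell lines and 4 organoids, which is the right-to-left table rather than the left-to-right one.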
actually thinking about it more I guess the way we are doing it is a bit wrong because every donor, specimen and cell line is actually in every sequencing file... hmmm
I am not really sure if we should even accept pooled files like this but wasn't involved in the original wrangling
honestly I think I chose a complicated example and the issues I have surfaced are not to do with the exporter @rolando-ebi
> do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?
The assumption is from the perspective of the sequencing files i.e from individual assay process. I think things look ok from that perspective but we might want to review how we want to export.
Or to put another way: given an assay process, export all of the "dependent" experimental metadata
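As a rough sketch of what "export all dependent metadata for an assay process" could look like, assuming process links shaped like the `process_link` entries in links.json (`process_id`, `inputs`, `outputs`): the function name and the short ids below are hypothetical, and this is not the exporter's actual code.

```python
def dependent_process_links(process_links, assay_process_id):
    """Return the assay process link plus every process link upstream of it.

    Hypothetical sketch: `process_links` is a list of dicts shaped like the
    process_link entries in links.json.
    """
    by_id = {}
    by_output = {}
    for link in process_links:
        by_id[link["process_id"]] = link
        for out in link["outputs"]:
            by_output.setdefault(out["output_id"], []).append(link)

    selected = {}
    queue = [by_id[assay_process_id]]
    while queue:
        link = queue.pop()
        if link["process_id"] in selected:
            continue  # already visited
        selected[link["process_id"]] = link
        # Follow every input back to the process(es) that produced it.
        for inp in link["inputs"]:
            queue.extend(by_output.get(inp["input_id"], []))
    return list(selected.values())


def _link(process_id, inputs, outputs):
    """Build a minimal process link; ids are made up and types are omitted."""
    return {
        "process_id": process_id,
        "inputs": [{"input_id": i} for i in inputs],
        "outputs": [{"output_id": o} for o in outputs],
    }


links = [
    _link("p_donor1", ["donor1"], ["organoid1"]),
    _link("p_donor2", ["donor2"], ["organoid2"]),
    _link("p_pool", ["organoid1", "organoid2"], ["cell_suspension1"]),
    _link("p_assay", ["cell_suspension1"], ["file1", "file2"]),
    _link("p_unrelated", ["donor3"], ["organoid3"]),
]

exported = sorted(l["process_id"] for l in dependent_process_links(links, "p_assay"))
print(exported)
```

Because the walk follows every input of the pooling step, both donors end up in the export for the assay process, while the unrelated branch is left out.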
yeah I was thinking of a scenario in which I did a search, saw that there was a donor I was interested in, and looked at the links.json that contains that donor. Currently I might assume that there was just data from 1 donor in the sequence files within that links.json, unless I was really paying attention and noticed that the donor is only linked to 1 organoid, but 4 organoids went into the cell suspension.
But maybe Azul has some smart way of interpreting these relationships.
@mshadbolt I've described how I'd expect links.json to be generated here: https://docs.google.com/drawings/d/1pf2ZlJQ0Ip0nZW5g5nSCLQj4rsHi-ko8EoyWZNbfD4U/edit?usp=sharing
Each colour groups eligible files for a single assay-process/links.json
@rolando-ebi 's grouping of the entities looks correct to me.
> So I don't know if that is an issue or not, do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?
I also share @mshadbolt 's concerns about supporting this type of pooled data in the future. I think the interoperability of the data in this set suffers from the experimental design of pooling cell suspensions the way they did. I am not sure you could link the data from an individual cell to an individual donor from just the metadata in this set. I think you would have to do variant calling and then might be able to demultiplex the individuals based on sequence polymorphisms. I think it is very likely that correctly displaying all this complexity clearly to the users will be tricky if not impossible.
When we received cell type annotations for this project, they had somehow demultiplexed the cell suspensions, as the cells were assigned to individual organoids, but I am not sure how they did it. From the metadata that we provide it wouldn't be possible without going through some kind of process like the one Ray describes above.
@rolando-ebi I agree that I think it would be better to link all dependent entities from the sequence files going backwards as you drew in your diagram.
I am trying to check Kylie James' colon immune project, which has lots of files and is both 10x and SS2, but I do not seem to have access:
Access Denied
You cannot access the resource: /submissions/detail?uuid=eff0cb74-4dd5-42f3-beab-7f8b247d3cbc&project=9147fb5e-78cd-41cd-83b9-376c2b293334
Do I need to be added to the wrangler list on this server?
Hi @rolando-ebi, could you help me check the validation status of the remaining test submission, Kylie James' colon immune project, which has lots of files and is both 10x and SS2? I do not seem to have access to it.
Plan for the remaining test submission:
I have a valid submission https://ui.ingest.staging.archive.data.humancellatlas.org/submissions/detail?uuid=fc3ae66a-bdc2-43d1-b3dd-e86a9d9c9569&project=e1774cf6-0acb-403a-ac01-a30c01929a3e waiting to be submitted.
- [ ] Check if the PR fixing a blocking issue has been merged before hitting submit.
@rolando-ebi, please let me know when it is OK for me to proceed with the testing.
Closing as this will be superseded by the 'dry run' of all MVP datasets.
Description of the task:
In order to test the new exporter we need to upload projects in dev that will be exported to the GCP staging area.
The aim is to have 4 projects uploaded with all real files and exported into the GCP staging bucket.
Things we need to test:
links.json
schema change
Acceptance criteria for the task: