Upload projects to dev to test exporter

mshadbolt commented 4 years ago

Description of the task:

In order to test the new exporter we need to upload projects in dev that will be exported to the GCP staging area.

The aim is to have 4 projects uploaded with all real files and exported into the GCP staging bucket.

Things we need to test:

- File size: upload a project with large data files and compare to the integration test project to benchmark export time
- Correct supplementary file linking: one project with supplementary files to check the new links.json schema change
- Correct graph linking: two projects with 'weird' graphs to ensure the exporter writes the correct metadata without error

Acceptance criteria for the task:

[ ] 4 projects uploaded to dev ready for export with real data files
- [x] MS / brain (@ESapenaVentura )
- [ ] Project with 'big' data files
  - [ ] Kylie James' colon immune project, has lots of files and is both 10x and SS2 (@mshadbolt)
- [x] Project with supplementary files
  - [x] 10x Integration test with supp file(@mshadbolt)
- [x] Project with weird graph, also has supp files
  - [x] Cerebral Organoids(@mshadbolt) (dummy fastqs)

mshadbolt commented 4 years ago

@rolando-ebi @ESapenaVentura what projects do we currently have uploaded into dev with real files? do any of them fulfill the criteria we need to test for above?

ESapenaVentura commented 4 years ago

In dev we currently only have one uploaded project (the MS/unaffected brain project).

I don't think it fills any of the purposes, 100 GB of files is not really a big dataset.

We might want to test Peng He's dataset (had pretty big files) or one of the GEO datasets Ray mentioned that had big files as well.

For the weird graph, I don't think we have any dataset (at least that I remember) that presents a weird graph structure. Maybe one of the GEO datasets that @ami-day has been wrangling? Usually weird graph structures arise because specimens are pooled or cell lines are involved, but we've been wrangling mostly primary tissue data, which is usually pretty straightforward (donor -> specimen -> cell suspension -> files).

mshadbolt commented 4 years ago

Ok so the MS/unaffected brain project fulfills uploading a 'new' non-dcp1 project

I will work on doing a submission of the integration test that is modified to include a supplementary file

I think even though we want to focus on new projects it would be good to test the cell line, cell line, cell line project (not sure which one this is) and perhaps the organoids project which has cell suspensions from pooled organoids.

@ESapenaVentura do you remember, for Rasa's project, whether the individual files were large or whether it was 'big' just because there were a lot of them?

mshadbolt commented 4 years ago

@rolando-ebi I have now submitted this project, which is the 10X integration test with a linked supp file:

https://ui.ingest.dev.archive.data.humancellatlas.org/submissions/detail?id=5ede1d56edbbef1c8f204f65

rolando-ebi commented 4 years ago

@mshadbolt Thanks! It looks good to me. The exporter ended up creating a folder for the supplementary_file:

$ gsutil ls gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata

gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/cell_suspension/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/collection_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/dissociation_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/donor_organism/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/enrichment_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/library_preparation_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/process/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/project/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/sequence_file/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/sequencing_protocol/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/specimen_from_organism/
gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/    <-----

There is one .json file inside:

$ gsutil ls gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/

gs://broad-dsp-monster-hca-dev-ebi-staging/dev/metadata/supplementary_file/be57003a-f3dc-40a0-a313-84f7a51ba974_2020-06-08T11:13:30.546Z.json <---

...with contents:

{
  "describedBy": "https://schema.dev.data.humancellatlas.org/type/file/2.2.0/supplementary_file",
  "schema_type": "file",
  "file_core": {
    "file_name": "54654cc2-5168-4855-861f-c422d42f0ebd_2020-06-08T11:13:30.546Z_my_cool_protocol.pdf",
    "format": "pdf",
    "content_description": [
      {
        "text": "enrichment protocol",
        "ontology": "data:2531",
        "ontology_label": "Protocol"
      }
    ],
    "sha256": "f0815e087863fea7b1597553156f63532318f6838bad231ce807919288a9869b",
    "crc32c": "15bdc970",
    "sha1": "dae0861cb87a45c8bd524669f142d16b449ead45",
    "s3_etag": "aaf1c01126ac9210ef5ba3365d0a3874",
    "size": 18666,
    "content_type": "application/pdf"
  },
  "file_description": "Single cell T cell sorting enrichment protocol"
}
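
To sanity-check a staged file against the checksums recorded in its `file_core`, something like the following could be used. This is only a sketch: `verify_checksums` is a hypothetical helper, not part of the exporter, and it only covers `sha256`, `sha1` and `size` because those can be computed with the Python standard library (`crc32c` and `s3_etag` are left out).

```python
import hashlib
from pathlib import Path


def verify_checksums(path, file_core):
    """Compare a local file against the sha256/sha1/size recorded in file_core.

    Returns a dict mapping each checked field to True/False.
    Hypothetical helper for illustration; not part of the exporter.
    """
    data = Path(path).read_bytes()  # fine for small files such as a PDF protocol
    computed = {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "size": len(data),
    }
    # Only check the fields that are actually present in the metadata.
    return {key: computed[key] == file_core[key] for key in computed if key in file_core}
```

For the file above this would be called with a local copy of the PDF and the `file_core` object from the JSON; any field that comes back `False` points at a mismatch between the staged data file and its metadata.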

And for links.json, the supplementary file link appears as the last entry in "links":

{
  "links": [
    {
      "link_type": "process_link",
      "process_id": "0277c692-e0ad-4202-9a1b-adacc590896c",
      "process_type": "process",
      "inputs": [
        {
          "input_type": "cell_suspension",
          "input_id": "aac9f35e-e033-4a1a-b5db-0de0658f6bdb"
        }
      ],
      "outputs": [
        {
          "output_type": "sequence_file",
          "output_id": "57f0e1a5-6fe8-4901-b93d-6ebdeed7bb4b"
        },
        {
          "output_type": "sequence_file",
          "output_id": "2083c0ec-38a5-4fc0-94ac-f2dec5c37250"
        },
        {
          "output_type": "sequence_file",
          "output_id": "4e412383-6223-4871-8451-d232ada374fd"
        }
      ],
      "protocols": [
        {
          "protocol_type": "library_preparation_protocol",
          "protocol_id": "eb117fae-c530-459d-9425-1ff91574c269"
        },
        {
          "protocol_type": "sequencing_protocol",
          "protocol_id": "bcf5106a-1e07-4ca4-8d39-3b9663083877"
        }
      ]
    },
    {
      "link_type": "process_link",
      "process_id": "a516624e-3110-4125-94e9-5cedc23fe981",
      "process_type": "process",
      "inputs": [
        {
          "input_type": "specimen_from_organism",
          "input_id": "db490a04-3f94-470e-baf0-61ae3919a10b"
        }
      ],
      "outputs": [
        {
          "output_type": "cell_suspension",
          "output_id": "aac9f35e-e033-4a1a-b5db-0de0658f6bdb"
        }
      ],
      "protocols": [
        {
          "protocol_type": "dissociation_protocol",
          "protocol_id": "3954f01c-2720-445f-933d-995e4173ab9e"
        },
        {
          "protocol_type": "enrichment_protocol",
          "protocol_id": "02246522-ef92-4505-9630-ea40b07a3c56"
        }
      ]
    },
    {
      "link_type": "process_link",
      "process_id": "1b0f28ae-d79b-4f80-8eb2-fd129b3c2beb",
      "process_type": "process",
      "inputs": [
        {
          "input_type": "donor_organism",
          "input_id": "8d6a8d63-69c3-4934-930f-296ed2d3a4b8"
        }
      ],
      "outputs": [
        {
          "output_type": "specimen_from_organism",
          "output_id": "db490a04-3f94-470e-baf0-61ae3919a10b"
        }
      ],
      "protocols": [
        {
          "protocol_type": "collection_protocol",
          "protocol_id": "7353bf1e-732c-407d-a0e5-ee238e4adf78"
        }
      ]
    },
    {    <--------------
      "link_type": "supplementary_file_link",
      "entity": {
        "entity_type": "project",
        "entity_id": "91b558e5-b400-4d01-9ef1-6973d872a4d7"
      },
      "files": [
        {
          "file_type": "supplementary_file",
          "file_id": "be57003a-f3dc-40a0-a313-84f7a51ba974"
        }
      ]
    }
  ],
  "describedBy": "https://schema.dev.data.humancellatlas.org/system/2.0.0/links",
  "schema_version": "2.0.0",
  "schema_type": "links"
}
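
A quick way to check the new link type without reading the whole file by eye is to pull out only the `supplementary_file_link` entries. The sketch below assumes the links.json layout shown above; `supplementary_links` is a hypothetical helper name, and the embedded document is a trimmed copy of the example with the process links omitted.

```python
import json


def supplementary_links(links_doc):
    """Yield (entity_type, entity_id, file_id) for every supplementary_file_link."""
    for link in links_doc.get("links", []):
        if link.get("link_type") != "supplementary_file_link":
            continue  # skip process_link entries
        entity = link["entity"]
        for file_ref in link.get("files", []):
            yield entity["entity_type"], entity["entity_id"], file_ref["file_id"]


# Trimmed copy of the links.json above (process links omitted).
example = json.loads("""
{
  "links": [
    {
      "link_type": "supplementary_file_link",
      "entity": {"entity_type": "project",
                 "entity_id": "91b558e5-b400-4d01-9ef1-6973d872a4d7"},
      "files": [{"file_type": "supplementary_file",
                 "file_id": "be57003a-f3dc-40a0-a313-84f7a51ba974"}]
    }
  ]
}
""")

for entity_type, entity_id, file_id in supplementary_links(example):
    print(entity_type, entity_id, "->", file_id)
```

The `file_id` it reports is the same UUID that prefixes the .json file name in the `supplementary_file/` folder listed earlier, which is the cross-reference worth checking.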

mshadbolt commented 4 years ago

Nice!

@rolando-ebi I have teed up the James Colon (big data) and Organoids (weird graph) submissions for tomorrow. Organoids is ready to go and Colon has the last few files validating. This is about 1.5 TB.

Let me know when would be good to hit submit! (or feel free to do it yourself if easier)

mshadbolt commented 4 years ago

Earlier I submitted the organoids submission and Rolando provided me back the metadata files for review. I checked the contents of one links.json file manually and am posting here what I wrote in Slack so that it is easier to find again later.

I think the linking all looks okay, but one of the weird things about this project is that the cell_suspensions have multiple organoid inputs, yet within a links.json (or what we used to call a 'bundle') there is only metadata for the upstream inputs of one organoid. I don't know if this ends up causing issues down the line for other components. We definitely saw that it messed a bit with how the metadata within the matrix was displayed down the line. So I don't think it is wrong, but when we point out the project to the downstream components maybe it is worth getting feedback on whether this way of organising the metadata is going to work for them.

Waiting on the go-ahead from @rolando-ebi to complete the large James Colon submission. I won't be in tomorrow though, so it might be up to someone else to hit submit.

mshadbolt commented 4 years ago

Ok I am going to try to explain this better. The example I will use is 3864ce3e-280d-4028-b633-bba5f4452c95_2020-06-09T09/13/52.966923Z_c5ff6a9f-723d-4ccf-8944-d89413fd86d0.json

Here is the graph structure in a diagram from the folder that might help visualise it (I didn't draw it): https://docs.google.com/drawings/d/1hVdAdf0ED4AkzHC_ryXIv9hnvXxDiPVda_RAMBrE6Es/edit

(I am ignoring protocols for now because they are fine)

In this example we have links that go donor -> specimen -> cell line -> organoid -> cell suspension -> seq files.

Because (I think) we tend to build the experimental graph from left to right, we end up with the following links:

| input | output |
| --- | --- |
| 1 donor | 1 specimen |
| 1 specimen | 1 cell line |
| 1 cell line | 1 organoid |
| 4 organoids | 1 cell suspension |
| 1 cell suspension | 6 sequencing files |

But because the organoids were pooled, if you built the graph from right to left you would end up with:

| input | output |
| --- | --- |
| 4 donors | 4 specimens |
| 4 specimens | 4 cell lines |
| 4 cell lines | 4 organoids |
| 4 organoids | 1 cell suspension |
| 1 cell suspension | 6 sequencing files |

So I don't know if that is an issue or not, do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?

Does this make more sense? Or am I crazy?
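
The difference between the two directions can be reproduced with a toy graph. This is a minimal sketch: the node names (`donor_1`, `organoid_3`, ...) are made up for illustration and are not the real entity IDs, and the graph simply assumes four independent donor lineages pooled into one cell suspension with six sequencing files.

```python
from collections import Counter, defaultdict

# Toy version of the pooled-organoid graph: 4 independent donor lineages
# feed 1 cell suspension, which yields 6 sequencing files.
edges = []
for i in range(1, 5):
    edges += [
        (f"donor_{i}", f"specimen_{i}"),
        (f"specimen_{i}", f"cell_line_{i}"),
        (f"cell_line_{i}", f"organoid_{i}"),
        (f"organoid_{i}", "cell_suspension_1"),
    ]
edges += [("cell_suspension_1", f"sequence_file_{n}") for n in range(1, 7)]

children, parents = defaultdict(set), defaultdict(set)
for src, dst in edges:
    children[src].add(dst)
    parents[dst].add(src)


def reachable(start, neighbours):
    """Return every node reachable from start by following neighbours."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def count_by_type(nodes):
    """Count nodes per entity type, e.g. 'donor_3' -> 'donor'."""
    return Counter(name.rsplit("_", 1)[0] for name in nodes)


left_to_right = count_by_type(reachable("donor_1", children))
right_to_left = count_by_type(reachable("sequence_file_1", parents))
print(left_to_right)
print(right_to_left)
```

Walking forward from a single donor only ever reaches that donor's own specimen, cell line and organoid, whereas walking backward from a sequencing file picks up all four donors, specimens, cell lines and organoids.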

mshadbolt commented 4 years ago

actually thinking about it more I guess the way we are doing it is a bit wrong because every donor, specimen and cell line is actually in every sequencing file... hmmm

I am not really sure if we should even accept pooled files like this but wasn't involved in the original wrangling

mshadbolt commented 4 years ago

honestly I think I chose a complicated example and the issues I have surfaced are not to do with the exporter @rolando-ebi

rolando-ebi commented 4 years ago

> do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?

The assumption is from the perspective of the sequencing files, i.e. from each individual assay process. I think things look ok from that perspective, but we might want to review how we want to export.

Or to put another way: given an assay process, export all of the "dependent" experimental metadata
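
As a rough illustration of that rule, the sketch below walks backward from one assay process over `process_link` entries shaped like those in links.json and collects every process it depends on. `upstream_links` and `make_link` are hypothetical names, the IDs are invented, and the exporter's real logic may well differ.

```python
def upstream_links(links, assay_process_id):
    """Return the process links an assay process depends on, assay first.

    `links` is a list of process_link dicts shaped like those in links.json.
    Hypothetical helper; the exporter's real logic may differ.
    """
    # Index: entity id -> the process links that produced it as an output.
    produced_by = {}
    for link in links:
        for out in link["outputs"]:
            produced_by.setdefault(out["output_id"], []).append(link)

    by_id = {link["process_id"]: link for link in links}
    ordered, seen, queue = [], set(), [by_id[assay_process_id]]
    while queue:
        link = queue.pop(0)
        if link["process_id"] in seen:
            continue
        seen.add(link["process_id"])
        ordered.append(link)
        # Follow every input back to whichever process produced it.
        for inp in link["inputs"]:
            queue.extend(produced_by.get(inp["input_id"], []))
    return ordered


def make_link(process_id, inputs, outputs):
    """Build a minimal process_link dict (protocols omitted for brevity)."""
    return {
        "link_type": "process_link",
        "process_id": process_id,
        "inputs": [{"input_type": t, "input_id": i} for t, i in inputs],
        "outputs": [{"output_type": t, "output_id": i} for t, i in outputs],
    }


# Invented IDs: two organoids pooled into one cell suspension, plus an
# unrelated process that should not be exported with this assay.
links = [
    make_link("p_o1", [("cell_line", "c1")], [("organoid", "o1")]),
    make_link("p_o2", [("cell_line", "c2")], [("organoid", "o2")]),
    make_link("p_pool", [("organoid", "o1"), ("organoid", "o2")], [("cell_suspension", "cs")]),
    make_link("p_seq", [("cell_suspension", "cs")], [("sequence_file", "f1")]),
    make_link("p_other", [("cell_line", "c9")], [("organoid", "o9")]),
]

print([link["process_id"] for link in upstream_links(links, "p_seq")])
```

With the toy data this lists `p_seq`, `p_pool`, `p_o1` and `p_o2` and leaves out `p_other`, so both pooled organoids bring their upstream metadata along.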

mshadbolt commented 4 years ago

Yeah, I was thinking of a scenario in which I did a search, saw that there was a donor I was interested in, and looked at the links.json that contains that donor. Currently I might assume that there was data from just 1 donor in the sequence files within that links.json, unless I was really paying attention and noticed that the donor is only linked to 1 organoid, but 4 organoids went into the cell suspension.

but maybe azul has some smart way of interpreting these relationships.

rolando-ebi commented 4 years ago

@mshadbolt I've described how I'd expect links.json to be generated here: https://docs.google.com/drawings/d/1pf2ZlJQ0Ip0nZW5g5nSCLQj4rsHi-ko8EoyWZNbfD4U/edit?usp=sharing

Each colour groups eligible files for a single assay-process/links.json

rays22 commented 4 years ago

> @mshadbolt I've described how I'd expect links.json to be generated here: https://docs.google.com/drawings/d/1pf2ZlJQ0Ip0nZW5g5nSCLQj4rsHi-ko8EoyWZNbfD4U/edit?usp=sharing
>
> Each colour groups eligible files for a single assay-process/links.json

@rolando-ebi 's grouping of the entities looks correct to me.

rays22 commented 4 years ago

> So I don't know if that is an issue or not, do users/other components expect to see all metadata linked from the perspective of the sequencing file? Or from the donor?

I also share @mshadbolt 's concerns about supporting this type of pooled data in the future. I think the interoperability of the data in this set suffers from the experimental design of pooling cell suspensions the way they did. I am not sure if you could link the data from an individual cell to an individual donor from just the metadata in this set. I think you would have to do variant calling and then you may be able to demultiplex the individuals based on sequence polymorphisms. I think it is very likely that correctly displaying all this complexity clearly to the users will be tricky, if not impossible.

mshadbolt commented 4 years ago

When we received cell type annotations for this project, they had somehow demultiplexed the cell suspensions, as the cells were assigned to individual organoids. I am not sure how they did it, and from the metadata that we provide it wouldn't be possible without going through some kind of process as Ray describes above.

@rolando-ebi I agree that I think it would be better to link all dependent entities from the sequence files going backwards as you drew in your diagram.

rays22 commented 4 years ago

I am trying to check the "Kylie James' colon immune project, has lots of files and is both 10x and SS2" submission, but I do not seem to have access:

Access Denied

You cannot access the resource: /submissions/detail?uuid=eff0cb74-4dd5-42f3-beab-7f8b247d3cbc&project=9147fb5e-78cd-41cd-83b9-376c2b293334

Do I need to be added to the wrangler list on this server?

rays22 commented 4 years ago

Hi @rolando-ebi, could you help me check the validation status of the remaining test submission ("Kylie James' colon immune project, has lots of files and is both 10x and SS2")? I do not seem to have access to it.

rays22 commented 4 years ago

Plan for the remaining test submission:

[x] Upload the project (metadata and data files) to staging.
[ ] Check if the PR fixing a blocking issue has been merged before hitting submit.

rays22 commented 4 years ago

I have a valid submission https://ui.ingest.staging.archive.data.humancellatlas.org/submissions/detail?uuid=fc3ae66a-bdc2-43d1-b3dd-e86a9d9c9569&project=e1774cf6-0acb-403a-ac01-a30c01929a3e waiting to be submitted.

rays22 commented 4 years ago

> [ ] Check if the PR fixing a blocking issue has been merged before hitting submit.

@rolando-ebi, please let me know when it is OK for me to proceed with the testing.

clairerye commented 4 years ago

Closing as this will be superseded by the 'dry run' of all MVP datasets.

ebi-ait / hca-ebi-wrangler-central

Upload projects to dev to test exporter #46