ebi-ait / dcp-ingest-central

Central point of access for the Ingestion Service of the HCA DCP
Apache License 2.0

Mapping metadata at the cell level #968

Open arschat opened 1 year ago

arschat commented 1 year ago

As an HCA integration team member, I want to be able to incorporate DCP metadata into my integration analysis, even when I use only the count matrices.

Scripts for this project are uploaded here before a complete first draft.

Acceptance Criteria / Definition of Done

arschat commented 1 year ago

Depending on the dataset there are multiple cases.

We may have sequence files, analysis files, or both. Either kind can be provided per cell suspension (CS) or pooled. Analysis files may also carry a per-cell sample_id (cell metadata, either in a separate file or in the cell index), or no sample_id at all.

  1. Analysis files per CS
  2. Analysis files pooled but sample_id provided
  3. Analysis files pooled but seq files per CS
  4. Only Seq files per CS

For all those cases, we can map the analysis files (and the corresponding cell barcodes) to a cell suspension, and then extract all the other relevant metadata for each cell.

In case 1, we can extract the cell barcodes directly from each analysis file and map them straight to the relevant CS. In case 2, we need to map the provided sample_id to the relevant CS ID that we hold, which gives us cells mapped to CS. In case 3, we must extract the cell barcodes from the fastq files and then map each corresponding cell ID to one specific CS. Case 4 is similar to case 3, but we do not have the list of cell barcodes that will end up in the analysis.
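
Cases 1 and 2 can be sketched in a few lines of Python. This is only an illustration, assuming the barcodes have already been read from the analysis files; the function names, file names and IDs are hypothetical and not part of any existing DCP tooling.

```python
def map_per_cs_files(barcodes_by_file, cs_by_file):
    """Case 1: every analysis file belongs to exactly one cell suspension."""
    pairs = []
    for file_name, barcodes in barcodes_by_file.items():
        cs_id = cs_by_file[file_name]
        # Keep (CS, barcode) pairs: the same barcode can recur in several CS.
        pairs.extend((cs_id, barcode) for barcode in barcodes)
    return pairs


def map_pooled_with_sample_id(sample_id_by_barcode, cs_by_sample_id):
    """Case 2: pooled analysis file, each cell carries a sample_id."""
    pairs, unmapped = [], []
    for barcode, sample_id in sample_id_by_barcode.items():
        cs_id = cs_by_sample_id.get(sample_id)
        if cs_id is None:
            # sample_id has no known CS; report it instead of guessing.
            unmapped.append(barcode)
        else:
            pairs.append((cs_id, barcode))
    return pairs, unmapped
```

Returning the unmapped barcodes separately keeps the manual check visible: a sample_id that does not match any CS ID usually means the author-provided label needs curation.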

Datasets from the following bionetworks were investigated for their availability in this spreadsheet

The Type column holds one of the four availability types listed above. Checking whether a sample_id exists needs more investigation. Likewise, when a project has both pooled files and per-CS files, it may have a separate count matrix for each CS plus one summary count matrix, but this also needs more investigation.

Regarding availability types 3 and 4: cell barcodes can be extracted from the fastq files using the information that we have in our metadata schema (cell barcode). An investigation was done in this dataset, which has both fastq files and analysis files and whose fastq files are not extremely large.

The dataset has 5 different cell suspensions (samples), pooled into 1 analysis file with 20008 cells, and we tried to trace the cell barcodes of that analysis file back to the fastq files. We found that each fastq file yields many more cell barcodes than the analysis file retains, because a large number of cell barcodes are filtered out by the cellranger pipeline.

Screenshot 2023-07-31 at 19.36.42.png

When we keep only the barcodes of each fastq file that reach a minimum number of reads (0, 100, 500, 1000, 1200, 1500, 5000, 10000) and count the number of samples in which each barcode still appears, we get this result.

Screenshot 2023-08-01 at 11 57 10
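
The threshold experiment described above can be sketched as follows. This is a minimal illustration, assuming the reads per barcode have already been counted for each sample's fastq files; the function names and the "at least min_reads" comparison are assumptions, not the exact script that was used.

```python
from collections import Counter


def samples_per_barcode(read_counts_by_sample, min_reads):
    """For each barcode, count the samples where it has >= min_reads reads."""
    presence = Counter()
    for counts in read_counts_by_sample.values():
        for barcode, n_reads in counts.items():
            if n_reads >= min_reads:
                presence[barcode] += 1
    return presence


def sample_count_histogram(read_counts_by_sample, analysis_barcodes, min_reads):
    """Histogram: number of samples -> number of analysis-file barcodes."""
    presence = samples_per_barcode(read_counts_by_sample, min_reads)
    return Counter(presence.get(barcode, 0) for barcode in analysis_barcodes)
```

A barcode that survives in exactly one sample at a given threshold can be assigned to that cell suspension; barcodes found in zero or several samples stay ambiguous at that threshold.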

What we could do from now on:

arschat commented 1 year ago

Post Tony/Gabby meeting actions:

arschat commented 1 year ago

The stats in the spreadsheet were wrong because the types were not prioritised.

| Type | Wrong stats |
| --- | --- |
| 1 | 1 |
| 3 | 30 |
| 4 | 15 |
| more investigation | 12 |
| Not available | 25 |

If a dataset qualifies as both type 1 and type 4, we should show type 1 rather than 4, but the opposite was done. After more investigation on DCP, and not only on ingest, most of the type 4 projects were promoted to a higher type.
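
The prioritisation rule can be sketched as below. It assumes that the type numbers order the priority for every pair (1 before 2 before 3 before 4), which generalises the single 1-versus-4 example given here; the function names and project keys are hypothetical.

```python
from collections import Counter


def prioritised_type(observed_types):
    """Return the highest-priority (numerically lowest) availability type."""
    if not observed_types:
        return None  # no availability type could be assigned
    return min(observed_types)


def tally_types(types_by_project):
    """Count projects per prioritised type."""
    return Counter(prioritised_type(t) for t in types_by_project.values())
```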

Updated stats:

| Type | # projects | Type description |
| --- | --- | --- |
| 1 | 34 | analysis files per CS |
| 2 | 37 | analysis files with sample_id (in separate file or in cell_id) |
| 3 | 2 | analysis files pooled and seq files per CS |
| 4 | 1 | seq files per CS |
| total eligible | 74 | all eligible projects |
| partial | 5 | part of data can be un-pooled |
| cannot unpool | 3 | not enough info for CS info extraction |
| No submission | 13 | no submission in DCP or ingest |

arschat commented 1 year ago

Kidney example of cell metadata spreadsheet

This project was selected because the list of desired metadata from the kidney bionetwork was already known. It is a type 1 project with 6 analysis files and 28 sequence files, all per cell suspension.

The authors mention 23,980 cells in their final count matrix; however, the raw, unfiltered and un-merged count matrices that are publicly available have 121,787 cells. Given that, we expect the final merged objects to carry an identifier before or after the barcode, in order to avoid duplicate barcode IDs. There are currently multiple ways to add this identifier. The following examples come from CxG analysis files:

    • a sample_id added before the barcode: NL2_R_TTTGTCAAGCGATTCT-1, 10X_087_AAACCTGTCCGAATGT-1, 1.1_AAACGAACAACGACAG-1, 0_AAACCCAAGCATTGAA-1
    • a counter added after the barcode: AAACCCAAGCATTGAA-1, AAACCCAAGCATTGAA-2 or AAACCCAAGCGTTCAT-1_1

Since we will not have the final merged analysis file, what we can offer is each specific barcode together with all the corresponding metadata that we may have, including the sample_id (CS, Specimen or Donor). It will then be easy for the end user to match their own identifier against the sample metadata.
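
Splitting such merged identifiers back into prefix, barcode and suffix can be sketched with a regular expression. This is an illustration only: it assumes a 16-nucleotide barcode, as in the examples quoted here, and the function name `parse_cell_id` is hypothetical.

```python
import re

# Optional "<prefix>_", then a 16-nt barcode, then optional "-<n>" and "_<n>".
CELL_ID = re.compile(
    r"(?:(?P<prefix>.+)_)?(?P<barcode>[ACGTN]{16})(?P<suffix>(?:-\d+)?(?:_\d+)?)"
)


def parse_cell_id(cell_id):
    """Split a merged cell identifier into (prefix, barcode, suffix)."""
    match = CELL_ID.fullmatch(cell_id)
    if match is None:
        raise ValueError(f"unrecognised cell identifier: {cell_id!r}")
    return match.group("prefix") or "", match.group("barcode"), match.group("suffix")


# parse_cell_id("NL2_R_TTTGTCAAGCGATTCT-1") -> ("NL2_R", "TTTGTCAAGCGATTCT", "-1")
```

Identifiers that do not fit the pattern raise an error rather than being silently mis-parsed, since each project may use its own convention.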

For this specific project, I extracted all the cell barcodes from each count matrix (one for each of the 6 CS) into a csv file (R script) and then, using the DCP metadata spreadsheet, extracted all the given metadata into this spreadsheet (jupyter notebook).

Note that this project did not contain analysis_file metadata, because of a known problem affecting a number of old datasets, so a quick curation was done for the purpose of this task.

Finally, the procedure is not yet fully automated and the script needs supervision: different objects need different ways of extracting barcodes, metadata needs to be traced back from files to donors, and each project has a different experimental design.

arschat commented 1 year ago

Lung example of cell metadata spreadsheet

Bug: mouse specimens are pooled. The script that extracts metadata needs to be updated; otherwise the metadata upstream of this pooling is not filled in.

arschat commented 1 year ago

Presentation link

Actions after meeting with David, Tony, Gabby:

arschat commented 1 year ago

Created 2 CxG h5ad objects with DCP metadata (LungStromaEmphysema and KidneySexBasedTranscriptome), and a small tutorial on how to cast the file on a CxG instance.

arschat commented 1 year ago

Completed a 3rd CxG h5ad object with DCP metadata (Landscape-ileum-colon). The DCP metadata is missing one donor (104152), two specimens and two cell suspensions. Contacted UCSC about an update, but these objects do not have these metadata.

arschat commented 1 year ago

Found a script that flattens metadata here. It does not allow pooled fields, even for protocols (i.e. multiple enrichment protocols mean multiple rows; we only want this in biomaterial fields).
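
One possible way to handle the protocol case is to collapse the repeated rows into a single row and join the distinct protocol values in one cell. The sketch below is not the script mentioned above; the function name, the field names and the `||` separator are assumptions chosen for illustration.

```python
def collapse_protocol_rows(rows, key_field, protocol_fields, sep="||"):
    """Merge rows sharing key_field, joining distinct protocol values with sep."""
    merged = {}
    for row in rows:
        key = row[key_field]
        if key not in merged:
            merged[key] = dict(row)  # first row for this key is kept as the base
            continue
        target = merged[key]
        for field in protocol_fields:
            current = target.get(field, "")
            seen = current.split(sep) if current else []
            value = row.get(field, "")
            if value and value not in seen:
                seen.append(value)
            target[field] = sep.join(seen)
    return list(merged.values())
```

Only the fields listed in `protocol_fields` are joined; every other column keeps the value from the first row, so biomaterial fields are left untouched.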

Next steps:

  1. Create a repo with current work done

    • [ ] Create a README.md that describes the problem and the solution
    • [ ] Create distinctive scripts with clear input-output arguments that:
      1. Collect the cell barcode information automatically
      2. Assign the cell barcode information to a specific sample_ID
      3. Flatten the DCP metadata spreadsheet
      4. Merge the flattened spreadsheet with the sample_ID information
    • [ ] Provide clear instructions to run the scripts
  2. Further steps

    • [ ] Investigate if Amnon's script (mentioned before) can include pooled cases
    • Automate the cell_barcode extraction for:
      • [ ] h5ad
      • [ ] Seurat (rds or RData)
      • [ ] SingleCellExperiment objects (rds or RData)
      • [ ] h5
      • [ ] tsv, csv count matrix
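
As a starting point for the simplest item above (tsv or csv count matrix) and for the final merge step, a minimal sketch could look like this. It assumes a genes-by-cells matrix whose header row holds the barcodes after a first gene column, and that the flattened metadata is already keyed by cell suspension ID; all names are hypothetical.

```python
import csv
import io


def barcodes_from_count_matrix(handle, delimiter=","):
    """Read cell barcodes from the header row of a genes x cells matrix."""
    header = next(csv.reader(handle, delimiter=delimiter))
    return header[1:]  # first column is assumed to be the gene identifier


def attach_metadata(cs_barcode_pairs, metadata_by_cs):
    """Build one row per cell: CS ID, barcode, then the flattened CS metadata."""
    rows = []
    for cs_id, barcode in cs_barcode_pairs:
        row = {"cell_suspension_id": cs_id, "barcode": barcode}
        row.update(metadata_by_cs.get(cs_id, {}))
        rows.append(row)
    return rows


# Usage with an in-memory tab-separated matrix standing in for a real file.
matrix = io.StringIO("gene\tAAA-1\tCCC-1\nG1\t0\t3\n")
barcodes = barcodes_from_count_matrix(matrix, delimiter="\t")
cells = attach_metadata([("CS1", b) for b in barcodes], {"CS1": {"donor": "D1"}})
```

The other object types in the list (h5ad, Seurat, SingleCellExperiment, h5) need their own readers; only the merge step would stay the same.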