Mapping metadata at the cell level

arschat commented 1 year ago

As an HCA integration team member, I want to be able to incorporate DCP metadata into my integration analysis, even when I use only the count matrices.

Scripts for this project are uploaded here before a complete first draft.

Acceptance Criteria / Definition of Done

Given that I have a DCP project that contains sequence files (and maybe pooled analysis files)
When I run the script
Then I should get a table of barcodes mapped to fastq file and/or cell suspension

arschat commented 1 year ago

Depending on the dataset there are multiple cases.

We may have sequence files and/or analysis files. Both can be per cell_suspension (CS) or pooled. Analysis files may also contain a per-cell sample_id (as cell metadata in a separate file, or embedded in the cell index) or no sample_id at all.

1. Analysis files per CS
2. Analysis files pooled but sample_id provided
3. Analysis files pooled but seq files per CS
4. Only Seq files per CS

For all those cases, we can map the analysis files (and the corresponding cell barcodes) to a cell suspension, and then extract all the other relevant metadata for each cell.

In case 1, we can extract the cell barcodes from each analysis file and map them directly to the relevant CS. In case 2, we need to map the provided sample_id to the relevant CS ID that we have, which gives us cells mapped to CS. In case 3, we must extract the cell barcodes from the fastq files and then map each corresponding cell ID to one specific CS. Case 4 is similar to case 3, but we don't have the list of cell barcodes that will be incorporated.
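A minimal sketch of cases 1 and 2, assuming the barcodes have already been read from the analysis files. The function names, the dictionary layout and the example IDs are all hypothetical and are not taken from the actual scripts.

```python
def map_per_cs_files(barcodes_by_file, cs_by_file):
    """Case 1: every analysis file belongs to exactly one cell suspension."""
    rows = []
    for file_name, barcodes in barcodes_by_file.items():
        cs_id = cs_by_file[file_name]
        for barcode in barcodes:
            rows.append({
                "barcode": barcode,
                "analysis_file": file_name,
                "cell_suspension": cs_id,
            })
    return rows


def map_by_sample_id(sample_id_by_barcode, cs_by_sample_id):
    """Case 2: pooled file; translate the authors' sample_id into our CS ID."""
    rows = []
    for barcode, sample_id in sample_id_by_barcode.items():
        rows.append({
            "barcode": barcode,
            "sample_id": sample_id,
            # None marks a sample_id that could not be matched to any CS.
            "cell_suspension": cs_by_sample_id.get(sample_id),
        })
    return rows


# Hypothetical example data.
case1 = map_per_cs_files(
    {"cs_a.h5": ["AAAC-1", "AAAG-1"], "cs_b.h5": ["AAAC-1"]},
    {"cs_a.h5": "CS_A", "cs_b.h5": "CS_B"},
)
case2 = map_by_sample_id(
    {"S1_AAAC-1": "S1", "S9_AAAG-1": "S9"},
    {"S1": "CS_A"},
)
```

Unmatched sample IDs are kept with an empty CS rather than dropped, so they can be reviewed by hand.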

Datasets from the following bionetworks were investigated for their availability in this spreadsheet:

Kidney
Gut
Eye
Lung

The Type column is one of the 4 types of availability listed above. Checking whether a sample_id exists needs more investigation. Likewise, when we have both pooled files AND per-CS files, it is possible that we have separate count matrices for each CS plus 1 summary count matrix, but this also needs more investigation.

Concerning availability types 3 and 4: cell barcode extraction from fastq files can be done using the information that we have in our metadata schema (cell barcode). An investigation was done on this dataset (which has both fastq files and analysis files, and whose fastq files are not extremely large).

We have 5 different cell suspensions (samples) and 1 pooled analysis file with 20,008 cells, and we tried to identify the cell barcodes of the analysis file in the fastq files of each suspension. In our investigation we found that each fastq file yields multiple cell barcodes that are not in the analysis file, because a large number of cell barcodes are filtered out in the cellranger pipeline.

[Screenshot 2023-07-31 at 19.36.42.png]

When we filter the cell barcodes of each fastq by a minimum number of reads (0, 100, 500, 1000, 1200, 1500, 5000, 10000) and count the number of samples in which each barcode still exists, we get this result.

What we could do from now on:

[ ] find the highest read-count filter that does not leave any cell with 0 mappings, and mark as ambiguous the remaining cells that do not have a unique mapping.
[ ] try to reproduce the filtering that cell ranger executes
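A rough sketch of the first action, assuming the per-barcode read counts have already been computed for each fastq/sample. The data layout, function names and numbers below are hypothetical; this is not the cellranger filtering itself.

```python
from collections import Counter


def samples_per_barcode(reads_by_sample, analysis_barcodes, min_reads):
    """For each analysis barcode, list the samples where it passes the filter."""
    hits = {}
    for barcode in analysis_barcodes:
        hits[barcode] = sorted(
            sample
            for sample, reads in reads_by_sample.items()
            if barcode in reads and reads[barcode] >= min_reads
        )
    return hits


def summarise(reads_by_sample, analysis_barcodes, thresholds):
    """Count unmapped, uniquely mapped and ambiguous barcodes per threshold."""
    summary = {}
    for threshold in thresholds:
        hits = samples_per_barcode(reads_by_sample, analysis_barcodes, threshold)
        counts = Counter(len(samples) for samples in hits.values())
        summary[threshold] = {
            "unmapped": counts[0],
            "unique": counts[1],
            "ambiguous": sum(n for k, n in counts.items() if k > 1),
        }
    return summary


def best_threshold(summary):
    """Highest threshold that leaves no barcode unmapped, or None."""
    usable = [t for t, stats in summary.items() if stats["unmapped"] == 0]
    return max(usable) if usable else None


# Hypothetical read counts for two cell suspensions.
reads_by_sample = {
    "CS1": {"AAAC": 5000, "AAAG": 120, "AAAT": 2000},
    "CS2": {"AAAC": 150, "AAAG": 1500, "AAAT": 1800},
}
summary = summarise(reads_by_sample, ["AAAC", "AAAG", "AAAT"], [0, 100, 500, 1000, 5000])
chosen = best_threshold(summary)
```

Barcodes that still map to more than one sample at the chosen threshold would go into the ambiguous output.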

arschat commented 1 year ago

Post Tony/Gabby meeting actions:

[x] create a kidney and a lung example of flattened metadata to the barcode level based on this template
[ ] execute the code for the 30+15 datasets of type 3 and 4 availability and extract the barcodes/CS into separate files (precise and ambiguous)

arschat commented 1 year ago

Stats in spreadsheet were wrong, due to no type prioritisation.

Type	Wrong stats
1	1
3	30
4	15
more investigation	12
Not available	25

If a dataset is in both type 1 and type 4, we should show type 1 instead of 4; the opposite was done. After more investigation on DCP, and not only on ingest, most of the type 4 projects were promoted to a higher type.
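The prioritisation can be sketched as follows. Only "type 1 wins over type 4" is stated above; treating the lowest type number as the highest priority in general is an assumption, and the function names and project IDs are hypothetical.

```python
from collections import Counter

VALID_TYPES = (1, 2, 3, 4)


def assign_type(observed_types):
    """Return the highest-priority type seen for a project, or None.

    Assumption: a lower type number always takes priority.
    """
    valid = [t for t in observed_types if t in VALID_TYPES]
    return min(valid) if valid else None


def tally(projects):
    """Count projects per assigned type (None = no eligible type)."""
    return Counter(assign_type(types) for types in projects.values())


# Hypothetical projects and the types observed for each.
projects = {"project_a": [1, 4], "project_b": [3], "project_c": []}
stats = tally(projects)
```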

Updated stats:

Type	# projects	Type description
1	34	analysis files per CS
2	37	analysis files with sample_id (in separate file or in cell_id)
3	2	analysis files pooled and seq files per CS
4	1	seq files per CS
*total eligible*	74	*all eligible projects*
partial	5	part of data can be un-pooled
cannot unpool	3	not enough info for CS info extraction
No submission	13	no submission in DCP or ingest

arschat commented 1 year ago

Kidney example of cell metadata spreadsheet

This project was selected because the kidney bionetwork's list of desired metadata was already known. It is a type 1 project with 6 analysis files and 28 sequence files, all per cell suspension.

The authors mention 23,980 cells in their final count matrix; however, the raw, unfiltered and un-merged count matrices that are publicly available have 121,787 cells. Given that, we expect the final merged objects to have an identifier before or after the barcode, in order to avoid duplicate barcode IDs. There are currently multiple ways to add this identifier: for example, a sample_id added before the barcode (NL2_R_TTTGTCAAGCGATTCT-1, 10X_087_AAACCTGTCCGAATGT-1, 1.1_AAACGAACAACGACAG-1, 0_AAACCCAAGCATTGAA-1), or a counter added after the barcode (AAACCCAAGCATTGAA-1, AAACCCAAGCATTGAA-2 or AAACCCAAGCGTTCAT-1_1). These examples are from CxG analysis files. Since we will not have the final merged analysis file, what we can offer is a specific barcode with all the corresponding metadata that we may have, including the sample_id (either CS, Specimen or Donor). It should then be easy for the end user to match their identifier with one of the sample metadata fields.
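As an illustration of how an end user might separate the identifier from the barcode, here is a small sketch. It assumes 16-nucleotide barcodes with an optional `-N` suffix, which fits the examples above but will not cover every dataset; the pattern and the `split_cell_id` helper are hypothetical. Note that a counter written as `-2` cannot be told apart from the barcode's own numeric suffix.

```python
import re

# Assumption: a 16-nt barcode, optional "-N" suffix, optional "_" or "-"
# separated prefix, optional "_" separated trailing counter.
CELL_ID = re.compile(
    r"(?:(?P<prefix>.+?)[_-])?"
    r"(?P<core>[ACGT]{16})"
    r"(?:-(?P<gem>\d+))?"
    r"(?:_(?P<suffix>.+))?"
)


def split_cell_id(cell_id):
    """Split a merged cell ID into prefix, barcode and trailing counter."""
    match = CELL_ID.fullmatch(cell_id)
    if match is None:
        return None  # unknown layout, needs manual inspection
    barcode = match.group("core")
    if match.group("gem") is not None:
        barcode += "-" + match.group("gem")
    return {
        "prefix": match.group("prefix"),
        "barcode": barcode,
        "suffix": match.group("suffix"),
    }


parsed = [
    split_cell_id(cell_id)
    for cell_id in (
        "NL2_R_TTTGTCAAGCGATTCT-1",
        "1.1_AAACGAACAACGACAG-1",
        "AAACCCAAGCGTTCAT-1_1",
    )
]
```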

For this specific project, I extracted all the cell_barcodes from each count matrix (one for each of the 6 CS) into a CSV file (R script), and using the DCP metadata spreadsheet, I extracted all the given metadata into this spreadsheet (Jupyter notebook).

It has to be noted that the given project did not contain analysis_file metadata due to a known problem in a specific number of old datasets, and a fast curation was done for the purpose of this task.

Finally, note that the procedure is not yet fully automated and the script needs supervision (different objects need different ways to extract barcodes, metadata needs to be traced back from files to donors, and each project has a different experimental design).

arschat commented 1 year ago

Lung example of cell metadata spreadsheet

Bug: mouse specimens are pooled. The script that extracts metadata needs to be updated; otherwise the metadata upstream of this pooling is not filled in.

arschat commented 1 year ago

Presentation link

Actions after meeting with David, Tony, Gabby:

[x] Using cellxgene container, create examples of projects with DCP rich metadata (full and slim) to visualise all our metadata on CxG
- [x] Choose 3 datasets, one each from Kidney, Lung and Gut, and generate an h5ad file with the obs DCP metadata
- [ ] Create CxG container to include the 3 datasets
[x] Use User friendly names instead of programmatic

arschat commented 1 year ago

Created 2 CxG h5ad objects with DCP metadata (LungStromaEmphysema and KidneySexBasedTranscriptome), and a small tutorial on how to cast the file on a CxG instance.

arschat commented 1 year ago

Completed 3rd CxG h5ad object with DCP metadata (Landscape-ileum-colon). The DCP metadata is missing one donor (104152), two specimens and two cell suspensions. Contacted UCSC about an update, but these objects do not have these metadata.

arschat commented 1 year ago

Found a script that flattens metadata here. It does not allow pooled fields, even for protocols (i.e. multiple enrichment protocols mean multiple rows). We only want multiple rows for biomaterial fields.
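A minimal sketch of the behaviour we want, as opposed to what the existing script does: multi-valued protocol fields stay pooled in one cell, while multi-valued biomaterial fields produce one row per value. The `flatten` helper, the `||` separator and the field names are assumptions for illustration, and taking the product over several biomaterial fields is a simplification of real biomaterial lineage.

```python
from itertools import product


def flatten(record, biomaterial_fields, sep="||"):
    """Flatten one record given as {field: [values]} into a list of rows."""
    # Non-biomaterial (e.g. protocol) fields are pooled into a single cell.
    pooled = {
        field: sep.join(values)
        for field, values in record.items()
        if field not in biomaterial_fields
    }
    bio_fields = [field for field in record if field in biomaterial_fields]
    rows = []
    # One row per combination of biomaterial values.
    for combo in product(*(record[field] for field in bio_fields)):
        row = dict(pooled)
        row.update(zip(bio_fields, combo))
        rows.append(row)
    return rows


# Hypothetical record with two specimens and two enrichment protocols.
record = {
    "specimen_id": ["spec_1", "spec_2"],
    "enrichment_protocol_id": ["facs", "macs"],
}
rows = flatten(record, biomaterial_fields={"specimen_id"})
```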

Next steps:

Create a repo with current work done
- [ ] Create a README.md that describes the problem and the solution
- [ ] Create distinctive scripts with clear input-output arguments that:
  1. Automatically collect the cell barcode information
  2. Assign the cell barcode information to a specific sample_ID
  3. Flatten the DCP metadata spreadsheet
  4. Merge the flattened spreadsheet with the sample_ID information
- [ ] Provide clear instructions to run the scripts
Further steps
- [ ] Investigate if Amnon's script (mentioned before) can include pooled cases
- Automate the cell_barcode extraction for:
  - [ ] h5ad
  - [ ] Seurat (rds or RData)
  - [ ] SingleCellExperiment objects (rds or RData)
  - [ ] h5
  - [ ] tsv, csv count matrix
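For the simplest item in the list, the tsv/csv count matrix, barcode extraction could look roughly like this. It assumes genes in rows and cells in columns, with the barcodes in the header row after a gene column; matrices with the opposite orientation would need the first column instead. The function names are hypothetical.

```python
import csv
import io


def delimiter_for(path):
    """Pick the delimiter from the file extension (assumption: .tsv or .csv)."""
    return "\t" if path.endswith(".tsv") else ","


def barcodes_from_matrix(handle, delimiter=","):
    """Return the cell barcodes from the header row of a count matrix."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    return header[1:]  # skip the gene column


# Hypothetical matrix content; a real file would be opened with open(path, newline="").
example = "gene\tAAAC-1\tAAAG-1\nGENE1\t0\t3\nGENE2\t5\t0\n"
barcodes = barcodes_from_matrix(io.StringIO(example), delimiter_for("counts.tsv"))
```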

ebi-ait / dcp-ingest-central

Mapping metadata at the cell level #968