Open arschat opened 1 year ago
Depending on the dataset there are multiple cases.
We may have sequence files and/or analysis files. Both can be per cell_suspension or pooled. Analysis files may also contain a per-cell sample_id (cell metadata, either in a separate file or in the cell index) or no sample_id at all.
For all those cases, we can map the analysis files (and the corresponding cell barcodes) to a cell suspension, and then extract all the other relevant metadata for each cell.
In case 1, we can extract the cell barcodes directly from each analysis file and map them to the relevant CS. In case 2, we need to map the provided sample_id to the relevant CS ID that we have, which gives us cells mapped to CS. In case 3, we must extract the cell barcodes from the fastq files and then map each cell ID to one specific CS. Case 4 is similar to case 3, but we don't have the list of cell barcodes that will be incorporated.
Datasets from the following bionetworks were investigated for their availability in this spreadsheet
Column Type is one of the aforementioned 4 types of availability.
More investigation is needed to check whether a sample_id exists. Likewise, when we have pooled files AND per CS files, it is possible that we have separate count matrices for each CS plus 1 summary count matrix, but this also needs more investigation.
Concerning availability types 3 and 4: cell barcode extraction from fastq files can be done using the information that we have in our metadata schema (cell barcode). An investigation was done on this dataset (which has both fastq files and analysis files, and whose fastq files are not extremely large).
We have 5 different cell suspensions (samples), and we pooled their barcodes in order to identify the cell barcodes of the 1 analysis file, which has 20008 cells. In our investigation we found that we get many more cell barcodes from each fastq file, because a large number of cell barcodes are filtered out in the cellranger pipeline. When we filter the cell barcodes of each fastq file with a minimum number of reads of 0, 100, 500, 1000, 1200, 1500, 5000 or 10000 and count the number of samples in which each barcode still exists, we get this result.
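The threshold comparison described above could be sketched roughly as follows. This is not the script that was actually run. It assumes the per-sample read counts per barcode have already been extracted from the fastq files into a nested dict, and that a barcode is kept when its read count is strictly greater than the threshold. The function names and the data layout are hypothetical.

```python
from collections import Counter

# Minimum-read thresholds quoted in the text above.
THRESHOLDS = [0, 100, 500, 1000, 1200, 1500, 5000, 10000]


def samples_per_barcode(read_counts, min_reads):
    """Count, for each barcode, how many samples keep it above min_reads.

    read_counts: {sample_id: {barcode: number_of_reads}} (hypothetical layout).
    """
    seen = Counter()
    for per_barcode in read_counts.values():
        for barcode, n_reads in per_barcode.items():
            # Assumption: "kept" means strictly more reads than the threshold.
            if n_reads > min_reads:
                seen[barcode] += 1
    return seen


def summarise(read_counts, analysis_barcodes, thresholds=THRESHOLDS):
    """For each threshold, tally analysis-file barcodes by the number of
    samples in which they still exist (0 means the barcode was lost)."""
    summary = {}
    for threshold in thresholds:
        seen = samples_per_barcode(read_counts, threshold)
        summary[threshold] = Counter(seen.get(bc, 0) for bc in analysis_barcodes)
    return summary
```

A barcode that survives in exactly one sample at a given threshold can be assigned to that cell suspension; barcodes that survive in zero or in several samples stay ambiguous at that threshold.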
What we could do from now on:
Post Tony/Gabby meeting actions:
Stats in spreadsheet were wrong, due to no type prioritisation.
Type | Wrong stats |
---|---|
1 | 1 |
3 | 30 |
4 | 15 |
more investigation | 12 |
Not available | 25 |
If a dataset qualifies as both type 1 and type 4, we should show type 1 instead of 4; the opposite was done. After more investigation on DCP, and not only on ingest, most of the type 4 projects were promoted to a higher type.
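A minimal sketch of the prioritisation rule. The text only states that type 1 wins over type 4; the assumption here is that the same ordering holds for all four types, so the lowest type number is reported. The function name is hypothetical.

```python
def prioritised_type(types):
    """Return the highest-priority availability type a dataset qualifies for.

    Assumption: priority follows the type number (1 before 2 before 3
    before 4). Returns None when the dataset has none of the four types.
    """
    eligible = [t for t in types if t in (1, 2, 3, 4)]
    return min(eligible) if eligible else None
```

For example, a dataset that qualifies as both type 1 and type 4 is reported as type 1.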
Updated stats:
Type | # projects | Type description |
---|---|---|
1 | 34 | analysis files per CS |
2 | 37 | analysis files with sample_id (in separate file or in cell_id) |
3 | 2 | analysis files pooled and seq files per CS |
4 | 1 | seq files per CS |
total eligible | 74 | all eligible projects |
partial | 5 | part of data can be un-pooled |
cannot unpool | 3 | not enough info for CS info extraction |
No submission | 13 | no submission in DCP or ingest |
Kidney example of cell metadata spreadsheet
The specific project was selected due to the already known list of desired metadata from the kidney bionetwork.
It is a type 1 project with 6 analysis files and 28 sequence files, all per Cell Suspension.
The authors mention 23,980 cells in their final count matrix; however, the raw, unfiltered and un-merged count matrices that are publicly available have 121,787 cells. Given that, we expect that the final merged objects will have an identifier before or after the barcode, in order to avoid duplicate barcode IDs.
There are currently multiple ways to add this identifier (i.e. addition of a sample_id before the barcode: `NL2_R_TTTGTCAAGCGATTCT-1`, `10X_087_AAACCTGTCCGAATGT-1`, `1.1_AAACGAACAACGACAG-1`, `0_AAACCCAAGCATTGAA-1`; or addition of a counter after the barcode: `AAACCCAAGCATTGAA-1`, `AAACCCAAGCATTGAA-2` or `AAACCCAAGCGTTCAT-1_1`; these are examples from CxG analysis files).
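To match such identifiers back to a bare barcode, something like the following could work as a starting point. It is only a sketch that covers the example formats listed above. It assumes a 16-nt barcode (typical for 10x, but not guaranteed for every chemistry), and the pattern and function name are hypothetical.

```python
import re

# Assumption: 16-nt A/C/G/T barcode, optional "<prefix>_" before it,
# optional "-<n>" and "_<n>" counters after it.
CELL_ID_RE = re.compile(
    r"^(?:(?P<prefix>.+)_)?"      # optional sample prefix, e.g. "NL2_R" or "1.1"
    r"(?P<sequence>[ACGT]{16})"   # the barcode sequence itself
    r"(?:-(?P<well>\d+))?"        # optional "-1" style counter
    r"(?:_(?P<suffix>\d+))?$"     # optional "_1" style counter
)


def split_cell_id(cell_id):
    """Split a merged-object cell ID into prefix, sequence, well and suffix.

    Parts that are absent are returned as None.
    """
    match = CELL_ID_RE.match(cell_id)
    if match is None:
        raise ValueError(f"unrecognised cell ID format: {cell_id!r}")
    return match.groupdict()
```

IDs that follow another convention raise a `ValueError` instead of being silently mis-parsed, so new formats are noticed and can be added to the pattern.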
Given that we will not have the final merged analysis file, what we can offer is each specific barcode with all the corresponding metadata that we have, including the sample_id (either CS, Specimen or Donor). It will then be easy for the end user to match their identifier with the corresponding sample metadata.
For this specific project, I extracted all the cell_barcodes from each count matrix (one for each of the 6 CS) into a csv file (R script), and using the DCP metadata spreadsheet, I extracted all the given metadata in this spreadsheet (jupyter notebook).
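The shape of that output can be illustrated with a small stdlib-only sketch. This is not the R script or the notebook mentioned above. It assumes the barcodes have already been extracted per cell suspension and the CS-level metadata has already been flattened into one dict per CS. All function names and column names are hypothetical, not DCP schema fields.

```python
import csv


def build_cell_table(barcodes_by_cs, cs_metadata):
    """Return one row per cell: barcode, cell suspension ID and the
    metadata recorded for that cell suspension.

    barcodes_by_cs: {cs_id: [barcode, ...]}
    cs_metadata:    {cs_id: {column: value}}
    """
    rows = []
    for cs_id, barcodes in barcodes_by_cs.items():
        meta = cs_metadata.get(cs_id, {})  # empty if the CS has no metadata
        for barcode in barcodes:
            rows.append({"barcode": barcode, "cell_suspension_id": cs_id, **meta})
    return rows


def write_cell_table(rows, handle):
    """Write the rows as CSV, using the union of all columns as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Because the same barcode can occur in more than one cell suspension, the pair (barcode, cell_suspension_id) identifies a row, not the barcode alone.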
It has to be noted that the given project did not contain analysis_file metadata due to a known problem in a specific number of old datasets, and a fast curation was done for the purpose of this task.
Finally, note that the procedure is not yet fully automated and the script needs supervision (different objects need different ways to extract barcodes, metadata needs to be traced back from files to donors, and each project has a different experimental design).
Lung example of cell metadata spreadsheet
Bug: mouse specimens are pooled. The script that extracts metadata needs to be updated; otherwise metadata upstream of this pooling is not filled in.
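One possible way to handle the pooling, sketched under assumptions: the links between biomaterials are available as a child-to-parents mapping, and pooled values are joined into a single cell with a `||` separator. The separator, the data layout and the function names are all hypothetical and not taken from the current script.

```python
def trace_upstream(entity_id, parents):
    """Return every ancestor of entity_id, following ALL parents so that
    pooled inputs (one child, several parents) are not dropped.

    parents: {child_id: [parent_id, ...]} (hypothetical layout).
    """
    ancestors = []
    queue = list(parents.get(entity_id, []))
    while queue:
        current = queue.pop(0)
        if current not in ancestors:
            ancestors.append(current)
            queue.extend(parents.get(current, []))
    return ancestors


def pooled_field(entity_ids, metadata, field, sep="||"):
    """Join the distinct values of one field across pooled entities."""
    values = []
    for entity_id in entity_ids:
        value = metadata.get(entity_id, {}).get(field)
        if value is not None and value not in values:
            values.append(value)
    return sep.join(values)
```

With this, a cell suspension derived from two pooled specimens gets the metadata of both specimens and both donors instead of empty fields.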
Actions after meeting with David, Tony, Gabby:
Created 2 CxG h5ad objects with DCP metadata (LungStromaEmphysema and KidneySexBasedTranscriptome), and a small tutorial on how to cast the file on a CxG instance.
Completed 3rd CxG h5ad object with DCP metadata (Landscape-ileum-colon). DCP metadata is missing one donor (`104152`), two specimens and two cell suspensions. Contacted USCS for updating but these objects do not have these metadata.
Found a script that flattens metadata here. It does not allow pooled fields, even for protocols (i.e. multiple enrichment protocols mean multiple rows; we only want this for biomaterial fields).
Next steps:
Create a repo with current work done
Further steps
As an HCA integration team member, I want to be able to incorporate DCP metadata into my integration analysis, even when I use only the count matrices.
Scripts for this project are uploaded here before a complete first draft.
Acceptance Criteria / Definition of Done