Naming conventions in the raw reads file.

cakeinspace commented 1 year ago

In the Palantir manuscript it says that three replicates were collected for the HSCs. The raw reads deposited in ENA contain 12 pairs of read files, listed below.

BM1_IGO_07465_1_S58_L007_R1_001.fastq.gz BM1_IGO_07465_1_S58_L007_R2_001.fastq.gz

BM2_IGO_07465_2_S57_L007_R1_001.fastq.gz BM2_IGO_07465_2_S57_L007_R2_001.fastq.gz

BoneMarrow_CD34_1_IGO_07861_1_S1_L001_R1_001.fastq.gz BoneMarrow_CD34_1_IGO_07861_1_S1_L001_R2_001.fastq.gz

BoneMarrow_CD34_3_IGO_07861_3_S3_L002_R1_001.fastq.gz BoneMarrow_CD34_3_IGO_07861_3_S3_L002_R2_001.fastq.gz

BoneMarrow_CD34_4_IGO_07861_4_S4_L002_R1_001.fastq.gz BoneMarrow_CD34_4_IGO_07861_4_S4_L002_R2_001.fastq.gz

BM3_IGO_07465_3_S60_L008_R1_001.fastq.gz BM3_IGO_07465_3_S60_L008_R2_001.fastq.gz

BM2_10x_SI-GA-C12_R1.fastq.gz BM2_10x_SI-GA-C12_R2.fastq.gz

Run4_SI-GA-H11_R1.fastq.gz Run4_SI-GA-H11_R2.fastq.gz

BM2_10x_SI-GA-D12_R1.fastq.gz BM2_10x_SI-GA-D12_R2.fastq.gz

BM2_10x_SI-GA-F12_R1.fastq.gz BM2_10x_SI-GA-F12_R2.fastq.gz

Run5_SI-GA-D10_R1.fastq.gz Run5_SI-GA-D10_R2.fastq.gz

BoneMarrow_CD34_2_IGO_07861_2_S2_L001_R1_001.fastq.gz BoneMarrow_CD34_2_IGO_07861_2_S2_L001_R2_001.fastq.gz

My question is which of these files group together as replicates. Since there are 3 replicates, I am assuming that each replicate was split across multiple experiments. The library names contain values like HS_BM_P3_cells_1 and so on. Can I assume that P1, P2, and P3 are the replicates mentioned? For example, HS_BM_P3_cells has four experiments in the ENA archive. Can I also assume that each experiment was just a uniform mixture of different cell types, and that no experiment was enriched for any fraction? Please let me know, and thanks a lot for your help!

Regards cakeinspace

ManuSetty commented 1 year ago

Thank you for your query. Yes, P1, P2, and P3 do represent the three replicates. There are multiple files within each replicate because we used multiple 10X channels to boost the number of cells per replicate. Each replicate represents CD34+ cells from a different donor. The only enrichment is for CD34, to represent hematopoietic stem and progenitor cells. Each experiment within a replicate should contain a uniform mixture of cells, and any differences are purely technical.
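For anyone sorting through the deposited files, here is a minimal sketch of pairing the R1/R2 files by their shared name prefix. It only uses the two file-name styles shown in the list above; the helper name `pair_reads` is an illustration, not part of Palantir. It does not assign files to P1, P2, or P3, since that mapping has to come from the ENA library names.

```python
import re
from collections import defaultdict

# Matches the read-direction suffix of both naming styles in the list above,
# e.g. "_R1_001.fastq.gz" and "_R1.fastq.gz".
READ_SUFFIX = re.compile(r"_(R[12])(?:_001)?\.fastq\.gz$")


def pair_reads(filenames):
    """Group FASTQ file names into {prefix: {"R1": name, "R2": name}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = READ_SUFFIX.search(name)
        if match is None:
            raise ValueError(f"unrecognised FASTQ name: {name}")
        prefix = name[:match.start()]  # everything before "_R1" / "_R2"
        pairs[prefix][match.group(1)] = name
    return dict(pairs)


# A subset of the deposited file names, one of each naming style.
files = [
    "BM1_IGO_07465_1_S58_L007_R1_001.fastq.gz",
    "BM1_IGO_07465_1_S58_L007_R2_001.fastq.gz",
    "BM2_10x_SI-GA-C12_R1.fastq.gz",
    "BM2_10x_SI-GA-C12_R2.fastq.gz",
]
pairs = pair_reads(files)
```

Each key of `pairs` is then one sequenced library, which can be matched against the library name recorded for that run in ENA.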

cakeinspace commented 1 year ago

Ah cool thanks a lot !!!

cakeinspace commented 1 year ago

Hey, one more question about the replicates. Was sample prep done on all of the P2 cells together, with the cells only split into 6 batches at sequencing, or were the cells split into 6 replicates that were then each prepped and sequenced separately? I am asking because there are around 6000 barcodes in the cell_annotations.tsv file on the HCA database, and each entry has a barcode and a cell_suspension.biomaterial_core.biomaterial_id field. The cell_suspension.biomaterial_core.biomaterial_id field takes on 3 values.

I assembled the raw reads and would now like to figure out the cell annotation that was assigned to each of these cells. The information I have is the barcode sequence and the read files it was assembled from. From what I understand, the same barcode can occur in different samples, so my barcode sequences contain repeats.

I now want to map each barcode + read file pair to the corresponding barcode + cell_suspension id in the cell annotations file that you provide on the HCA database. Thanks a lot for your patience, and sorry to bug you about this.

I find that the barcodes within the HS_BM_P2_cells_1 field probably come from the same sample prep. So my understanding is that blood was drawn from P2 and split into 3 groups. Sample prep was done on each group, and then each group was split into 2 subgroups that were sequenced separately.
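Since the same barcode can appear in more than one sample, one way to do this mapping is to key every cell on the pair (suspension id, barcode) instead of the barcode alone. The sketch below assumes the annotation file is tab-separated with the `barcode` and `cell_suspension.biomaterial_core.biomaterial_id` columns mentioned above, and that barcode strings match exactly between the two sources. The `file_to_suspension` mapping, the toy ids, and the `label` column are hypothetical placeholders; the real read-file-to-suspension mapping would have to come from the ENA/HCA metadata.

```python
import csv
import io

ID_FIELD = "cell_suspension.biomaterial_core.biomaterial_id"


def load_annotations(handle):
    """Index annotation rows by (suspension id, barcode)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row[ID_FIELD], row["barcode"]): row for row in reader}


def annotate(cells, file_to_suspension, annotations):
    """Look up each (read_file, barcode) pair; None if no annotation exists."""
    result = {}
    for read_file, barcode in cells:
        key = (file_to_suspension[read_file], barcode)
        result[(read_file, barcode)] = annotations.get(key)
    return result


# Toy data only: ids, file names, and the "label" column are hypothetical.
toy_tsv = (
    "barcode\t" + ID_FIELD + "\tlabel\n"
    "AAAC\tsuspension_A\tx\n"
    "AAAC\tsuspension_B\ty\n"
)
annotations = load_annotations(io.StringIO(toy_tsv))
file_to_suspension = {"file_1": "suspension_A", "file_2": "suspension_B"}
cells = [("file_1", "AAAC"), ("file_2", "AAAC")]
matched = annotate(cells, file_to_suspension, annotations)
```

Here the repeated barcode `AAAC` resolves to two different annotation rows because the suspension id is part of the key.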

ManuSetty commented 1 year ago

Happy to clarify

"I find that the barcodes within the HS_BM_P2_cells_1 field probably come from the same sample prep. So my understanding is that blood was drawn from P2 and split into 3 groups. Sample prep was done on each group, and then each group was split into 2 subgroups that were sequenced separately."

This is correct, with one clarification: the cells are from bone marrow, not a blood draw. We purchased a frozen vial of CD34+ sorted cells from AllCells. The cells were thawed and brought into suspension. They were then split into 3 groups, which form the technical replicates for P2. The three groups of cells were run on separate 10X channels. Following library prep, each of the three groups was sequenced with two different sequencing barcodes.
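The P2 design described above (3 groups, each on its own 10X channel, each sequenced with 2 sequencing barcodes) gives 3 x 2 = 6 sets of reads. A small sketch of that layout follows; the channel and index labels are hypothetical placeholders, not the real library names.

```python
from collections import defaultdict
from itertools import product

# Hypothetical labels for the 3 10X channels and 2 sequencing barcodes of P2.
channels = ["channel_1", "channel_2", "channel_3"]
seq_barcodes = ["index_1", "index_2"]

# One entry per (channel, sequencing barcode) combination: 3 x 2 = 6.
read_sets = list(product(channels, seq_barcodes))

# Group the read sets by the channel they were prepared on.
by_channel = defaultdict(list)
for channel, seq_barcode in read_sets:
    by_channel[channel].append(seq_barcode)
```

If this reading of the design is right, the two read sets that share a channel come from the same library prep and differ only in the sequencing barcode.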

cakeinspace commented 1 year ago

Nice thanks a lot. Sorry for the confusion.

dpeerlab / Palantir #101