Open yoshihiko1218 opened 2 years ago
Hi @yoshihiko1218,
The "rep2" in metadata indicate the tubes with samples from both "rep1" and "rep2", as in Extended Data Fig. 2a. In the paper, we used histone targets as label for batch correction.
I see. So Rep1 and Rep2 in Fig 1a below are not equal to 1 and 2 in the rep column of Nuclei_metaData.xls, right?
How should they be converted? In more detail, how should I know which cells are under Rep1 and which cells are under Rep2 as indicated in the paper, given information in Nuclei_metaData.xls?
Thank you!
@yoshihiko1218 In Fig. 1e, only cells labelled with "Rep1" and "Rep2" were used to calculate the fraction of cells.
Correction of a typo to the last reply, " The 'rep3' in metadata indicate the tubes with samples from both 'rep1' and 'rep2' ".
The samples were labelled with "Sample Barcodes" (Extended Data Fig. 2a) and multiplexed, you can always identify the origin of samples from the "Sample Barcodes", as described here: https://github.com/cxzhu/Paired-Tag#3-merge-sub-libraries-for-downstream-analysis
I see, thank you! Although only cells labelled with "Rep1" and "Rep2" were used to calculate the fraction of cells, the fig. 1d was generated using all cells, including "Rep1", "Rep2", and "Rep3", is that correct? Is that same for Extended Fig. 3a? The right side of figure only include "Rep1" and "Rep2", but "Rep3" is also included on the left side of the figure?
Yes, fig 1D is generated with all cells. Indeed, only "Rep1" and "Rep2" are biological replicates, and "Rep3" is a technical replicate that has cells originating from both "Rep1" and "Rep2". Since we cannot identify the biological origin in the "Rep3", in both fig. 1e and fig. s3a, only biological replicates 1 and 2 were used to calculate the fractions of cells.
For Fig. S3a, the left-most panel has all the cells since it is the UMAP of all combined (and batch corrected) cells; the right panels (light-blue and pink UMAPs) only have cells can be clearly identifed as "Rep1" or "Rep2", that is, the cells with sample barcodes 01, 02, 04, 05, 07, 08, 10 and 11 in Fig. S2a.
I see. That make sence. Thank you for the clarification!
For the analysis process of DNA, is the data processing done in the following way: "Dimention reduction with all cells" -> "batch removal with histone targets" -> "Separate object by histone targets" -> "Run KNN and Clustering, visualization for each one separately"?
We only perform batch correction for RNA analysis. For each histone modification, they were carried out in a single experiment, and thus, technical biases are minimal. You can optionally perform the correction with the Rep labels "Rep1", "Rep2" and the pseudo replicate "Rep3", but we did not observe significant levels of batch effects between the two biological replicates (in each individual histone mark experiment).
I am not very clear what you are planning to do with process of DNA. In the paper, we performed: (1) cluster of all cells based on RNA (Fig.1 ) (matrix integration -> dimension reduction (PCA) -> batch correction -> KNN&clustering&visualization) (2) visualization of DNA for individual DNA modalities (Fig. 2a) (split cells by histone targets -> dimension reduction (PCA) -> transfer annotation from RNA-based clustering & visualization) and (3) compare of DNA, RNA and integrated clustering (Fig. 2c) (DNA: split cells by histone targets -> dimension reduction (PCA) -> KNN&clustering&visualization, RNA: use the results from clustering of all cells, integrated: split RNA cells by histone targets -> DNA&RNA matrix integration -> dimension reduction (PCA) with integrated matrix -> KNN&clustering&visualization). All these analysis started from pre-cleaned alignment&matrix files as described in the method section.
I hope these information can help, please let me know if you have any additional questions.
I see. That makes sense! Since in the paper, batch effect correction with harmony for DNA profiles was mentioned, but when I applied harmony by histone target, the result differed a lot from Extended Fig 4. I will go without batch correction for DNA then.
For the integrated part, did you also perform batch correction on PCA with the integrated matrix as in the paper? Also, I wonder if you would like to share the code for part of clustering single cells with joint modalities so I can check if my implementation according to the paper is correct.
Also, there's one small question about the visualization. Is Fig. 1f generated with some tool or just with original codes?
Thank you so much!
I have requested access to the archived data from the cold storage of my previous lab. Once mounted, I will upload the codes to http://renlab.sdsc.edu/chz272/Paired-Tag/R_codes/, which should be available before mid next week.
Fig. 1f was generated with IGV.
Thank you so much!!! Just for double check, for the integrated part, did you also perform batch correction on PCA with the integrated matrix as in the paper?
Code uploaded; please find the files from the link in the previous comment.
We did perform batch correction for the integrated matrix (with rep1-3 label); however, since the cells for the same histone marks are generated in the same experiments, we did not observe much difference with and without this step.
I see. Thank you so much!!! I will check the code and let you know in this thread if I have any further questions.
Hi Dr. Zhu,
Currently I am summarizing the information of processing and realized that for combine3 function, I got result of
Paired-seq/Tag Barcode Locator Report: GBM-paired-tag-RNA-RC1_220926_S14_L004
# total raw reads: 175747141
# of full barcoded reads: 133416235
# of DNA reads: 9790446
# of RNA reads: 119044624
# of und reads: 4581165
% of full barcode reads: -0.08%
the full barcode rate should be 133416235/175747141, right? However, the % seems to be incorrect. I checked the original code and I think it is calculated as following:
int aaa = pass*10000/total;
float ratio = (float)aaa / 100;
But I was not able to figure out why this can be negative. Would you like to see what might causing this or is it problem of my own data? Thank you!
Hi @yoshihiko1218,
Yes, the % is incorrect, please ignore it. It should be from the value range problem in C++, but since it does not affect the converting/extracting of reads, we did not change the original code for this part.
I see. Thank you!
Hi Dr. Zhu,
When I check my bam file that is generated by bowtie2, samtools sort, and reachtools rmdup2, I realized that reads with low MAPQ score exist. I wonder did you remove low MAPQ score reads in any of the following steps or just used them?
Best, Yoshii
Hi @yoshihiko1218,
That depends on what histone mark you are using. For some active histone marks, i.e., H3K27ac and H3K4me3, the low MAPQ reads can be removed but for repressive marks especially H3K9me3, lots of the reads do not have high MAPQ (mapped to repeats) but should be kept.
I see. How about H3K27Me3? It is also a repressive mark. If I want to do filtering on MAPQ, I can do it on bam file before generating matrix, right?
Hi Dr. Zhu,
I checked mapq distribution between two different histone, would you recommend removing them before applying analysis?
Thank you!
Hi @yoshihiko1218, this information is available in the Methods section of the paper.
Thank you! This filtering step by MAPQ > 10 is not included in the processing file provided, right? Also, I wonder if you did filtering for RNA, which I also saw some low-quality ones kept even though the ratio is very low.
Hi @yoshihiko1218, no the filtering of MAPQ is processed via samtools and is not included in the processing file and that should be considered based on the types of histone marks you are analyzing.
Hi Dr. Zhu,
I am particularly interested in the image of your article, I am recently reproducing the part of clustering single cells with joint modalities. Your impeccably crafted artwork created from code leaves me in such awe that I am inclined to humbly seek inspiration and knowledge from it. During the process of reproducing the images from your article, I came across this.it is a pity that the R code link you replied to above has failed, I wonder if you would like to upload the file again for me to learn. I would be immensely grateful if you could upload the codes again.
With heartfelt appreciation, Fanqing Luo
Hi @franksheeplaw,
Sorry for the broken link. The scripts are now uploaded to: https://github.com/cxzhu/Paired-Tag/tree/master/rscripts.
Hi Dr. Zhu,
While working on the provided data in original paper(GSE152020), I realized that there are a few different ways to group samples, including sublibraries, replicates(rep1, rep2, and rep3), and original tissue(FC and HC). I wonder for batch removal step in analysis, which variable is treated as the batch? Also, in Nuclei_metaData.xls provided, there are three reps according to rep column. However, in paper, there are only two replicates indicated in figures. Does that mean rep3 were not included in major analysis?
Thank you! Jiayan(Yoshii) Ma