cxzhu / Paired-Tag

Analysis of Paired-Tag datasets
MIT License
39 stars 14 forks source link

Batch for data in paper #16

Open yoshihiko1218 opened 2 years ago

yoshihiko1218 commented 2 years ago

Hi Dr. Zhu,

While working on the provided data in original paper(GSE152020), I realized that there are a few different ways to group samples, including sublibraries, replicates(rep1, rep2, and rep3), and original tissue(FC and HC). I wonder for batch removal step in analysis, which variable is treated as the batch? Also, in Nuclei_metaData.xls provided, there are three reps according to rep column. However, in paper, there are only two replicates indicated in figures. Does that mean rep3 were not included in major analysis?

Thank you! Jiayan(Yoshii) Ma

cxzhu commented 2 years ago

Hi @yoshihiko1218,

The "rep2" in metadata indicate the tubes with samples from both "rep1" and "rep2", as in Extended Data Fig. 2a. In the paper, we used histone targets as label for batch correction.

yoshihiko1218 commented 2 years ago

I see. So Rep1 and Rep2 in Fig 1a below are not equal to 1 and 2 in the rep column of Nuclei_metaData.xls, right? image

How should they be converted? In more detail, how should I know which cells are under Rep1 and which cells are under Rep2 as indicated in the paper, given information in Nuclei_metaData.xls?

Thank you!

cxzhu commented 2 years ago

@yoshihiko1218 In Fig. 1e, only cells labelled with "Rep1" and "Rep2" were used to calculate the fraction of cells.

Correction of a typo to the last reply, " The 'rep3' in metadata indicate the tubes with samples from both 'rep1' and 'rep2' ".

The samples were labelled with "Sample Barcodes" (Extended Data Fig. 2a) and multiplexed, you can always identify the origin of samples from the "Sample Barcodes", as described here: https://github.com/cxzhu/Paired-Tag#3-merge-sub-libraries-for-downstream-analysis

yoshihiko1218 commented 2 years ago

I see, thank you! Although only cells labelled with "Rep1" and "Rep2" were used to calculate the fraction of cells, the fig. 1d was generated using all cells, including "Rep1", "Rep2", and "Rep3", is that correct? Is that same for Extended Fig. 3a? The right side of figure only include "Rep1" and "Rep2", but "Rep3" is also included on the left side of the figure?

cxzhu commented 2 years ago

Yes, fig 1D is generated with all cells. Indeed, only "Rep1" and "Rep2" are biological replicates, and "Rep3" is a technical replicate that has cells originating from both "Rep1" and "Rep2". Since we cannot identify the biological origin in the "Rep3", in both fig. 1e and fig. s3a, only biological replicates 1 and 2 were used to calculate the fractions of cells.

For Fig. S3a, the left-most panel has all the cells since it is the UMAP of all combined (and batch corrected) cells; the right panels (light-blue and pink UMAPs) only have cells can be clearly identifed as "Rep1" or "Rep2", that is, the cells with sample barcodes 01, 02, 04, 05, 07, 08, 10 and 11 in Fig. S2a.

yoshihiko1218 commented 2 years ago

I see. That make sence. Thank you for the clarification!

For the analysis process of DNA, is the data processing done in the following way: "Dimention reduction with all cells" -> "batch removal with histone targets" -> "Separate object by histone targets" -> "Run KNN and Clustering, visualization for each one separately"?

cxzhu commented 2 years ago

We only perform batch correction for RNA analysis. For each histone modification, they were carried out in a single experiment, and thus, technical biases are minimal. You can optionally perform the correction with the Rep labels "Rep1", "Rep2" and the pseudo replicate "Rep3", but we did not observe significant levels of batch effects between the two biological replicates (in each individual histone mark experiment).

I am not very clear what you are planning to do with process of DNA. In the paper, we performed: (1) cluster of all cells based on RNA (Fig.1 ) (matrix integration -> dimension reduction (PCA) -> batch correction -> KNN&clustering&visualization) (2) visualization of DNA for individual DNA modalities (Fig. 2a) (split cells by histone targets -> dimension reduction (PCA) -> transfer annotation from RNA-based clustering & visualization) and (3) compare of DNA, RNA and integrated clustering (Fig. 2c) (DNA: split cells by histone targets -> dimension reduction (PCA) -> KNN&clustering&visualization, RNA: use the results from clustering of all cells, integrated: split RNA cells by histone targets -> DNA&RNA matrix integration -> dimension reduction (PCA) with integrated matrix -> KNN&clustering&visualization). All these analysis started from pre-cleaned alignment&matrix files as described in the method section.

I hope these information can help, please let me know if you have any additional questions.

yoshihiko1218 commented 2 years ago

I see. That makes sense! Since in the paper, batch effect correction with harmony for DNA profiles was mentioned, but when I applied harmony by histone target, the result differed a lot from Extended Fig 4. I will go without batch correction for DNA then.

For the integrated part, did you also perform batch correction on PCA with the integrated matrix as in the paper? Also, I wonder if you would like to share the code for part of clustering single cells with joint modalities so I can check if my implementation according to the paper is correct.

Also, there's one small question about the visualization. Is Fig. 1f generated with some tool or just with original codes?

Thank you so much!

cxzhu commented 2 years ago

I have requested access to the archived data from the cold storage of my previous lab. Once mounted, I will upload the codes to http://renlab.sdsc.edu/chz272/Paired-Tag/R_codes/, which should be available before mid next week.

Fig. 1f was generated with IGV.

yoshihiko1218 commented 2 years ago

Thank you so much!!! Just for double check, for the integrated part, did you also perform batch correction on PCA with the integrated matrix as in the paper?

cxzhu commented 2 years ago

Code uploaded; please find the files from the link in the previous comment.

We did perform batch correction for the integrated matrix (with rep1-3 label); however, since the cells for the same histone marks are generated in the same experiments, we did not observe much difference with and without this step.

yoshihiko1218 commented 2 years ago

I see. Thank you so much!!! I will check the code and let you know in this thread if I have any further questions.

yoshihiko1218 commented 2 years ago

Hi Dr. Zhu,

Currently I am summarizing the information of processing and realized that for combine3 function, I got result of

Paired-seq/Tag Barcode Locator Report: GBM-paired-tag-RNA-RC1_220926_S14_L004
# total raw reads:              175747141
# of full barcoded reads:       133416235
# of DNA reads:                 9790446
# of RNA reads:                 119044624
# of und reads:                 4581165
% of full barcode reads:        -0.08%

the full barcode rate should be 133416235/175747141, right? However, the % seems to be incorrect. I checked the original code and I think it is calculated as following:

int aaa = pass*10000/total;
float ratio = (float)aaa / 100;

But I was not able to figure out why this can be negative. Would you like to see what might causing this or is it problem of my own data? Thank you!

cxzhu commented 2 years ago

Hi @yoshihiko1218,

Yes, the % is incorrect, please ignore it. It should be from the value range problem in C++, but since it does not affect the converting/extracting of reads, we did not change the original code for this part.

yoshihiko1218 commented 2 years ago

I see. Thank you!

yoshihiko1218 commented 2 years ago

Hi Dr. Zhu,

When I check my bam file that is generated by bowtie2, samtools sort, and reachtools rmdup2, I realized that reads with low MAPQ score exist. I wonder did you remove low MAPQ score reads in any of the following steps or just used them?

Best, Yoshii

cxzhu commented 2 years ago

Hi @yoshihiko1218,

That depends on what histone mark you are using. For some active histone marks, i.e., H3K27ac and H3K4me3, the low MAPQ reads can be removed but for repressive marks especially H3K9me3, lots of the reads do not have high MAPQ (mapped to repeats) but should be kept.

yoshihiko1218 commented 2 years ago

I see. How about H3K27Me3? It is also a repressive mark. If I want to do filtering on MAPQ, I can do it on bam file before generating matrix, right?

yoshihiko1218 commented 1 year ago

Hi Dr. Zhu,

I checked mapq distribution between two different histone, would you recommend removing them before applying analysis? image image

Thank you!

cxzhu commented 1 year ago

Hi @yoshihiko1218, this information is available in the Methods section of the paper.

yoshihiko1218 commented 1 year ago

Thank you! This filtering step by MAPQ > 10 is not included in the processing file provided, right? Also, I wonder if you did filtering for RNA, which I also saw some low-quality ones kept even though the ratio is very low. image

cxzhu commented 1 year ago

Hi @yoshihiko1218, no the filtering of MAPQ is processed via samtools and is not included in the processing file and that should be considered based on the types of histone marks you are analyzing.

franksheeplaw commented 1 year ago

Hi Dr. Zhu,

I am particularly interested in the image of your article, I am recently reproducing the part of clustering single cells with joint modalities. Your impeccably crafted artwork created from code leaves me in such awe that I am inclined to humbly seek inspiration and knowledge from it. During the process of reproducing the images from your article, I came across this.it is a pity that the R code link you replied to above has failed, I wonder if you would like to upload the file again for me to learn. I would be immensely grateful if you could upload the codes again.

With heartfelt appreciation, Fanqing Luo

cxzhu commented 1 year ago

Hi @franksheeplaw,

Sorry for the broken link. The scripts are now uploaded to: https://github.com/cxzhu/Paired-Tag/tree/master/rscripts.