Broad cell type clustering from TE counts alone

Hi, thanks for such a great tool.

I have a scATAC-seq dataset comprised of 65 cancer samples. 155,329 of these cells have TE counts from running the scTE pipeline.

As a sanity check, I wanted to see if clustering on TE counts alone lined up with broad cell types annotated previously using peak-level and gene-level accessibility.

I did not see any clustering by cell type, and it almost seemed like the features were entirely uninformative of cell type. Below is the code I used

Add scte counts matrix to seurat object and normalize

allcells_scte[['scte']] <- CreateAssayObject(counts = scte_results)
allcells_scte <- NormalizeData(
  object = allcells_scte,
  assay = 'scte',
  normalization.method = 'LogNormalize',
    scale.factor = median(allcells_scte$nCount_scte))

Scale normalized data and run PCA using all features

DefaultAssay(allcells_scte) = 'scte'
all.genes <- rownames(allcells_scte)
allcells_scte <- ScaleData(allcells_scte, features = all.genes)
allcells_scte = RunPCA(allcells_scte, features = row.names(allcells_scte@assays$scte@counts), reduction.name = 'pca_scte')

Create NN graph, run clustering algorithm, UMAP

allcells_scte <- FindNeighbors(allcells_scte, dims = 1:30, reduction = 'pca_scte')
allcells_scte <- FindClusters(allcells_scte, resolution = 0.1)
allcells_scte <- RunUMAP(allcells_scte, dims = 1:30,reduction = 'pca_scte',reduction.name = 'umap_scte')

Visualize UMAP, colored by previous cell type annotation

DimPlot(allcells_scte, group.by = 'broad_celltype_excluded', reduction = 'umap_scte')

For reference, this is the UMAP created from peak accessibility counts and colored by broad cell type.

I also tried using the scTE counts from just one sample, attaching that counts matrix here in case you recognize any obvious issues.

Read in TE counts, create seurat object, normalize counts matrix

counts = read.table('data/inhouse_wu_long_yu/scte/preprocessing/scte_rawoutput/0600909_T1_GRCh38-2020-A-2.0.0.csv', sep = ',', header = TRUE)
obj = CreateSeuratObject(counts = t(counts))
obj <- NormalizeData(obj, normalization.method = "LogNormalize", scale.factor = median(obj@meta.data$nCount_RNA))

Scale matrix, run PCA, NN graph, clustering, UMAP

all.genes <- rownames(obj)
obj <- ScaleData(obj, features = all.genes)
obj <- RunPCA(obj, features = all.genes)
obj <- FindNeighbors(obj, dims = 1:20)
obj <- FindClusters(obj, resolution = 0.5)
obj <- RunUMAP(obj, dims = 1:20)
DimPlot(obj, reduction = "umap", group.by = 'broad_celltype_excluded')

Again, there is no localization or clustering by broad cell type. Below is the UMAP from the peak accessibility counts annotated by broad cell type. 0600909_T1_GRCh38-2020-A-2.0.0.csv.zip

Any idea on what could be going wrong? I would appreciate any and all advice!

Best, Sabrina

In the above analysis, we used the CellRanger ATAC count produced BAM, possorted_bam.bam, as the input to scTEATAC.

After troubleshooting, we realized that CellRanger ATAC does not appear to set a maximum fragment length for paired-end alignments. The authors used bowtie2 with parameter -X2000 which sets that maximum distance to 2000 base pairs. Since these extremely long likely artifactual fragments overlap several TE locations, if they are not filtered they contribute a large share of TE counts for a given barcode. This results in TE counts not representing meaningful biological variation between cells.

@chenyenchung forked this repository and added a function allowing 10x-style barcoded BAMs to be used with scTEATAC without needing to incorporate the barcode into the read name beforehand. We forked this repository and included another function to filter the BED file to remove fragments over a given length. After re-running scTEATAC with this filter and processing the counts matrix, we saw clear clustering by broad cell type.

JiekaiLab / scTE