JiekaiLab / scTE

MIT License
96 stars 27 forks source link

No mitochondrial gene detected after running scTE #99

Open synnimeng opened 1 month ago

synnimeng commented 1 month ago

Hi! I really appreciate for developing this powerful tool. I met a puzzling event - there were no any MT-gene detected in some of my scTE csv files, which could be found in starsolo matrix. Why did that append? And how can I fix it? It's hard for me to do downstream quality control without MT-genes.

synnimeng commented 1 month ago

Another question to bother you with: I wonder how you map reads to Genes and why the result is different from starsolo gene expression matrix. Would it have a negative effect on my analysis? As a beginner, I look forward to your response.

jphe commented 1 month ago
  1. Please check if your GTF file includes MT genes as scTE extracts gene names from this file, please using an ENCODE-formatted GTF document.
  2. RNA sequencing analysis provides relative quantification, and discrepancies in mapping reads to genes between tools are common. These differences shouldn't significantly impact your analysis results.
synnimeng commented 1 month ago

@jphe Thanks for your early reply! I built genome indices using the command as recommended:

$ scTE_build -g hg38 # Human

So the GTF file is: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz

There must be MT genes in GTF. Then I checked my result .csv files. I found MT genes in columns, but all these genes express 0 counts, so they were pre-filtered by min.cells = 3 when I created a seurat object.

I also checked the count matrix files generated by STARsolo, where MT genes expression is not 0.

Taken scTE --expect-cells 20000 into consideration, I don't think all the cells in scTE .csv files have such high quality. But I still couldn't figure out what's wrong with it.

synnimeng commented 1 month ago

Now I think I probably found the reason. In my scTE running log files, I found an error as follows:

INFO    : Processing BAM/SAM files ...2024-06-16 14:35:10
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
INFO    : Input SAM/BAM file appears to be valid
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error 4 after 0 of 4 bytes
samtools view: error reading file "/pancreas/clean_bam/PDAC1.clean.bam"
samtools view: error closing "/pancreas/clean_bam/PDAC1.clean.bam": -1
INFO    : Done BAM/SAM files processing ...2024-06-16 15:17:16 

Therefore, there may be something wrong when I filtered the bam files.

st-tky commented 1 month ago

I have the same problem above. Although mitochondrial genes can be detected in cellranger with the same reference, no mitochondrial genes were detected after scTE. In addition, my custom gene "Cre-tdTomato" was also not detected. Does the scTE pipeline have a function that filters out genes with special characters, such as hyphens? I would appreciate it if anyone had troubleshoots.

synnimeng commented 1 month ago

I have the same problem above. Although mitochondrial genes can be detected in cellranger with the same reference, no mitochondrial genes were detected after scTE. In addition, my custom gene "Cre-tdTomato" was also not detected. Does the scTE pipeline have a function that filters out genes with special characters, such as hyphens? I would appreciate it if anyone had troubleshoots.

Hi. I have solved this problem after restoring my input bam files. Please use samtools qucikcheck to check if there's something wrong with your bam file. Before scTE, I used samtools view starsolo.bam -h | awk '/^@/ || /CB:/' | samtools view -h -b > clean.bam to filter out reads with CB. clean.bam was my input for scTE, which met the error I mentioned above. Although scTE could run, the result files were not entire, with some counts not detected. I recommend you check your scTE log file first. Are there any error prompts?

st-tky commented 1 month ago

Thank you for your quick and kind reply. I followed your instructions and found no errors in my scTE log, but mitochondrial genes and my additional custom gene are still missing. I am using bam file created by cellranger, and the seurat object created from h5 file by scTE looks OK except for those missing genes. For my information, did you find mitochondrial genes in your data after scTE by the above workaround?

synnimeng commented 1 month ago

@st-tky I created seurat objects from scTE csv files and I didn't miss any gene annotated in gtf. I think the step for seurat is ok. Can you find "^MT-" or "^mt-" in your gene list? Or there are mitochondrial genes but no counts are detected? If there's no mitochondrial gene in gene list, you could check if your gtf contains mt-genes and "Cre-tdTomato" and if your scTE index is correct.

st-tky commented 1 month ago

@synnimeng Thank you for your kind follow-up. I found not "^mt-" genes in the gene list of my seurat object. I checked my scTE index and my gtf file, and I found "mt-" genes only in my gtf file but not in scTE index. Thus, I assume that mitochondrial genes were lost when creating a custom scTE reference from gtf file and TE bed file. Because I am using custom gtf file created by cellranger-mkref (from fasta file and gtf annotation file), I am wondering which gtf file I should use: gtf file before/after cellranger-mkref, or something else?

synnimeng commented 1 month ago

I'm not very familiar with cellranger-mkref, but you can consider whether gtf file changes before and after this step. If you filter/add some genes in your gtf, you should use the modified gtf.

@synnimeng Thank you for your kind follow-up. I found not "^mt-" genes in the gene list of my seurat object. I checked my scTE index and my gtf file, and I found "mt-" genes only in my gtf file but not in scTE index. Thus, I assume that mitochondrial genes were lost when creating a custom scTE reference from gtf file and TE bed file. Because I am using custom gtf file created by cellranger-mkref (from fasta file and gtf annotation file), I am wondering which gtf file I should use: gtf file before/after cellranger-mkref, or something else?

st-tky commented 1 month ago

I was able to detect mitochondrial genes with default reference. Thus, I believe my workflow to add my custom gene to the reference is the cause. I will continue to work around this point, but I might eventually consider using other pipelines. Anyway, I really appreciate your kind help!

I'm not very familiar with cellranger-mkref, but you can consider whether gtf file changes before and after this step. If you filter/add some genes in your gtf, you should use the modified gtf.

@synnimeng Thank you for your kind follow-up. I found not "^mt-" genes in the gene list of my seurat object. I checked my scTE index and my gtf file, and I found "mt-" genes only in my gtf file but not in scTE index. Thus, I assume that mitochondrial genes were lost when creating a custom scTE reference from gtf file and TE bed file. Because I am using custom gtf file created by cellranger-mkref (from fasta file and gtf annotation file), I am wondering which gtf file I should use: gtf file before/after cellranger-mkref, or something else?