JiekaiLab / scTE

MIT License
87 stars 27 forks source link

Requires python >= 3.9, not 3.6 #89

Open dannyconrad opened 2 months ago

dannyconrad commented 2 months ago

Installed scTE in a Conda environment but kept getting problems until I upgraded to python 3.9 When python=3.6, I couldn't install a recent enough version of scipy (required scipy >= 1.8, but 1.8 depends on python >= 3.8) When python=3.8, scTE wouldn't run at all and kept giving the following error traceback:

/opt/miniconda3/envs/scTE/bin/scTE:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
^[[A^[[B^[DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
DEBUG   : Creating converter from 7 to 5
DEBUG   : Creating converter from 5 to 7
Traceback (most recent call last):
  File "/opt/miniconda3/envs/scTE/bin/scTE", line 4, in <module>
    __import__('pkg_resources').run_script('scTE==1.0', 'scTE')
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/pkg_resources/__init__.py", line 722, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1561, in run_script
    exec(code, namespace, namespace)
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/EGG-INFO/scripts/scTE", line 13, in <module>
    from scTE.base import *
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/scTE-1.0-py3.8.egg/scTE/base.py", line 16, in <module>
    import anndata as ad
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/__init__.py", line 24, in <module>
    from ._core.anndata import AnnData
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/_core/anndata.py", line 31, in <module>
    from .. import utils
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/utils.py", line 13, in <module>
    from ._core.sparse_dataset import BaseCompressedSparseDataset
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/_core/sparse_dataset.py", line 29, in <module>
    from anndata._core.index import _fix_slice_bounds
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/_core/index.py", line 13, in <module>
    from ..compat import AwkArray, DaskArray, Index, Index1D
  File "/opt/miniconda3/envs/scTE/lib/python3.8/site-packages/anndata-0.10.7-py3.8.egg/anndata/compat/__init__.py", line 29, in <module>
    Index = Union[Index1D, tuple[Index1D, Index1D], spmatrix]
TypeError: 'type' object is not subscriptable

This Reddit thread pointed out that the tuple[] syntax only works starting with python=3.9.

OwenHoare1989 commented 1 month ago

Hi @dannyconrad

Your absolutely right. I had exactly the same issue. I don't know if you could help with my below question?

I am just running the scTE pipeline because I am very interested in studying transposable elements (TEs) in scRNA-seq data. Briefly, I used cellranger pipeline with a custome build human genome reference with TEs present in my GTF file. I use almost exactly the same STAR parameters with cellranger as was used in the scTE manuscript using the soloSTAR to allow for multi-mapping.

I used this BAM file from cellranger after allowing for multi-mapping and my generated custome hg38 index file to run the following command below.

scTE -i inp.bam -o out -x mm10.exclusive.idx -CB False -UMI False

The output seems to be only a csv file with genes and TEs. So my question is rather naive but how do I go from this output to a Seurat object. Cellranger outputs 3 files (genes, barcodes and matrix) which is needed to generate a Seurat object using the Seurat R package.

I mean am I supposed to use this 'out.csv' file from the scTE pipeline to merge to my Seurat object. I really don't understand. After alignment with TEs for bulk RNA-seq I can do feature counting and allow for multi-mapping. However, cellranger does not allow for this and this is why I was hoping this pipeline gives an interpretable output usable by Seurat. Am I missing something.

Could you please explain what I am supposed to use from this pipeline to obtain a Seurat object. I am very grateful for any help you can provide. If something is not clear please don't hesitate to ask.

Thanks

dannyconrad commented 1 month ago

Hi @OwenHoare1989, you can definitely create a Seurat object from the output. With the caveat that I never got scTE working with my data, I believe the out.csv is supposed to replace the original Cell Ranger outputs, so you don't merge with an existing Seurat object, instead you create an entirely new one. This thread goes into detail how to do it. Read your CSV into R and prune any genes/cells you don't want, then you can input the matrix into CreateSeuratObject() and proceed with the same analytical steps you would have used with your old Seurat object.

You should also doublecheck your work, unless there was a typo in your comment, because I see that you ran cellranger with a custom hg38 (i.e. human) reference, but your scTE command uses an index labeled mm10 (i.e. mouse). If you mistakenly mixed your reference genomes I don't believe your output data will be interpretable. Also -CB False and -UMI False seem like incorrect choices if you're analyzing 10x scRNA-seq data, but I suppose I don't know what kind of data you're working with.

OwenHoare1989 commented 1 month ago

Hi @dannyconrad

Thanks so much for your prompt reply. This was not the exact code I used. I used one specifically for my analysis using hg38. The code above in my previous message was just an example from the scTE pipeline.

As for converting the out.csv file into a Seurat object thanks I really appreciate you sending me the thread this will be very useful and it's great to know it can be done.

I have 1 final naive question. Is it normal that my out.csv file has only genes and TEs because I would have expected that there would be cell barcodes and UMI counts present in that file?

Also the csv file if I just open it on my computer is going horizontally and not vertically. I would have expected 3 columns in that file such as TEs/genes, cell barcodes and UMI to convert to Seurat. Perhaps I am mistaken.

When I run the code scTE -i inp.bam -o out -x hg38.index.idx -CB False -UMI False. If I don't keep the -CB and -UMI options as false the pipeline doesn't work. Are the UMI and cell barcodes then missing from the file?

Again I use the BAM file from cellranger as my input after multi-mapping. If you could just clarify what the out.csv file should look like that would be really helpful 😊 Thanks again so much for you help.

Kind regards Owen

dannyconrad commented 1 month ago

The out.csv file should be genes+TEs as columns and cell barcodes as rows, with the count of each gene/TE per cell populating the matrix. So yeah if all you see is a row of gene/TE names then its essentially an empty matrix and the analysis failed.

First and foremost would be because you have -CB = False. If you look at the README, setting -CB to False will make scTE try to find the cell barcode in the name of your BAM file, which is only appropriate if you're running multiple BAMs at once where each BAM corresponds to a cell or sample. If you're working with 10x scRNA-seq data, then you'll need to have the -CB flag equal "CB" or "CR", and the -UMI flag equal "UB" or "UR".

I can't really help you troubleshoot this beyond that because as mentioned I never got the RNA version of the pipeline to work (perhaps similar reason to why it was failing for you when -CB wasn't False), and while I got the scTEATAC pipeline to successfully complete, the results seemed meaningless.

OwenHoare1989 commented 1 month ago

Thanks @dannyconrad

That makes sense. I'll try a few things tomorrow but if it doesn't work then I will have to try another tool. Have a good one.

Owen

jphe commented 1 month ago

@dannyconrad You can try with python3.10, it works in my PC

jphe commented 1 month ago

@OwenHoare1989

Yes, as @dannyconrad mentioned above, you can not set the -CB -UMI option to False for scRNA-seq data. If you can not run scTE with the -CB "CB" or "CR", and the -UMI with "UB" or "UR" option correctly, then it means the input bam file has some reads has no cell barcode or UMI tags, you need to remove those reads before run scTE

jphe commented 1 month ago

@OwenHoare1989 You can refer to this issue(https://github.com/JiekaiLab/scTE/issues/5)

OwenHoare1989 commented 1 month ago

Thank you @jphe I will take that into consideration when running the analysis pipeline again next week!. Thanks to all on here and I am very gratefull for your help!!

OwenHoare1989 commented 1 month ago

Hi @dannyconrad and @jphe

I just wanted to say thank you both once again you were absolutely right about the cell barcodes. I filtered out the reads with no cell barcodes or UMIs and the pipleine ran and I could easily convert the hdf5 into a Seurat object. There are plenty of tools that can handle that and thankfully inside my Seurat object there is both genes and TEs, so I am delighted.

I have 1 or 2 last questions. I used cellranger to do the multi-mapping by modifying the toml file inside of the cell ranger pipeline. Here are the parameters I used.

star_parameters="--soloType=Droplet --soloFeatures=Gene --soloCBlen=14 --soloUMIstart=15 --outFilterMultimapNmax=5000 --winAnchorMultimapNmax=5000 --outSAMmultNmax=1 --twopassMode=Basic --readFilesCommand=zcat"

This worked really well and it does allow for multi-mapping. However, I could not get 2 parameters to works in this pipeline that you used to run soloSTAR. The 2 parameters that I decided to drop were --outSAMtype BAM SortedByCoordinate and --outSAMattributes NH HI AS nM CR CY UR UY

I dropped them because the cellranger pipeline kept crashing. However by default cellranger does use only the first 4 and not the last 4 double letters. --outSAMattributes NH HI AS nM

I just wanted to ask if these parameters are critical to use or not. If not then I will continue with my analysis. My first reaction is that they are not critical unless I am misunderstanding something.

My second question is should I extract the multi-mapped reads from the cellranger generated BAM file after allowing for multi-mapping. My understanding is that the scTE pipeline takes care of that?

I would be very grateful for any last comments on this topic.

Here is how I filtered the reads with no barcodes which got the scTE pipeline working

Filter the reads with no barcodes

samtools view possorted_genome_bam.bam -h | awk '/^@/ || /CB:/' | samtools view -h -b > possorted_genome_bam.clean.bam

Here is how I would extract the multi-mapped reads from my cellranger generated BAM file. Of note I have not done this yet because I thought it was unnecessary.

Find multi-mapped reads that map to more than one loci using the NH tag (which gives the number of loci the read can **map to), and also a MAPQ score not equal to 255.

samtools view -h possorted_genome_bam.bam | grep -E "^\@|NH:i:2" | awk 'BEGIN{FS="\t"} $5!=255' > multi_mapped_reads.sam**

Generate a multi-mapped BAM file

samtools view -S -b multi_mapped_reads.sam -o multi_mapped_reads.bam

If it is necessary I assume I would have to extract the multi-mapped reads and re-run the scTE pipeline?

Thanks again.

Owen

jphe commented 1 month ago

Q1: For STARsolo, I would suggest you set the --outSAMattributes All Q2: Yes, you don't need to extract the multi-mapped reads, you just need to make sure that you have kept the multiple mapped reads in the bam file for TE quantiication.

OwenHoare1989 commented 1 month ago

Thanks @jphe that's perfect. I have just recently done it and it is working great!!